The Internet is an international environment, and Internet mail must reflect that very basic fact. Historically, Internet mail software has not kept up with the technologies available to it for making mail useful to all end-users, regardless of the languages they speak and the kinds of characters they use. The goal of this report is to show how international characters are used today and should be used in the near future in mail user agents.
It is important to note that this report does not create any new standards; instead, it shows how current and proposed standards should be used. Most of the issues in internationalization already have standards, and those that do not are already being addressed in standards bodies such as the IETF and the Unicode Consortium. However, in many cases, there are too many standards, and developers are waiting to see which standard becomes dominant. One of the main purposes of this report is to help break this logjam.
Internationalization raises many problems, most of which have already been identified by other organizations. One of the biggest problems in the area of internationalization is that not all problems are universal. For example, many scripts do not have diacritic marks such as accents and cedillas, and people whose native scripts have no diacritic marks tend not to consider the problems of using them; similarly, many scripts do not have any concept of capital letters, and people whose native script doesn't have capital letters have less understanding of where these characters can cause problems.
Some of the most important problems of internationalization in Internet mail are covered in this report. They include:
- Allowing a sender to create a message and control information in one or more desired character sets
- Displaying a message and its control information using the correct character sets
- Including language information in messages and control information
- Interacting with older mail software that has weak or non-standard internationalization capabilities
- Displaying useful information for characters that cannot be directly displayed
Other important topics that are not covered in this report include using international characters in client commands and human-readable responses in the SMTP, POP, and IMAP protocols, internationalization in mailbox names and sorting in IMAP, and international characters in digital certificates used with Internet mail.
Finally, this report does not cover what current implementations do for internationalization. Instead, a separate report may be made on this topic.
This report describes the internationalization issues when creating and displaying Internet mail messages. For some issues, there are specific recommendations for what a program should do to facilitate the best possible results; these are marked as Recommendations. For other issues, only general suggestions are made.
The language used in discussions of internationalization is often strange to people outside the field. New terms and new definitions for existing terms have been invented because precise language is needed when discussing a field that is full of subtleties and shifting meanings.
This section covers only a few keywords and phrases. All readers of this report would do well to read the first five chapters of the Unicode Standard, if for no other reason than to get a handle on the terminology used in the discussion of internationalization. Most people, on reading the Unicode Standard for the first time, remark that they had no idea that there were so many variations on how characters are formed, how they are displayed, and so on. Many of the definitions used in this report come straight from the Unicode Standard.
Although the term "charset" has its origins as an abbreviation for "character set", a charset is not the same thing as a character set. Internet protocols generally use charsets, not character sets. The definitions of the various terms associated with charsets and characters are:
- A character is the smallest component of written language that has semantic value. A character has a single abstract meaning and/or shape, but not a specific shape.
- A glyph is a specific shape that a character can have when it is rendered or displayed. A single glyph may correspond to a single character, or it may correspond to many characters; for example, the same glyph is used to represent the Latin capital letter "P" and the Greek capital letter "Rho". Similarly, a single character may correspond to multiple glyphs due to font, formatting style, national differences, and other reasons.
- A character set (more precisely called a "coded character set" or "CCS") is a mapping from a set of abstract characters to a set of integers. Examples of coded character sets include ISO 10646, US-ASCII, and the ISO 8859 series.
- A character encoding scheme or "CES" is a mapping from one or more coded character sets to a set of octets. Some CESs are associated with a single CCS; for example, UTF-8 applies only to ISO 10646. Other CESs, such as ISO 2022, are associated with many CCSs.
- Finally, a charset is a method of mapping a sequence of octets to a sequence of abstract characters. A charset is, in effect, a combination of one or more CCSs with a CES. Charset names are registered by the IANA according to procedures documented in RFC 2278.
In some cases, a charset name matches a CCS or CES name. This is often the case when a particular CCS implies a single specific CES or vice versa. For example, the UTF-8 CES is only used with the ISO 10646 CCS, and consequently the registered charset name for CCS="ISO 10646" CES="UTF-8" is "UTF-8".
Some charsets are complex. Typical of this is the ISO-2022-JP charset, which is CCS="ASCII","JIS X 0201 left half","JIS X 0208" and CES="ISO 2022". Some charsets based on ISO 2022 are even more complicated than this.
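The relationship between abstract characters, charsets, and octets can be illustrated with a short sketch. It is only an illustration, written in Python on the assumption that the language's built-in codecs ("utf-8", "iso-8859-1", and "iso2022_jp") are acceptable stand-ins for the registered charsets of the same names; the sample strings are arbitrary.

```python
# The same abstract character maps to different octets under different charsets.
text = "é"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

utf8_octets = text.encode("utf-8")         # CCS ISO 10646 with CES UTF-8
latin1_octets = text.encode("iso-8859-1")  # ISO 8859-1, one octet per character

print(utf8_octets)    # b'\xc3\xa9'
print(latin1_octets)  # b'\xe9'

# ISO-2022-JP combines several CCSs and switches between them with
# escape sequences (ESC $ B selects JIS X 0208, ESC ( B returns to ASCII).
japanese = "日本"
iso2022jp_octets = japanese.encode("iso2022_jp")
print(iso2022jp_octets)

# Decoding with the matching charset recovers the same abstract characters.
assert utf8_octets.decode("utf-8") == text
assert latin1_octets.decode("iso-8859-1") == text
assert iso2022jp_octets.decode("iso2022_jp") == japanese
```

Decoding octets with the wrong charset produces different characters or an error, which is why the charset label must travel with the text.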
A language is a way that humans interact. In written form, language is expressed in characters. The same set of characters can often be used in many languages, and many languages can be expressed using different scripts. A particular charset may have different glyphs (shapes) depending on the language being used.
A mail user agent (MUA) is the software a user runs to send and receive Internet mail. A mail transport agent (MTA) is the software that moves the mail over the Internet. MUAs are often called "mail clients" and MTAs are often called "mail servers", but those terms do not correctly reflect the actions of the software.
Most MUAs are used by people, although there are certainly many MUAs that are automatic processes such as mailing list systems. For the purposes of this report, an MUA has four basic functions:
- Composing messages
- Submitting messages to an MTA so that they are sent to the recipient
- Accessing messages that have been sent to the user
- Displaying messages
Submitting messages involves the MUA acting as a client in the SMTP protocol. Accessing messages involves the MUA acting as a client in the IMAP or POP protocols; the MUA can also use some other method of reading mail from the message store. Composing and displaying messages do not involve Internet protocols, but rely heavily on Internet formats such as RFC 822 headers and MIME, as well as other Internet standards.
MTAs on the Internet run the SMTP protocol. MTAs mostly just move messages. The MTA that is responsible for receiving messages for the recipient accepts messages and saves them in a message store that the recipient has access to. This report only covers MUAs, not MTAs.
Because of the importance of internationalization, there are plenty of standards. Unfortunately, not all of the standards have been implemented by MUA developers. This section lists the main standards and proposals that implementors should know about.
The "IETF Policy on Character Sets and Languages", RFC 2277, lists the suggested practices for standards within the IETF that deal with internationalization. It is an excellent document for those wondering why the IETF does one thing or another with respect to internationalization. It covers topics such as when a protocol must have internationalization, what charsets to use, how to do language tagging, and so on. Note that RFC 2277 prescribes how protocols should implement internationalization, not what individual products that use the protocols should do.
RFC 2277 specifies that protocols must be able to use the UTF-8 charset for all text. The UTF-8 charset is defined in the Unicode Standard as well as in RFC 2279. RFC 2277 also specifies that protocols that transfer text must provide for carrying information about the language of that text; protocols should also provide for carrying information about the language of names, where appropriate. It recommends the use of language tagging as defined in "Tags for Language Names", RFC 1766.
Internationalized text and names appear in both parts of Internet mail messages: in the headers, and in the body of the message. RFC 2047 covers how to use international characters in some parts of non-MIME headers, such as addresses and subject headers. RFC 2231 extends the concepts in RFC 2047 to cover MIME parameter values, and also specifies a method for using language tagging for international characters in these headers. RFC 2046 describes how to specify the charset for each part of the body of the message.
Because charsets are so important to Internet standards, new charsets can be registered so that applications can refer to them. RFC 2278 defines what charsets are and how they can be registered with IANA.
It should be noted that the Unicode Standard also defines the UTF-7 charset, which was intended for Internet mail. However, MIME is quite capable of carrying UTF-8, and UTF-8 is expected to be used in many protocols, not just Internet mail. Fortunately, very few vendors implemented UTF-7, and its use is strongly discouraged in Internet mail.
ISO was historically the home of standards for internationalization, although that is no longer the case. However, many current non-ISO standards refer to earlier ISO standards. For example, RFC 1766 relies on ISO 639 for the names of the languages used (although it also specifies its own extension mechanism), and also refers to ISO 3166 for the names of countries. Similarly, RFC 2046 defines two sets of charsets: US-ASCII, and ISO-8859-X, which refers to the various parts of ISO 8859. Other IETF standards refer to ISO 2022 for some character sets.
More recently, ISO and the Unicode Consortium have created a single large character set that encompasses essentially all of the characters from all living languages (and many defunct languages as well). This character set is specified in ISO/IEC 10646 and in the Unicode Standard. However, the Unicode Standard goes much further than ISO/IEC 10646, and gives semantics to the characters, categorizes them, has many useful rules for handling them, and imposes tighter compliance requirements to guarantee the same behavior on different platforms.
As described earlier, the current IETF practice for protocols is to use the UTF-8 charset, which maps to the characters in the Unicode Standard and ISO 10646. UTF-8 comes from the Unicode Standard and ISO/IEC 10646, although the definition of UTF-8 that is used in Internet protocols comes from RFC 2279.
The Unicode Consortium has defined a way to label the language of the text that is encoded in the Unicode Standard. The document that defines this tagging is Unicode Technical Report #7: Plane 14 Characters for Language Tags. These tags can be used to switch languages within a single block of text; this differs from the MIME tagging defined in RFC 1766, which defines a single language for an entire body part. This kind of embedded tagging is most useful for multi-language text.
Internet mail messages are created by humans and by computer programs (usually MUAs). The input needs for humans and computer programs differ, of course, even though the output should be the same. For example, a computer program doesn't need to know how to "type" a particular character in a message.
Each body part can have only one charset. As described in RFC 2045, the charset is specified as a parameter of the Content-type header field. RFC 2046 defines the default charset, if none is specified, to be US-ASCII.
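For example, a Content-type header for a plain-text body part in the ISO-8859-1 charset might be "Content-type: text/plain; charset=iso-8859-1".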
Recommendation: All body parts that include human-readable text and are created with a Content-type header should include an explicit charset parameter, even if the charset is US-ASCII.
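As a minimal sketch of this recommendation, the following Python fragment builds body parts whose Content-type header always carries an explicit charset parameter, including for US-ASCII. It assumes the `email` package from the Python standard library; the helper name `make_text_message` and the sample text are hypothetical.

```python
from email.message import EmailMessage


def make_text_message(body: str, charset: str) -> EmailMessage:
    """Build a single-part text message with an explicit charset parameter."""
    msg = EmailMessage()
    # set_content() writes the Content-Type header, including the
    # charset parameter, and chooses a suitable transfer encoding.
    msg.set_content(body, charset=charset)
    return msg


ascii_msg = make_text_message("Hello, world.", "us-ascii")
utf8_msg = make_text_message("Grüße aus Köln", "utf-8")

# Both headers name their charset explicitly, even the US-ASCII one.
print(ascii_msg["Content-Type"])
print(utf8_msg["Content-Type"])
```

A receiving program can then read the label back with `get_content_charset()` instead of guessing.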
There is a strong tendency in the IETF to start using the UTF-8 charset as soon as possible. Of course, there is always a "who starts first" problem with adopting any new technology, but there is general agreement that early adoption of UTF-8 is quite important.
Recommendation: All mail-creating programs created or revised after January 1, 1999, must be able to create mail using the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot create mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating programs should try to meet this requirement as early as possible.
At the time this report is being written, few mail-displaying programs support the UTF-8 charset. Thus, it is not recommended that mail-creating programs should immediately start sending only UTF-8. There are dozens of charsets in wide use throughout the world in currently-deployed MUAs. However, the use of the UTF-7 charset is strongly discouraged.
Mail recipients expect to receive mail that they can view (although most people have gotten used to getting some messages that they cannot view due to charsets that their viewing program does not handle). Thus, senders still need to be able to control the charset that is used when creating messages.
Recommendation: All mail-creating programs that are controlled by humans should allow the sender to choose the charset used to create a message. These programs should also give advice to the user about the different charsets, such as about the likelihood that the recipient will be able to display a particular charset.
Of course, guessing what charset to use for a recipient with unknown capabilities is quite difficult. Even if the recipient has sent a message, the sender cannot assume that the charset used in that message is the best charset they can use. Neither side knows whether the other can handle a more capable charset, so neither knows whether it is safe to be the first to escalate.
Most MUAs have a "default" charset they use for messages. This default might be set based on a number of factors, including the country of origin of the software, the location of the user, settings from the operating system on which the software is being run, and so on. Because the user often knows the capabilities of most of the recipients of mail they send, the user should be able to set the default charset used in new messages.
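One way a mail-creating program might combine a user-set default with the content of a particular message is sketched below. This is an assumption about a possible policy, not something the report prescribes: the function name `choose_charset` and the preference order are hypothetical, and the user is expected to supply their own list.

```python
def choose_charset(text: str,
                   preferences=("us-ascii", "iso-8859-1", "utf-8")) -> str:
    """Return the first preferred charset that can represent all of text."""
    for charset in preferences:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset lacks some character; try the next one
        return charset
    # UTF-8 covers the whole Unicode repertoire, so use it as a last resort.
    return "utf-8"


print(choose_charset("Hello"))    # us-ascii
print(choose_charset("café"))     # iso-8859-1
print(choose_charset("日本語"))   # utf-8
```

Putting the most widely displayable charsets first reflects the concern above that the recipient's capabilities are usually unknown.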
Humans reading a text message can usually determine the language of the message from the context. However, as message-displaying programs get more sophisticated, they may take advantage of additional language information supplied in messages. For example, a display program might choose slightly different glyph representations for Chinese or Japanese characters based on the language of the text. A program that hyphenates text may use language information to make better choices for particular words. A system that reads a message out loud instead of displaying it on a screen would also clearly benefit from knowing the language of the text. However, in all these cases, the reader should be allowed to specify whether to use the additional language information, or just treat the text as in the reader's native language.
RFC 1766 defines how to create language tags for message body parts, and Unicode Language Tags describes how to mark the language of the text that uses the Unicode Standard. Many programs process MIME language information, but Unicode language tags are very new and are handled by very few agents so far. Thus, every mail body part should have a Content-language header if possible, and parts that have more than one language should use UTF-8 and Unicode language tags.
Recommendation: All body parts that are created with a Content-type header that includes human-readable text should also include a Content-language header. This practice makes it more likely that programs whose processing depends on the language of the text will handle the message correctly. Note that the MIME media type does not define whether or not the content is human-readable, and the Content-language header should be used with all types of human-readable content, not just plain text.
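A short sketch of a body part that carries both a charset label and a language label follows. It assumes Python's standard `email` package; the French sample text and the tag "fr" are illustrative only.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg.set_content("Bonjour tout le monde.", charset="iso-8859-1")

# An RFC 1766 language tag; the header applies to the entire body part.
msg["Content-Language"] = "fr"

print(msg["Content-Type"])
print(msg["Content-Language"])
```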
Recommendation: All plain text body parts that use UTF-8 and have more than one language should use Unicode Language Tags in addition to a Content-language header. However, Unicode Language Tags should only be used with plain text body parts that have more than one language; they should not be used with body parts that have a single language, nor should they be used with structured text body parts such as those coded with HTML.
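The embedded tagging for multi-language plain text can be sketched as follows. The sketch assumes the Plane 14 layout in which U+E0001 (LANGUAGE TAG) introduces a tag and the characters U+E0020 through U+E007E mirror printable ASCII; the helper names `language_tag` and `tag_runs` are hypothetical.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG introduces a tag
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at U+E0020..U+E007E


def language_tag(tag: str) -> str:
    """Encode a language tag such as 'fr' as Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("language tags must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


def tag_runs(runs) -> str:
    """Join (language, text) pairs into one string, tagging each run."""
    return "".join(language_tag(lang) + text for lang, text in runs)


body = tag_runs([("en", "Good morning. "), ("fr", "Bonjour.")])
octets = body.encode("utf-8")  # the tagged text is then sent as UTF-8
```

A body part with a single language would skip `tag_runs` entirely and rely on the Content-language header alone.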
Recommendation: All mail-creating programs should allow users to use non-ASCII characters in message headers, as described in RFC 2047 and RFC 2231. Headers that conform to these two RFCs are not known to harm any mail-displaying process that does not conform to the RFCs. The charsets used in these headers should be chosen using similar methods to choosing charsets for the bodies of the messages to which they are attached.
Internet mail messages are processed by computer programs before they are displayed to humans or further processed by other programs. Displaying a message with international content involves two basic steps: understanding the charset in the message and displaying that charset. It is important to note that these two steps are independent, in that a mail-displaying program might understand a particular charset without being able to display all the glyphs represented in that charset. In fact, it is expected that few programs that handle UTF-8 for creation or display will be able to display all characters in the charset.
There is a strong tendency in the IETF to start using the UTF-8 charset as soon as possible. Of course, there is always a "who starts first" problem with adopting any new technology, but there is general agreement that early adoption of UTF-8 is quite important.
Recommendation: All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-displaying programs should try to meet this requirement as early as possible.
As noted above, programs that display UTF-8 do not have to display all possible UTF-8 characters. In fact, it is likely that only a few such programs will exist, mostly due to display restrictions of the operating systems on which the mail programs run. Therefore, mail-displaying programs should have a method for displaying characters in messages that they can't represent by the correct glyph.
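One possible way to separate "understanding" a charset from "displaying" it is sketched below. The sketch makes a simplifying assumption: the display's capabilities are modeled as a charset name, whereas a real program would more likely ask the operating system or font system which glyphs are available. The function name `render_with_fallback` and both fallback styles are illustrative, not recommended behavior.

```python
def render_with_fallback(text: str, display_charset: str,
                         mode: str = "glyph") -> str:
    """Replace characters outside display_charset with a visible stand-in."""
    out = []
    for ch in text:
        try:
            ch.encode(display_charset)  # proxy for "can the display show ch?"
            out.append(ch)
        except UnicodeEncodeError:
            if mode == "glyph":
                out.append("?")                   # one stand-in for everything
            else:
                out.append(f"<U+{ord(ch):04X}>")  # show the code point instead
    return "".join(out)


message = "Zoë 日本"  # already decoded from UTF-8 octets
print(render_with_fallback(message, "iso-8859-1"))               # Zoë ??
print(render_with_fallback(message, "iso-8859-1", "codepoint"))  # Zoë <U+65E5><U+672C>
```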
This report does not make any recommendations on how to represent undisplayable characters. However, any mail-displaying program that can understand a charset that it cannot fully display should have some reasonable method for showing undisplayable characters. This might be to use a single glyph that represents every undisplayable character, or it might be to show the underlying encoding for each undisplayable character, or some other method. The Unicode Standard contains a great deal of information on undisplayable characters, and additional suggestions on handling undisplayable characters can be found in section 5.4 of the HTML 4.0 specification from the W3C.
Although the vast majority of this report is about MUAs, MTAs can also play a role in internationalization. Many MUAs expect to be able to send messages with more than just 7-bit data in them, and MTAs should be able to transmit these messages unaltered. RFC 1652, which describes the ESMTP 8BITMIME extension, has been implemented in most popular SMTP servers and should be used wherever possible. This extension makes UTF-8 much more efficient, and has many related benefits as well.
Recommendation: All SMTP servers should support the 8BITMIME extension, as described in RFC 1652.
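From the submitting side, use of the extension might be sketched as follows. The EHLO reply shown is a made-up sample, the function names are hypothetical, and the choice of quoted-printable as the fallback is an assumption (base64 would serve equally well for 7-bit transports). The "7bit" test is also simplified, since it checks only for octets above 127 and ignores line-length limits.

```python
def supports_8bitmime(ehlo_response: str) -> bool:
    """Return True if an EHLO reply advertises the 8BITMIME keyword."""
    for line in ehlo_response.splitlines():
        # Reply lines look like "250-KEYWORD args" or "250 KEYWORD args".
        keyword = line[4:].split(" ", 1)[0]
        if keyword.upper() == "8BITMIME":
            return True
    return False


def choose_transfer_encoding(body: bytes, ehlo_response: str) -> str:
    """Pick a Content-Transfer-Encoding for a text body."""
    if all(octet < 0x80 for octet in body):
        return "7bit"                # nothing to protect
    if supports_8bitmime(ehlo_response):
        return "8bit"                # send the octets unaltered
    return "quoted-printable"        # 7-bit-safe fallback


SAMPLE_EHLO = "250-mail.example.org\r\n250-8BITMIME\r\n250 SIZE 10240000"
utf8_body = "Grüße".encode("utf-8")
print(choose_transfer_encoding(utf8_body, SAMPLE_EHLO))  # 8bit
```

In Python's own `smtplib`, the equivalent check after `ehlo()` is `SMTP.has_extn("8bitmime")`.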
In order for Internet mail to handle internationalization well, most or all of the major products must support internationalization well. Fortunately, many of the largest mail vendors are planning to release fully internationalized MUAs in the near future, and it is likely that other vendors will follow soon due to market pressures.
One concern that many people outside the US have is that MUAs will send and receive UTF-8, but only handle the portion of UTF-8 that overlaps with US-ASCII. Others worry that some MUAs will support only the characters in the iso-8859-1 charset and claim that they handle international characters. Clearly, it is difficult for an MUA to handle characters beyond what the operating system it is running under can show. As more and more MUAs start sending UTF-8, however, there will be a world-wide expectation that recipients will be able to view messages. Thus, every MUA maker should not only try to handle UTF-8, but should also work hard to display as many characters within that charset as possible.
Because the Unicode Standard covers almost every known character used anywhere in the world today, it makes a good "central platform" for internationalization. The Unicode Consortium has freely-available transcoding tables for all common charsets to and from the Unicode Standard. This means that software that uses the Unicode Standard as its core character set can transcode to and from any common charset easily using the transcoding tables. It also means that all software that uses these mapping tables will convert from one charset to another in an identical fashion.
Recommendation: All mail-creating and mail-displaying programs created or revised after January 1, 1999, should be able to handle many common charsets in addition to UTF-8. Another way to say this is that any mail-creating and mail-displaying program created or revised after January 1, 1999, that cannot handle a wide variety of common charsets should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating and mail-displaying programs should try to meet this requirement as early as possible.
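Using the Unicode Standard as a pivot between charsets can be sketched in a few lines. Here Python's built-in codecs are assumed to stand in for the transcoding tables, and the function name `transcode` is hypothetical.

```python
def transcode(octets: bytes, source_charset: str, target_charset: str) -> bytes:
    """Convert octets from one charset to another by way of Unicode text."""
    text = octets.decode(source_charset)  # source charset -> Unicode
    return text.encode(target_charset)    # Unicode -> target charset


latin1 = "café".encode("iso-8859-1")
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(latin1)  # b'caf\xe9'
print(utf8)    # b'caf\xc3\xa9'
```

With this design, supporting N charsets requires N tables to and from Unicode rather than a separate table for every pair of charsets. Conversion fails with an error when the target charset lacks a character, and a program would need to handle that case.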
The following are duplicates of the recommendations from earlier parts of the report.
- Explicit charset parameter: All body parts that include human-readable text and are created with a Content-type header should include an explicit charset parameter, even if the charset is US-ASCII.
- Sending UTF-8: All mail-creating programs created or revised after January 1, 1999, must be able to create mail using the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot create mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating programs should try to meet this requirement as early as possible.
- Choosing charsets on creation: All mail-creating programs that are controlled by humans should allow the sender to choose the charset used to create a message. These programs should also give advice to the user about the different charsets, such as about the likelihood that the recipient will be able to display a particular charset.
- Specifying languages: All body parts that are created with a Content-type header that includes human-readable text should also include a Content-language header. This practice makes it more likely that programs whose processing depends on the language of the text will handle the message correctly. Note that the MIME media type does not define whether or not the content is human-readable, and the Content-language header should be used with all types of human-readable content, not just plain text.
- Multi-language text: All plain text body parts that use UTF-8 and have more than one language should use Unicode Language Tags in addition to a Content-language header. However, Unicode Language Tags should only be used with plain text body parts that have more than one language; they should not be used with body parts that have a single language, nor should they be used with structured text body parts such as those coded with HTML.
- Non-ASCII headers: All mail-creating programs should allow users to use non-ASCII characters in message headers, as described in RFC 2047 and RFC 2231. Headers that conform to these two RFCs are not known to harm any mail-displaying process that does not conform to the RFCs. The charsets used in these headers should be chosen using similar methods to choosing charsets for the bodies of the messages to which they are attached.
- Displaying UTF-8: All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-displaying programs should try to meet this requirement as early as possible.
- MTAs and 8-bit content: All SMTP servers should support the 8BITMIME extension, as described in RFC 1652.
- Handling all common charsets: All mail-creating and mail-displaying programs created or revised after January 1, 1999, should be able to handle many common charsets in addition to UTF-8. Another way to say this is that any mail-creating and mail-displaying program created or revised after January 1, 1999, that cannot handle a wide variety of common charsets should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating and mail-displaying programs should try to meet this requirement as early as possible.