| |
| Version | 5.2.0 (draft 5) |
| Authors | Mark Davis (markdavis@google.com), Michel Suignard |
| Date | 2009-11-12 |
| This Version | http://www.unicode.org/reports/tr46/tr46-2.html |
| Previous Version | http://www.unicode.org/reports/tr46/tr46-1.html |
| Latest Version | http://www.unicode.org/reports/tr46/ |
| Revision | 2 |
This document specifies processing that provides compatibility between older and newer versions of internationalized domain names (IDN) for lookup in client software. It allows applications such as browsers and emailers to handle both the original version of internationalized domain names (IDNA2003) and the newer version (IDNA2008) compatibly, avoiding possible interoperability and security problems.
[Review Note: At this point, IDNA2008 is still in development, so this draft may change as IDNA2008 changes.]
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
One of the great strengths of domain names is universality. With http://Apple.com, you can get to Apple's website no matter where you are in the world, and no matter which browser you are using. With markdavis@google.com, you can send an email to an author of this specification, no matter which country you are in, and no matter which emailer you are using.
Initially, domain names were restricted to only handling ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one could only use Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick those Greek letters that would correspond to the desired English letters. One would have to guess at the spelling of particular words, because there are not exact matches between scripts.
A large majority of the world’s population faced this situation until recently, because their languages use non-ASCII characters.
A system was introduced in 2003 for internationalized domain names (IDN). This system is called Internationalizing Domain Names for Applications, or IDNA for short. It consists of a series of RFCs collectively known as IDNA2003 [IDNA2003]. This system allows non-ASCII Unicode characters, which includes not only the characters needed for Latin-script languages other than English (such as Å, Ħ, or Þ), but also different scripts, such as Greek, Cyrillic, Tamil, or Korean.
The IDNA mechanism for allowing non-ASCII Unicode characters in domain names involves applying the following steps to each label in the domain name that contains Unicode characters:
For example, you can now type in http://Bücher.de into the address bar of any modern browser, and you will go to a corresponding site, even though the "ü" is not an ASCII character. This works because the IDN resolves to the Punycode string which is actually stored by the DNS for that site. Similarly, when a browser interprets a web page containing a link such as <a href="http://Bücher.de">, the appropriate site is reached. (In this document, when phrasing like "a browser interprets" is used, it refers both to domain names parsed out of URLs entered in an address bar and to those contained in links internal to HTML text.)
In this case, for the IDN Bücher.de, the Punycode value actually used for the domain names on the wire is http://xn--bcher-kva.de. The Punycode version is also typically transformed back into Unicode form for display. The resulting display string will be a string which has already been mapped according to the IDNA2003 rules. So in this example we end up with a display string that has been casefolded to lowercase:
http://Bücher.de → http://xn--bcher-kva.de → http://bücher.de
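The round trip above can be reproduced with a minimal sketch that relies on Python's built-in "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations. The helper names `to_wire` and `to_display` are illustrative only and do not come from this specification.

```python
# Sketch: IDNA2003 round trip using Python's standard "idna" codec.
# to_wire / to_display are hypothetical helper names for illustration.

def to_wire(domain: str) -> str:
    """Map and Punycode-encode each label (IDNA2003 ToASCII)."""
    return domain.encode("idna").decode("ascii")

def to_display(wire: str) -> str:
    """Decode Punycode labels back to Unicode (IDNA2003 ToUnicode)."""
    return wire.encode("ascii").decode("idna")

wire = to_wire("Bücher.de")
display = to_display(wire)
print(wire)     # xn--bcher-kva.de
print(display)  # bücher.de
```

Note that the display form is the mapped (lowercased) string, not the string the user originally typed.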
There is a new version of IDNA under development. This version also consists of a collection of RFCs and is usually called IDNA2008 [IDNA2008]. The "2008" in that term does not reflect the actual date of approval, which is still pending and expected to occur in late 2009 or early 2010.
[Review note: reword above when 2008 is released.]
For the most common cases, the processing in IDNA2003 and IDNA2008 is identical. Both transform a Unicode domain name in a URL (like http://öbb.at) to the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain strict backwards compatibility with IDNA2003.
The main differences between the two are:
For more detail on the differences, see Section 8, IDNA Comparison.
The cases of deviations and unpredictable changes introduced by the differences between IDNA2008 and IDNA2003 may cause both interoperability and security problems. They affect extremely common characters, such as all uppercase characters, all half-width or full-width characters (commonly used in Japan, China, and Korea), and certain other characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA).
IDNA2003 requires a mapping phase, which maps http://ÖBB.at to http://öbb.at, for example. Mapping typically involves mapping uppercase characters to their lowercase pairs, but it also involves other types of mappings between equivalent characters, such as mapping half-width katakana characters to normal (full-width) katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both http://CNN.com and http://cnn.com work identically. They would not expect the addition of an accent to change the casing behavior: they expect that if http://Bruder.com is the same as http://bruder.com, then of course http://Brüder.com is the same as http://brüder.com. In other scripts there are variations in characters similar to case in this respect. The IDNA2003 mapping is based on data specified by Unicode; this mapping was later formalized as the Unicode property [NFKC_CaseFold].
IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping" or "Custom Mapping"), with no limitation on what the mapping can do to disallowed characters. Disallowed characters even include ASCII uppercase characters, if they occur in an IDN label. For more information on the permitted mappings, see the Protocol document of [IDNA2008], Section 4.2 Permitted Character and Label Validation and Section 5.2 Conversion to Unicode. An implementation of IDNA2008 which uses the option of Custom Mapping can, in principle, allow any particular mapping. Such mappings can have unpredictable results regarding the exact interpretation of the processed IDNs. For example, the following mappings show cases where IDNs are mapped to what would be considered completely different domain names by IDNA2003 rules:
[Review note: Fix the numbers/titles for the Protocol document if they change before 2008 is released.]
Note that there is a dotless i in the result of the mapping illustrated in #4. This has the consequence that the mapped IDN resolves to a different location than the mapped IDN in #3.
IDNA2008 does define a particular mapping. That mapping is not normative, and does not attempt to be compatible with IDNA2003. For more information, see the Mapping document in [IDNA2008].
There are a few situations where the strict application of IDNA2008 will result in the resolution of IDNs to different IP addresses than in IDNA2003, unless the registry or registrant takes special action. This affects a relatively small number of characters, but these characters are common in particular languages. Because of this common occurrence, a significant number of strings for domain names are affected in those languages. This set of characters is referred to as "Deviations". There are four of these, as shown in Table 1, Deviation Characters.
For more information on the rationale for the occurrence of these Deviations in IDNA2008, see the [IDN-FAQ].
The differences in interpretation of Deviation characters result in the potential for security exploits. Consider a scenario involving http://www.sparkasse-gießen.de, a German IDN for "Gießen Savings and Loan".
Alice ends up at the phishing site, supplies her bank password, and her money is stolen. While the .DE registry (DENIC) might have a policy about bundling all of the variants of ß together (so that they all have the same owner), it is not required of registries. It is quite unlikely that all registries will have or enforce such a bundling policy in all such cases.
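The divergence behind this scenario can be illustrated with a short sketch. It assumes Python's built-in "idna" codec as a stand-in for IDNA2003 processing, and uses a hypothetical helper, `unmapped_wire`, that Punycode-encodes each non-ASCII label without any mapping to imitate a client that keeps ß as a distinct character. It is not a full IDNA2008 implementation.

```python
# Sketch: the same typed name yields two different names on the wire.
# idna2003_wire uses the stdlib codec; unmapped_wire is a hypothetical
# simplification that skips mapping (roughly what keeping ß implies).

def idna2003_wire(domain: str) -> str:
    return domain.encode("idna").decode("ascii")

def unmapped_wire(domain: str) -> str:
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

name = "www.sparkasse-gießen.de"
print(idna2003_wire(name))  # www.sparkasse-giessen.de
print(unmapped_wire(name))  # a distinct xn-- label that keeps the ß
```

Because the two results are different DNS names, they can be registered by different parties unless the registry bundles them.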
There are two Deviations of particular concern. IDNA2008 allows ZWJ and ZWNJ characters in labels. By contrast these are removed by the mapping in IDNA2003. In addition to this difference in mapping, these characters represent a special security concern because they are normally invisible. That is, the sequence "a<ZWJ>b" looks just like "ab". IDNA2008 provides a special category called CONTEXTJ for ZWJ and ZWNJ, and only permits them to occur in certain contexts: certain sequences of Arabic or Indic characters. However, lookup applications are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, those context restrictions do not catch all cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.
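As a small illustration of why the joiners are a concern, the sketch below compares a label containing ZWJ with its visually identical counterpart, and shows the IDNA2003 behavior using Python's built-in "idna" codec. The variable names are illustrative.

```python
# Sketch: an invisible joiner makes two strings differ while looking alike.
ZWJ = "\u200d"  # ZERO WIDTH JOINER

with_joiner = "a" + ZWJ + "b"
plain = "ab"

print(with_joiner == plain)  # False: the strings are distinct
print(len(with_joiner))      # 3 code points, although only two are visible

# Under IDNA2003 (stdlib "idna" codec) the joiner is mapped to nothing,
# so both spellings reach the same name.
print((with_joiner + ".com").encode("idna"))  # b'ab.com'
```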
To allow client-side applications to work around the incompatibilities between IDNA2003 and IDNA2008 for lookup, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.1 and later. In so doing, it also incorporates the repertoire extensions provided by IDNA2008.
The Unicode IDNA Compatibility Processing uses the standard Unicode mapping, [NFKC_CaseFold], for the mapping described in this document. As a result, the domain name in http://ÖBB.at is valid, and maps to http://öbb.at. It also allows domain names such as the one in http://√.com (which has an associated web page); such names are allowed in IDNA2003. Based on security considerations, implementations may restrict or flag (in a UI) domain names that include symbols and punctuation. For more information, see UTR#36: Unicode Security Considerations [UTR36].
The result of this Compatibility Processing is a series of labels, each separated by U+002E ( . ) FULL STOP. For DNS lookup, the result of the Compatibility Processing is transformed by Punycoding each label that contains non-ASCII.
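The following is a rough sketch of that pipeline, not the normative algorithm. It approximates the [NFKC_CaseFold] mapping with Python's `unicodedata.normalize` and `str.casefold` (the real property has additional details, and the normative data is the IDNA Mapping Table), performs no validity checks, and splits only on U+002E. The function names are hypothetical.

```python
import unicodedata

def approx_nfkc_casefold(s: str) -> str:
    # Approximation only: NFKC, then case folding, then NFKC again.
    # The actual NFKC_CaseFold property also removes default ignorables.
    return unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())

def to_lookup_ascii(domain: str) -> str:
    """Map the whole name, then Punycode each label containing non-ASCII."""
    mapped = approx_nfkc_casefold(domain)
    out = []
    for label in mapped.split("."):  # only U+002E is handled in this sketch
        if label.isascii():
            out.append(label)
        else:
            out.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(out)

print(to_lookup_ascii("ÖBB.at"))     # xn--bb-eka.at
print(to_lookup_ascii("Bücher.de"))  # xn--bcher-kva.de
```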
Using the Unicode IDNA Compatibility Processing to transform an IDN into a form suitable for DNS lookup is comparable to the tactic of "try IDNA2008 then try IDNA2003". However, this approach avoids a dual lookup, which can be very problematic. It allows browsers and other clients such as search engines to have a single processing step, without having to maintain two different implementations and multiple tables. It accounts for a number of edge cases that would cause problems, and provides a stable definition with predictable results that will remain absolutely backwards compatible in future versions of Unicode.
For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the [IDN-Demo].
This document provides a compatibility mechanism for dealing with IDNA domain name lookup and display. To this end, it specifies two types of processing: Lookup Processing and Display Processing. It does not deal with the registration of IDNs.
Note that neither the Unicode IDNA Compatibility Processing nor IDNA2008 address security problems associated with confusables (the so-called "paypal.com" problem). IDNA2008 does disallow certain symbols and punctuation characters that can be used for spoofing, such as spoofs of the slash character ("/"). These are, however, an extremely small fraction of the confusable characters used for spoofing. Moreover, confusable characters themselves account for a small proportion of phishing problems: most are cases like "secure-wellsfargo.com". For more information, see [Bortzmeyer].
It is strongly recommended that UTR#36: Unicode Security Considerations [UTR36] be consulted for information on dealing with confusables.
For IDNA2003 applications, it has been customary to display the processed string to the user. This is helpful for security, because it reduces the opportunity for visual confusability. Thus, for example, http://googIe.com (with a capital I in place of the L) is revealed as http://googie.com. However, for the case of the Deviations, the distinction between the original and processed form is especially important for users. Thus in displaying domain names, it is recommended that the Display Processing be applied. This is the same as Lookup Processing, except that it excludes the deviations: ß, ς, and joiners.
Labels presented to a browser may or may not be in the display form preferred by a target site; for more information see the [IDN-FAQ]. This specification defines a default display algorithm in Section 4, Processing.
This specification is primarily targeted at applications doing Lookup Processing for IDNs. There is, however, one strong recommendation for registries: do not allow the registration of labels that are invalid according to Lookup Processing. The registration of such a label would not be found by browsers and search engines following Unicode IDNA Compatibility Processing.
The label that is actually registered and inserted into a registry is always a label that has been processed: for example, http://xn--bcher-kva.de, which corresponds to http://bücher.de. It may nevertheless be useful for a registry to also ask for "unprocessed" labels as part of the registration process, such as http://Bücher.de, so that it is aware of the registrant's intent. However, such unprocessed labels must be handled carefully:
Sets of code points are defined using properties and the syntax of UTS#18: Unicode Regular Expressions [UTS18]. For example, the set of combining marks is represented by the syntax \p{gc=M}. An additional syntactic notation beyond the syntax of UTS#18 is used here: the "+" indicates the addition of elements to a set.
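To make the set notation concrete, here is a minimal sketch of how \p{gc=M} and the "+" operation could be evaluated with Python's `unicodedata` module. The function name `in_gc_m`, the sampled range, and the two added code points are chosen purely for illustration and are not taken from any table in this specification.

```python
import unicodedata

def in_gc_m(cp: int) -> bool:
    """True if the code point has General_Category M (Mn, Mc, or Me)."""
    return unicodedata.category(chr(cp)).startswith("M")

# \p{gc=M}, restricted here to the range U+0300..U+036F for brevity
marks = {cp for cp in range(0x0300, 0x0370) if in_gc_m(cp)}

# "+" adds elements to a set; the joiners are an illustrative choice
extended = marks | {0x200C, 0x200D}

print(in_gc_m(0x0308))  # True: COMBINING DIAERESIS
print(in_gc_m(0x0061))  # False: LATIN SMALL LETTER A
```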
In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:
The requirements for conformance on implementations of the Unicode IDNA Compatibility Processing algorithm are as follows:
| C1 | Given a version of Unicode and a Unicode String, a conformant implementation of Lookup Processing shall replicate the results given by applying the Lookup Processing algorithm specified by Section 4, Processing. |
| C2 | Given a version of Unicode and a Unicode String, a conformant implementation of Display Processing shall replicate the results given by applying the Display Processing algorithm specified by Section 4, Processing. |
These specifications are logical ones, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as the result specified by the logical algorithm.
Any conformant implementation may also have tighter validity criteria than those imposed by Section 6, Validity Criteria. An application could, for example, disallow or warn of domain name labels with certain characteristics:
For more information, see UTR#36: Unicode Security Considerations [UTR36].
The input to Unicode IDNA Compatibility Processing is a prospective domain_name string expressed in Unicode. The domain name consists of a sequence of labels with dot separators, such as "Bücher.de".
Note: For more information about the composition of a URL, see Section 3.5 of [RFC1034].
The input domain_name string must have had all escaped Unicode code points converted to Unicode code points. For example,
U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341 could have been escaped as any of the following:
The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string.
Any input domain_name string that does not abort with an error in the application of these steps is valid according to this specification. Conversely, if an input domain_name string causes an error, then that input domain_name string is not valid. The processing is idempotent: reapplying the processing to the output makes no further changes. For examples, see Table 2, Examples of Lookup Processing.
There are two types of processing: Lookup Processing and Display Processing. These differ only in how the deviation code points in the mapping table are handled. The result of Lookup processing can be converted to a domain name string containing Punycode labels ("asciified").
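The difference between the two types of processing can be sketched as follows. The table here is a tiny hand-written sample, not the IDNA Mapping Table. This illustration treats any code point missing from the sample as valid, which is a simplification, and the handling of each status value (error on disallowed, removal of ignored, replacement of mapped) is an assumption of the sketch rather than a restatement of the normative steps. The names `SAMPLE_TABLE` and `map_string` are hypothetical.

```python
import unicodedata

# Hypothetical sample entries: code point -> (status, mapping value)
SAMPLE_TABLE = {
    0x002C: ("disallowed", ""),
    0x0042: ("mapped", "b"),
    0x00AD: ("ignored", ""),
    0x00DF: ("deviation", "ss"),
    0x03C2: ("deviation", "\u03c3"),
    0x200C: ("deviation", ""),
    0x200D: ("deviation", ""),
}

def map_string(s: str, mode: str = "lookup") -> str:
    out = []
    for ch in s:
        # Simplification: code points not in the sample are treated as valid.
        status, mapping = SAMPLE_TABLE.get(ord(ch), ("valid", ""))
        if status == "disallowed":
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
        if status == "ignored":
            continue
        if status == "mapped":
            out.append(mapping)
        elif status == "deviation":
            # Lookup applies the mapping; Display keeps the character.
            out.append(mapping if mode == "lookup" else ch)
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

print(map_string("Bloß.de"))             # bloss.de
print(map_string("Bloß.de", "display"))  # bloß.de
```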
Note: Some browsers also allow characters such as underscore ("_") in domain names. Any such extension is outside of the scope of this document.
Implementations are advised to apply additional tests to these labels such as those described in UTR#36: Unicode Security Considerations [UTR36], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI.
| Input | Step 1 | Step 2 | Step 3 | Step 4 | Comment |
| Bloß.de | bloss.de | = | = | valid | maps uppercase and eszett |
| u¨.com | = | ü.com | = | valid | normalizes u + umlaut |
| xn--tda.com | xn--tda.com | = | ü.com | valid | xn--tda = ü |
| xn--u-ccb.com | = | = | u¨.com | error | xn--u-ccb = u + umlaut |
| a⒈com | error | | | | ⒈ is not valid |
| xn--a-ecp.ru | xn--a-ecp.ru | = | a⒈.ru | error | xn--a-ecp = a⒈ |
| xn--a.pt | xn--a.pt | = | error | | invalid Punycode |
| 日本語。JP | 日本語.jp | = | = | valid | mapping full width characters |
| ☕.us | = | = | = | valid | post Unicode 3.2 character |
There are a number of optimizations that can be applied to this processing. These optimizations can improve performance, reduce table size, make use of existing NFKC transform mechanisms, and so on. For example:
For each code point in Unicode, the IDNA Mapping Table provides a status value. If this status value is mapped or deviation, the table also supplies a mapping value for that code point. A table is provided for each version of Unicode starting with Unicode 5.1, in versioned directories under [IDNA-Table]. Each table for a version of the Unicode Standard will always be backwards compatible with previous versions of the table: only characters with the status value disallowed may change in status or mapping value.
A description of the derivation of these tables is provided in Section 7, Mapping Table Derivation. As for derived properties in the Unicode Character Database, the description of the derivation is informative. Only the data in IDNA Mapping Table is normative for the application of this specification.
The files use a semicolon-delimited format similar to those in the Unicode Character Database. The first field is the code point; the second field is the status value; and the third field is the mapping value. Code points are expressed in hexadecimal. The status values are one of the following five values: valid, disallowed, ignored, mapped, and deviation.
Example:
0000..002C ; disallowed # NULL..COMMA
002D ; valid # HYPHEN-MINUS
...
0041 ; mapped ; 0061 # LATIN CAPITAL LETTER A
...
00AD ; ignored # SOFT HYPHEN
...
00DF ; deviation ; 0073 0073 # LATIN SMALL LETTER SHARP S
...
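A parser for this file format could look like the following sketch. It assumes only what is described above: semicolon-delimited fields, hexadecimal code points (optionally a range written with ".."), an optional mapping field of space-separated code points, and a trailing "#" comment. The function name `parse_idna_table_line` is hypothetical.

```python
from typing import Optional, Tuple

def parse_idna_table_line(line: str) -> Optional[Tuple[int, int, str, str]]:
    """Return (first_cp, last_cp, status, mapping) or None for non-data lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data or data == "...":
        return None
    fields = [f.strip() for f in data.split(";")]
    cp_field, status = fields[0], fields[1]
    if ".." in cp_field:
        first, last = (int(x, 16) for x in cp_field.split(".."))
    else:
        first = last = int(cp_field, 16)
    mapping = ""
    if len(fields) > 2 and fields[2]:
        # The mapping value is a sequence of hexadecimal code points.
        mapping = "".join(chr(int(cp, 16)) for cp in fields[2].split())
    return first, last, status, mapping

print(parse_idna_table_line("0000..002C ; disallowed # NULL..COMMA"))
print(parse_idna_table_line("00DF ; deviation ; 0073 0073 # LATIN SMALL LETTER SHARP S"))
```

Storing ranges as (first, last) pairs keeps the in-memory table compact, since most of the code space falls into long runs with the same status.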
Each of the following criteria must be satisfied for a label to be valid:
In addition, the label should meet the requirements for right-to-left characters specified in the Bidi document of [IDNA2008].
Any particular application may have tighter validity criteria, as discussed in Section 3, Conformance.
The following describes the derivation of the mapping table. Step 1 defines a base mapping value; Steps 2-4 define three sets of characters. These are all used in Step 5 to produce the mapping and status values for the table.
If a Unicode property were to change in a future version in a way that would affect backwards compatibility, a grandfathering clause will be added to maintain compatibility. For more information on compatibility, see Section 5, IDNA Mapping Table.
Step 1: Produce a base mapping value
Step 2: Specify the base valid set
The base valid set is defined by the sequential list of additions and subtractions in Table 3, Base Valid Set. This definition is based on the principles of IDNA2003. When applied to the repertoire of Unicode 3.2 characters, this produces a set which is closely aligned with IDNA2003.
Step 3: Specify the base exclusion set
The exclusion set consists of characters that have a different mapping in IDNA2003 than the base mapping value specified in Step 1. For more information, see the [IDN-FAQ]. For this version, the exclusion set consists of the following:
Step 4: Specify the deviation set
This is the set of characters that deviate between IDNA2003 and IDNA2008.
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
Step 5: Produce the status and mapping values for the table
For each code point:
Note that characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP fall under (2a).
Table 4, IDNA Comparisons illustrates the differences between the three specifications in terms of valid character repertoire. It omits all code points unassigned in Unicode 5.2, as well as the ASCII-repertoire code points, because the specifications treat all of those identically. The table has separate groupings for Unicode 3.2 (the only characters valid in IDNA2003) and beyond. It also separates buckets where UTS46 and IDNA2008 behave the same from those where they behave differently.
Each row in the table defines a bucket of code points that share a pattern of behavior across the three specifications. The columns provide the following information:
[Review Note: The table will need to be regenerated for Unicode 5.2 and for any changes that are made in IDNA2008 before it goes final. It may also need some tweaking for label separators.]
For their contributions of ideas or text to this specification, thanks to Matitiahu Allouche, Peter Constable, Craig Cummings, Martin Dürst, Peter Edberg, Deborah Goldsmith, Laurentiu Iancu, Gervase Markham, Simon Montagu, Lisa Moore, Eric Muller, Murray Sargent, Markus Scherer, Jungshik Shin, Shawn Steele, Erik van der Poel, Chris Weber, and Ken Whistler. The specification builds upon [IDNA2008], developed in the IETF Idnabis working group, especially contributions from Matitiahu Allouche, Harald Alvestrand, Vint Cerf, Martin J. Dürst, Lisa Dusseault, Patrik Fältström, Paul Hoffman, Cary Karp, John Klensin, and Peter Resnick, and also upon [IDNA2003], authored by Marc Blanchet, Adam Costello, Patrik Fältström, and Paul Hoffman.
References not listed here may be found in http://www.unicode.org/reports/tr41/#UAX41.
| [Bortzmeyer] | http://www.bortzmeyer.org/idn-et-phishing.html (machine translated at http://translate.google.com/translate?u=http%3A%2F%2Fwww.bortzmeyer.org%2Fidn-et-phishing.html) |
| [Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
| [IDNA2003] | The IDNA2003 specification is defined by a cluster of IETF RFCs: |
| [IDNA2008] | The draft IDNA2008 specification is defined by a cluster of IETF RFCs:
For more information, see http://tools.ietf.org/id/idnabis. [Review Note: Fix the references once IDNA2008 is final, and use the formal titles references in the text.] |
| [IDNA-Table] | http://www.unicode.org/Public/idna |
| [IDN-Demo] | http://unicode.org/cldr/utility/idna.jsp |
| [IDN-FAQ] | http://www.unicode.org/faq/idn.html |
| [NFKC_CaseFold] | The Unicode property specified in [UAX44], and defined by the data in DerivedNormalizationProps.txt (search for "NFKC_CaseFold"). |
| [Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
| [RFC1034] | P. Mockapetris. "DOMAIN NAMES - CONCEPTS AND FACILITIES", RFC1034, November 1987 http://tools.ietf.org/html/rfc1034 |
| [RFC3454] | P. Hoffman, M. Blanchet. "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. http://ietf.org/rfc/rfc3454.txt |
| [RFC3490] | Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. http://ietf.org/rfc/rfc3490.txt |
| [RFC3491] | Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. http://ietf.org/rfc/rfc3491.txt |
| [RFC3492] | Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. http://ietf.org/rfc/rfc3492.txt |
| [SafeBrowsing] | http://code.google.com/apis/safebrowsing/ |
| [Unicode] | The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. |
| [Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
The following summarizes modifications from the previous revisions of this document.
Revision 2
Revision 1
Copyright © 2008-2009 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.