Technical Reports |
Version | 1 (draft 2) |
Authors | Mark Davis (markdavis@google.com), Michel Suignard |
Date | 2009-08-06 |
This Version | http://www.unicode.org/reports/tr46/tr46-1.html |
Previous Version | n/a |
Latest Version | http://www.unicode.org/reports/tr46/ |
Revision | 1 |
This document provides a specification for processing that ensures compatibility between older and newer versions of internationalized domain names (IDN). It allows applications (browsers, emailers, and so on) to handle both the original version of internationalized domain names (IDNA2003) and the newer version (IDNA2008), avoiding possible interoperability and security problems.
[Review Note: At this point, IDNA2008 is still in development, so this draft may change as that draft changes. The following is a substantial reorganization of the former draft; the changes are not tracked with yellow highlighting. The text is rough (not yet wordsmithed or copyedited), and the references need to be added (and linked).]
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
One of the great strengths of domain names is universality. With http://Apple.com, you can get to Apple's website no matter where you are in the world, and no matter which browser you are using. With markdavis@google.com, you can send an email to the author of this specification, no matter which country you are in, and no matter which emailer you are using.
Initially, domain names were restricted to handling only ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one had to use only Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick the ones corresponding to the desired English letters. One would have to guess at the spelling of particular words, because there are no exact matches between scripts. A large majority of the world's population faced this situation because their languages use non-ASCII characters.
In 2003, a system called IDNA2003 was put in place for internationalized domain names (IDNs). This system allows non-ASCII Unicode characters: both characters from different scripts such as Greek, Cyrillic, Tamil, or Korean, and non-ASCII Latin characters such as Å, Ħ, or Þ. The mechanism basically involves (a) transforming (mapping) the string to remove case and other variant differences, (b) checking for validity, and (c) transforming the Unicode characters using a specialized encoding called Punycode. For example, one can now type "http://Bücher.de" into the address bar of any modern browser, and it will go to the corresponding site, even though the "ü" is not an ASCII character. In this case, the Punycode value actually used for the domain name on the wire is "http://xn--bcher-kva.de". When received from the DNS system, the Punycode version is transformed back into Unicode form for display; the result is the mapped version, so in this example we get: "Bücher.de" → "xn--bcher-kva.de" → "bücher.de".
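Python's built-in "idna" codec implements the IDNA2003 pipeline (Nameprep plus Punycode), so this round trip can be observed directly:

```python
# The "idna" codec applies the IDNA2003 mapping (e.g. lowercasing)
# and then Punycode-encodes each non-ASCII label.
wire = 'Bücher.de'.encode('idna')
print(wire)               # b'xn--bcher-kva.de'

# Decoding reverses the Punycode step, yielding the mapped form.
display = wire.decode('idna')
print(display)            # bücher.de
```

Note that the round trip returns the mapped (lowercased) form, not the original mixed-case input, exactly as described above.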
The IDNA2003 specification is defined by a cluster of IETF RFCs: the IDNA base specification [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep [RFC3454].
There is a new version of IDNA called IDNA2008 (the "2008" does not reflect the date of approval, which is expected to be late 2009). For the most common cases, IDNA2003 and IDNA2008 behave identically. Both map a user-visible Unicode form of a URL (like http://öbb.at) to a transformed version with only ASCII characters that is actually sent over the wire, the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain backwards compatibility with IDNA2003. The main differences between the two are:
The Deviations and Unpredictables in IDNA2008 may cause both interoperability and security problems. They also affect extremely common characters: all uppercase characters, all variant-width characters (in common use in Japan, China, and Korea), and certain other common characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA). The following provides more background for understanding these issues.
IDNA2003 requires a mapping phase, which maps http://ÖBB.at to http://öbb.at (for example). Mapping typically involves mapping uppercase characters to their lowercase counterparts, but it also involves other mappings between equivalent characters, such as mapping half-width katakana characters to normal (full-width) katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both http://CNN.com and http://cnn.com work identically. They would not expect the addition of an accent to make a difference: they expect that if http://Bruder.com is the same as http://bruder.com, then of course http://Brüder.com is the same as http://brüder.com. Other scripts have variants that are similar to case in this respect. The IDNA2003 mapping is based on data specified by Unicode: what later became the Unicode property NFKC_CaseFold.
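A rough sketch of this mapping can be built from the standard library; this approximates NFKC_CaseFold as case folding plus compatibility normalization, iterated to a fixed point (the real property is defined by Unicode data tables, so treat this as illustrative only):

```python
import unicodedata

def nfkc_casefold(s):
    # Approximation of the NFKC_CaseFold operation underlying the
    # IDNA2003 mapping: case fold, then NFKC-normalize, repeating
    # until the string stops changing.
    prev = None
    while s != prev:
        prev = s
        s = unicodedata.normalize('NFKC', s.casefold())
    return s

print(nfkc_casefold('ÖBB'))  # öbb  (case mapping)
print(nfkc_casefold('ｶ'))    # カ   (half-width to full-width katakana)
print(nfkc_casefold('ß'))    # ss   (one of the Deviations discussed below)
```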
There are a few situations where IDNA2008-Strict will always result in the resolution of IDNs to different IP addresses than in IDNA2003. This affects a relatively small number of characters, but some that are relatively common in particular languages and will affect a significant number of strings in those languages. (For more information on why IDNA2003 does this, see the FAQ.) These are referred to as "Deviations"; the significant ones are listed below.
Code | Character | IDNA2008 | IDNA2003 | Example: IDNA2008 | Example: IDNA2003 |
U+00DF | ß | ß | ss | http://faß.de | http://fass.de |
U+03C2 | ς | ς | σ | http://βόλος.com | http://βόλοσ.com |
U+200D | ZWJ | ZWJ | delete | [TBD] | [TBD] |
U+200C | ZWNJ | ZWNJ | delete | [TBD] | [TBD] |
These differences allow for security exploits. Consider http://www.sparkasse-gießen.de, which is for the "Gießen Savings and Loan".
Alice ends up at the phishing site, supplies her bank password, and is robbed. While DENIC might have a policy about bundling all of the variants of ß together (so that they all have the same owner), this is not required of registries. It is quite unlikely that all registries will have or enforce such a policy.
There are two Deviations of particular concern. IDNA2008 allows ZWJ and ZWNJ characters in labels; these were removed by the IDNA2003 mapping. Beyond mapping differently, they represent a special security concern because they are normally invisible: the sequence "a<ZWJ>b" looks just like "ab". IDNA2008 does provide a special category for characters like this (called CONTEXTJ), and only permits them in certain contexts (certain sequences of Arabic or Indic characters, for example). However, lookup applications are not required to check for these contexts, so overall security depends on registries' having correct implementations. Moreover, those context restrictions do not catch all confusables, and applications are not required to apply any checks whatsoever (context or validity) on so-called A-labels.
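The invisibility problem is easy to demonstrate: two strings that render identically compare unequal unless the joiners are deleted. A minimal sketch of the IDNA2003-style deletion (Nameprep maps these characters to nothing):

```python
# ZWNJ, ZWJ, and SOFT HYPHEN are among the characters that IDNA2003's
# Nameprep maps to nothing, precisely because they are invisible.
INVISIBLE = {'\u200C', '\u200D', '\u00AD'}

def delete_invisible(s):
    return ''.join(ch for ch in s if ch not in INVISIBLE)

spoof = 'a\u200Db'                       # renders like "ab"
print(spoof == 'ab')                     # False: distinct strings
print(delete_invisible(spoof) == 'ab')   # True after the mapping
```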
IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping") with no limitations on what the mapping can do to disallowed characters (including even ASCII uppercase characters, if they occur in an IDN). For more information on the permitted mappings, see Section 4.3 and Section 5.3 in [Protocol]. IDNA2008 implementations can thus be grouped into four main categories, given in the table below.
Category | Description | Comments |
Strict | No mapping | Thus rejecting http://ÖBB.at but permitting http://öbb.at |
Hybrid | Map as in IDNA2003 & disallow symbols | Uses Unicode NFKC_CaseFold. Thus it will allow http://ÖBB.at, mapping it to http://öbb.at. |
Compatible | Map as in IDNA2003 & allow symbols | Same as Hybrid, except that it also allows IDNs like http://√.com. (See Subtractions.) |
Custom | Non-standard mapping | Arbitrary other mappings, as allowed in the current draft of IDNA2008. |
For more information on dealing with confusables, see UTR#36: Unicode Security Considerations [UTR36].
To allow applications to work around the incompatibilities between these two specifications, this document provides a standardized preprocessing that allows conformant implementations to minimize, to the extent possible, the problems caused by the differences between IDNA2003 and IDNA2008.
The requirements for conformance on implementations of the Unicode IDNA Compatible Preprocessing are as follows:
C1 | Given a version of Unicode, a Unicode String, a Mapping Mode, and a Validity Mode, a conformant implementation of Unicode IDNA Compatible Preprocessing (UICP) shall replicate the results given by applying the algorithm specified by Section 3, Preprocessing. |
C2 | Both conformant Hybrid IDNA and Compatible IDNA implementations first apply UICP with the Strict mapping mode and the Strict validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
C3 | If the first lookup fails, a Hybrid IDNA implementation applies UICP with the Lenient mapping mode and the Strict validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
C4 | If the first lookup fails, a Compatible IDNA implementation applies UICP with the Lenient mapping mode and the Lenient validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
The UICP used in C2 (Strict mapping and validity) is quite close to IDNA2008. The difference is that there is a standardized mapping that is as compatible with IDNA2003 as possible (while preserving IDNA2008 label validity testing). The UICP used in C4 (Lenient mapping and validity) is quite close to IDNA2003 (but extended to be Unicode-version-independent). The UICP used in C3 is between those two. The transformation to Punycode is applied label-by-label, and only to labels that contain non-ASCII characters.
Importantly, neither the Hybrid nor Compatible implementations can prevent the security and interoperability problems caused by Deviations in IDNA2008. They do prevent the security and interoperability problems caused by the Unpredictables.
[Review Note: A possible alternative for preventing Deviation problems would be adding the following:
An implementation must map the Deviations according to Unicode NFKC_CaseFold unless the registry for the domain name is trusted. A trusted registry is one that complies with this specification and bundles all allowed Deviations with their mappings.
- For example, http://www.sparkasse-gießen.de (if the registry for "de" bundles Deviations) would be unaltered, but http://www.sparkasse-gießen.com would be mapped to http://www.sparkasse-giessen.com (if the "com" registry does not bundle Deviations) before any lookup. Note that this also applies to lower-level registries: the URL http://www.sparkasse-gießen.blogspot.de would be remapped to http://www.sparkasse-giessen.blogspot.de unless the registry for "blogspot.de" is trusted.
Incorporation of this policy would require other changes to the rest of this document.]
Note: To meet user expectations, it is recommended that when converting strings from Punycode back to Unicode, a U+03C3 GREEK SMALL LETTER SIGMA that is final (with a letter before and none after) be converted to U+03C2 GREEK SMALL LETTER FINAL SIGMA.
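This recommendation can be sketched with a regular expression; the character class used here for "letter" (`[^\W\d_]`, word characters minus digits and underscore) is an approximation, not the normative definition:

```python
import re

def restore_final_sigma(label):
    # Map U+03C3 (σ) to U+03C2 (ς) when it is final: a letter before
    # it and no letter after it. The letter test is approximate.
    return re.sub(r'(?<=[^\W\d_])\u03C3(?![^\W\d_])', '\u03C2', label)

print(restore_final_sigma('βόλοσ'))  # βόλος
```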
These algorithms are logical specifications, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as the result generated by the logical algorithm. For example, there is no need for a second lookup in C3 or C4 if the second transformed string is the same as the first. In fact, an optimized implementation can do a single pass, generating the C2 mapping and the C3/C4 mapping (if different) at the same time.
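The C2/C4 sequence for a Compatible implementation, including the skip-redundant-lookup optimization just described, can be sketched as follows. The helpers `uicp`, `to_punycode`, and `dns_lookup` are hypothetical stand-ins for the real preprocessing, encoding, and resolver steps:

```python
def compatible_lookup(domain, uicp, to_punycode, dns_lookup):
    # C2: Strict mapping and Strict validity first.
    # uicp() returns the mapped string, or None on error.
    first = uicp(domain, mapping='Strict', validity='Strict')
    if first is not None:
        found = dns_lookup(to_punycode(first))
        if found is not None:
            return found
    # C4: on failure, retry with Lenient mapping and validity.
    second = uicp(domain, mapping='Lenient', validity='Lenient')
    if second is not None and second != first:
        # No second lookup needed if the string is unchanged.
        return dns_lookup(to_punycode(second))
    return None
```

A Hybrid implementation (C3) differs only in keeping Strict validity on the second pass.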
The inputs to the preprocessing are
Preparation of the input domain_name string may have involved converting escapes in an original domain name string to Unicode code points as necessary, depending on the environment in which it is being used. For example, this can include converting:
U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341
U+002E ( . ) FULL STOP
The following series of steps, performed in order, transforms the input domain_name string. The input domain_name is successively altered during the application of these steps. The output of this preprocessing is also a Unicode string. The preprocessing is idempotent—applying the preprocessing again to the output will make no further changes. Where the preprocessing results in an "abort with error", the processing fails and the input string is invalid.
U+002E ( . ) FULL STOP as the label delimiter.
Note that the Split processing matches what is commonly done with label delimiters by browsers, whereby characters whose NFKC forms contain periods are transformed into NFKC format before labels are separated. Some of these characters are effectively forbidden, because they would result in a sequence of two periods, and thus empty labels. The exact list of characters can be seen with the Unicode utilities using a regular expression:
However, if the mapping mode is Strict, the only characters in the original string that represent label separators will in fact be the ASCII periods.
Note also that some browsers allow characters like "_" in domain names. Any such treatment is outside of the scope of this document.
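A sketch of the Split step for the non-Strict case, using the four characters that IDNA2003 treats as label delimiters (in Strict mode only the ASCII full stop would apply):

```python
# FULL STOP plus the three dots that IDNA2003 treats as equivalent:
# IDEOGRAPHIC FULL STOP, FULLWIDTH FULL STOP, HALFWIDTH IDEOGRAPHIC
# FULL STOP.
SEPARATORS = '\u3002\uFF0E\uFF61'

def split_labels(domain):
    # Fold the alternate separators to '.' and split into labels.
    for sep in SEPARATORS:
        domain = domain.replace(sep, '.')
    return domain.split('.')

print(split_labels('öbb。at'))  # ['öbb', 'at']
```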
The following characters are the only ones allowed in the respective Modes. The sets are defined by properties according to the syntax of UTS#18: Unicode Regular Expressions [UTS18] (with additional "+" signs added for clarity).
[Review Note: The sets should be made into table formats, with explanatory comments on each line. Before release, the formulations will be tested against IDNA2008 to assure that the characters match.]
The following defines the set of allowed characters in Strict mode. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters with rules defined by [IDNA2008-Tables].
[
[:^changes_under_nfkc_casefold:]
- [:c:] - [:z:] - [:s:] - [:p:] - [:nl:] - [:no:] - [:me:]
- [:HST=L:] - [:HST=V:] - [:HST=T:]
- [:block=Combining_Diacritical_Marks_For_Symbols:]
- [:block=Musical_Symbols:]
- [:block=Ancient_Greek_Musical_Notation:]
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [:JoinControl:]
+ [\u00DF \u03C2 \u06FD \u06FE \u0F0B \u3007]
+ [\u002D \u00B7 \u0375 \u05F3 \u05F4 \u30FB]
]
The following defines the set of allowed characters in Lenient mode. These correspond to the characters that can occur in the output of IDNA2003.
[
[:^changes_under_nfkc_casefold:]
- [:c:] - [:z:]
- [:Block=Ideographic_Description_Characters:]
- [:ascii:] - [\u1806 \uFFFC \uFFFD]
+ [A-Za-z0-9\-]
]
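As a very rough illustration of the Lenient set (not the normative definition: this sketch checks only the general-category exclusions and LDH ASCII, ignoring NFKC_CaseFold stability and the listed block and character exceptions):

```python
import unicodedata

def lenient_allowed(ch):
    # ASCII is restricted to LDH: letters, digits, hyphen.
    if ch.isascii():
        return ch.isalnum() or ch == '-'
    # Non-ASCII: exclude Control (C*) and Separator (Z*) categories.
    return unicodedata.category(ch)[0] not in ('C', 'Z')

print(lenient_allowed('ß'))       # True: valid in IDNA2003 output
print(lenient_allowed('_'))       # False: not LDH ASCII
print(lenient_allowed('\u200C'))  # False: ZWNJ is category Cf
```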
There are three versions of the IDNA Mapping Table, according to the desired mapping and validity modes.
Apply all and only those mappings from the Unicode Property [NFKC_CaseFold], where
These two conditions allow for the entire domain name to be mapped, and yet maintain compatibility with IDNA2008.
Apply all the mappings from the Unicode Property [NFKC_CaseFold], where
In addition, map the U+3002 ( 。 ) IDEOGRAPHIC FULL STOP and any compatibility equivalents of it to the character U+002E ( . ) FULL STOP.
For Unicode 5.2 and before, those additions consist of exactly two characters:
Using Lenient validity mode conditions provides for the most compatibility with IDNA2003, while the Strict validity mode excludes symbols.
While the above describes the generation of the mapping tables, the normative values are supplied in the linked data files. For each version of Unicode there will be an updated version of this table: implementations never need to actually use the above algorithm for generating the tables; they can just use the data from the tables in the Preprocessing algorithm. Future versions are guaranteed to be as compatible as possible (that is, subject to possible incompatible changes in the IETF definition of IDNA).
[Review Note: A full list of the mappings for each mode will be maintained and linked from this document.]
There are two versions of the Validity testing, according to the corresponding validity Mode.
Each of the following criteria is required for Strict Validity:
[Review Note: Once IDNA2008 is final, the exact specifications can be substituted for the last two bullets, making the above self-contained.]
These conditions, together with the mappings, are slightly stronger than the conditions required on lookup in [IDNA2008-Protocol].
[Review Note: Add other required validity checking for IDNA2003 and IDNA2008, and a recommendation to always apply the IDNA2008 BIDI restrictions.]
[Review Note: The intent is to add conformance test files linked from here, so that implementations can test their implementations against a set of data.]
The worst of all possible cases is an IDNA2008-Custom implementation. Unfortunately, there appears to be no good way to prevent security problems with IDNA2008 Custom implementations, because it is impossible to anticipate what such implementations would do. Such an implementation is not limited to just the above four Deviations for exploits—it could remap even characters like "A" or "B" to an arbitrary other character (or sequence). Because there is no way to predict what it will do, there are no effective countermeasures.
Clients such as search engines have another practical issue facing them. They will probably opt for Compatible, allowing all valid IDNA2003 characters so that they can access all of the web. Normally they also need to canonicalize URLs, so that they can determine when two URLs are actually the same. For IDNA2003 this was straightforward. For Hybrid/Compatible implementations, the canonicalization can result in two different possibilities (depending on the mapping), and two lookups have to be performed in order to resolve them. However, the success of those lookups may change over the time period in which the URL is stored, so this solution is not completely robust, and involves many complications in the search pipeline.
Whatever approach is taken, IDNA2008 does not make any appreciable difference in reducing problems with visually confusable characters (so-called homographs). Thus programmers still need to be aware of those issues as detailed in UTR#36: Unicode Security Considerations [UTR36]; mechanisms for detecting potentially visually confusable characters are found in the associated UTS#39: Unicode Security Mechanisms [UTS39].
Because of the indeterminacies it introduces, which can cause security problems, the Custom variant is strongly discouraged. To maintain compatibility, it is anticipated that few implementations will opt for the Strict variant; that is, most would implement either Hybrid or Compatible in the near term. Once sufficiently many high-level registries disallow symbols, the Compatible implementations could probably move towards Hybrid. It is unclear when, if ever, it would be reasonable for those implementations to move to being Strict.
[Review Note: This material is probably best moved to the Unicode FAQ, and just referenced from here. It is included for review in case any of the material should stay here.]
A. Here is a table that illustrates the differences, where 2003 is the current behavior.
 | 2003 | Compatible | Hybrid | Strict | Custom | Comments |
http://öbb.at | Yes | Yes | Yes | Yes | Yes | Simple characters |
http://ÖBB.at | Yes | Yes | Yes | No | ? | Case mapping |
http://√.com | Yes | Yes | No | No | ? | Symbol |
http://faß.de | Yes | Yes* | Yes* | Yes* | Yes* | Special (different IP address) |
http://ԛәлп.com | No | Yes | Yes | Yes | Yes | New in Unicode 5.1: U+051B (ԛ) CYRILLIC SMALL LETTER QA |
[Review Note: It is probably not worth listing the advantages and disadvantages of IDNA2008]
A. The main advantages are:
A. No. The exploits don't require unscrupulous registries; they only require that registries not police every URL they register for possible spoofing behavior.
The custom mappings matter to security, because entering the same URL on two different browsers may go to two different IP addresses (whenever the two browsers have different custom mappings). The same thing could happen within an emailer that is parsing for URLs, and then opening a browser. And for that matter, there is nothing in the spec that prevents two different browsers from applying those custom mappings to URLs within a page, eg to an href="...".
A. This is to provide full case insensitivity, following the Unicode Standard. These characters are anomalous: the uppercase of ς is Σ, the same as the uppercase of σ. Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς are not treated as case variants, there wouldn't be a match between ΒόλοΣ and Βόλος.
Similarly, the standard uppercase of ß is "SS", the same as the uppercase of "ss". Note, for example, that on http://www.uni-giessen.de, Gießen is spelled with ß, but in the top left corner it is spelled GIESSEN. The situation is even more complicated:
For full case insensitivity (with transitivity), {ss, ß, SS} and {σ, ς, Σ} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA2003.
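Python's `str.casefold` implements exactly this Unicode full case folding, so the equivalence classes and their chosen representatives can be checked directly:

```python
# Full case folding collapses each equivalence class to one
# representative: {ss, ß, SS} -> "ss" and {σ, ς, Σ} -> "σ".
assert 'ß'.casefold() == 'SS'.casefold() == 'ss'
assert 'ς'.casefold() == 'Σ'.casefold() == 'σ'

print('gießen'.casefold())  # giessen
print('βόλος'.casefold())   # βόλοσ
```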
ZWJ and ZWNJ are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in IDNA2003, but deleted so that they don't allow spoofs.
During the development of Unicode, ZWJ and ZWNJ were intended only for presentation; that is, they would make no difference in the semantics of a word. Thus the IDNA2003 mapping should, and does, delete them. That result, however, should never really be seen by users; it should be just a transient form used for comparison. Unfortunately, the way IDN works, this "comparison format" (with transformations of eszett and final sigma, and deleted ZWJ/ZWNJ) ends up being visible to the user.
There are words such as the name of the country of Sri Lanka, which require preservation of these joiners (in this case, ZWJ) in order to appear correct to the end users when the URL comes back from the DNS server.
A. No. The eszett and sigma are fundamentally different from I, l, and 1. With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:
goog1e.com
With the following, browsers that use IDNA2003 will go to a different location than browsers that use IDNA2008, unless the registry for xx puts into place a bundle strategy.
gießen.xx
The same goes for Greek sigma, which is a more common character in Greek than the eszett is in German.
A. It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.
It is a bit easier to maintain a bright line based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages (Japanese) require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have SONY日本.com with no problems at all—while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
The rough consensus among the working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA2008 is no different than IDNA2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to have a bright line. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.
Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country's languages. Application software also takes certain precautions: MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see UTR#36: Unicode Security Considerations [UTR36] on the Unicode web site.
A. Surprisingly, not really. It doesn't do anything about the most frequent sources of spoofing: look-alike characters that are both letters, like "paypal.com" with a Cyrillic "а". If a symbol that can be used to spoof a letter X is removed, but another letter that can spoof X is retained, there is no net benefit. Weighted by frequency, according to data at Google, the removal of symbols and punctuation in IDNA2008 reduces opportunities for spoofing by only about 0.000016%. In another Google study of 1B web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 MULTIPLICATION SIGN, by far the most common confusable symbol); but that page could be even better spoofed with х (U+0445 CYRILLIC SMALL LETTER HA), which normally has precisely the same displayed shape as "x".
There is a very significant security loophole in that IDNA2008 does not require a client to do any checks whatsoever on a Punycode version, such as "http://xn--iny-zx5a.com" (which contains a symbol). That is, a conformant browser "MAY" do those checks, but doesn't have to. Any such browser is completely dependent on the registry's being safe.
IDNA2003 was quite clear: it specified a standardized mapping and required checks on Punycode versions. IDNA2008 does neither, and introduces Deviations on top of that.
[TBD].
[TBD].
(http://tools.ietf.org/id/idnabis)
The following summarizes modifications from the previous revisions of this document.
Version 1
Copyright © 2009 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.