Technical Reports |
Version | 1 (draft 2) |
Authors | Mark Davis (markdavis@google.com), Michel Suignard |
Date | 2009-08-06 |
This Version | http://www.unicode.org/reports/tr46/tr46-1.html |
Previous Version | n/a |
Latest Version | http://www.unicode.org/reports/tr46/ |
Revision | 1 |
This document provides a specification for processing that ensures compatibility between older and newer versions of internationalized domain names (IDN). It allows applications (browsers, emailers, and so on) to handle both the original version of internationalized domain names (IDNA2003) and the newer version (IDNA2008), avoiding possible interoperability and security problems.
[Review Note: At this point, IDNA2008 is still in development, so this draft may change as that draft changes. The following is a substantial reorganization of the former draft; the changes are not tracked with yellow highlighting. The text is rough (not yet wordsmithed or copyedited), and the references need to be added (and linked).]
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
One of the great strengths of domain names is universality. With http://Apple.com, you can get to Apple's website no matter where you are in the world, and no matter which browser you are using. With markdavis@google.com, you can send an email to the author of this specification, no matter which country you are in, and no matter which emailer you are using.
Initially, domain names were restricted to handling only ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one had to use only Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick the ones corresponding to the desired English letters. One would have to guess at the spelling of particular words, because there are no exact matches between scripts. A large majority of the world's population faced this situation because their languages use non-ASCII characters.
In 2003, a system called IDNA2003 was put in place for internationalized domain names (IDNs). This system allows non-ASCII Unicode characters: both characters from different scripts such as Greek, Cyrillic, Tamil, or Korean, and non-ASCII Latin characters such as Å, Ħ, or Þ. The mechanism basically involves (a) transforming (mapping) the string to remove case and other variant differences, (b) checking for validity, and (c) transforming the Unicode characters using a specialized encoding called Punycode. For example, one can now type "http://Bücher.de" into the address bar of any modern browser, and it will go to the corresponding site, even though the "ü" is not an ASCII character. In this case, the Punycode value actually used for the domain name on the wire is "http://xn--bcher-kva.de". When received from the DNS system, the Punycode version is transformed back into Unicode form for display; the result is the mapped version, so in this example we get: "Bücher.de" → "xn--bcher-kva.de" → "bücher.de".
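Python's built-in "idna" codec implements the IDNA2003 pipeline (Nameprep plus Punycode), so this round trip can be observed directly:

```python
# The "idna" codec applies the IDNA2003 mapping (e.g. lowercasing)
# and then Punycode-encodes each non-ASCII label.
wire = 'Bücher.de'.encode('idna')
print(wire)               # b'xn--bcher-kva.de'

# Decoding reverses the Punycode step, yielding the mapped form.
display = wire.decode('idna')
print(display)            # bücher.de
```

Note that the round trip returns the mapped (lowercased) form, not the original mixed-case input, exactly as described above.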
The IDNA2003 specification is defined by a cluster of IETF RFCs: the IDNA base specification [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep [RFC3454].
There is a new version of IDNA called IDNA2008 (the "2008" does not reflect the date of approval, which is expected to be late 2009). For the most common cases, IDNA2003 and IDNA2008 behave identically. Both map a user-visible Unicode form of a URL (like http://öbb.at) to a transformed version with only ASCII characters that is actually sent over the wire, the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain backwards compatibility with IDNA2003. The main differences between the two are:
The Deviations and Unpredictables in IDNA2008 may cause both interoperability and security problems. They also affect extremely common characters: all uppercase characters, all variant-width characters (in common use in Japan, China, and Korea), and certain other common characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA). The following provides more background for understanding these issues.
IDNA2003 requires a mapping phase, which maps http://ÖBB.at to http://öbb.at (for example). Mapping typically involves mapping uppercase characters to their lowercase counterparts, but it also involves other mappings between equivalent characters, such as mapping half-width katakana characters to normal (full-width) katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both http://CNN.com and http://cnn.com work identically. They would not expect the addition of an accent to make a difference: they expect that if http://Bruder.com is the same as http://bruder.com, then of course http://Brüder.com is the same as http://brüder.com. Other scripts have variants that are similar to case in this respect. The IDNA2003 mapping is based on data specified by Unicode: what later became the Unicode property NFKC_CaseFold.
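A rough sketch of this mapping can be built from the standard library; this approximates NFKC_CaseFold as case folding plus compatibility normalization, iterated to a fixed point (the real property is defined by Unicode data tables, so treat this as illustrative only):

```python
import unicodedata

def nfkc_casefold(s):
    # Approximation of the NFKC_CaseFold operation underlying the
    # IDNA2003 mapping: case fold, then NFKC-normalize, repeating
    # until the string stops changing.
    prev = None
    while s != prev:
        prev = s
        s = unicodedata.normalize('NFKC', s.casefold())
    return s

print(nfkc_casefold('ÖBB'))  # öbb  (case mapping)
print(nfkc_casefold('ｶ'))    # カ   (half-width to full-width katakana)
print(nfkc_casefold('ß'))    # ss   (one of the Deviations discussed below)
```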
There are a few situations where IDNA2008-Strict will always result in the resolution of IDNs to different IP addresses than in IDNA2003. This affects a relatively small number of characters, but some that are relatively common in particular languages and will affect a significant number of strings in those languages. (For more information on why IDNA2003 does this, see the FAQ.) These are referred to as "Deviations"; the significant ones are listed below.
Code | Character | IDNA2008 | IDNA2003 | Example: IDNA2008 | Example: IDNA2003 |
U+00DF | ß | ß | ss | http://faß.de | http://fass.de |
U+03C2 | ς | ς | σ | http://βόλος.com | http://βόλοσ.com |
U+200D | ZWJ | ZWJ | delete | [TBD] | [TBD] |
U+200C | ZWNJ | ZWNJ | delete | [TBD] | [TBD] |
These differences allow for security exploits. Consider http://www.sparkasse-gießen.de, which is for the "Gießen Savings and Loan".
Alice ends up at the phishing site, supplies her bank password, and is robbed. While DENIC might have a policy about bundling all of the variants of ß together (so that they all have the same owner), this is not required of registries. It is quite unlikely that all registries will have or enforce such a policy.
There are two Deviations of particular concern. IDNA2008 allows ZWJ and ZWNJ characters in labels; these were removed by the IDNA2003 mapping. Beyond mapping differently, they represent a special security concern because they are normally invisible: the sequence "a<ZWJ>b" looks just like "ab". IDNA2008 does provide a special category for characters like this (called CONTEXTJ), and only permits them in certain contexts (certain sequences of Arabic or Indic characters, for example). However, lookup applications are not required to check for these contexts, so overall security depends on registries' having correct implementations. Moreover, those context restrictions do not catch all confusables, and applications are not required to apply any checks whatsoever (context or validity) on so-called A-labels.
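The invisibility problem is easy to demonstrate: two strings that render identically compare unequal unless the joiners are deleted. A minimal sketch of the IDNA2003-style deletion (Nameprep maps these characters to nothing):

```python
# ZWNJ, ZWJ, and SOFT HYPHEN are among the characters that IDNA2003's
# Nameprep maps to nothing, precisely because they are invisible.
INVISIBLE = {'\u200C', '\u200D', '\u00AD'}

def delete_invisible(s):
    return ''.join(ch for ch in s if ch not in INVISIBLE)

spoof = 'a\u200Db'                       # renders like "ab"
print(spoof == 'ab')                     # False: distinct strings
print(delete_invisible(spoof) == 'ab')   # True after the mapping
```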
IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping") with no limitations on what the mapping can do to disallowed characters (including even ASCII uppercase characters, if they occur in an IDN). For more information on the permitted mappings, see Section 4.3 and Section 5.3 in [Protocol]. IDNA2008 implementations can thus be grouped into four main categories, given in the table below.
Category | Description | Comments |
Strict | No mapping | Thus rejecting http://ÖBB.at but permitting http://öbb.at |
Hybrid | Map as in IDNA2003 & disallow symbols | Uses Unicode NFKC_CaseFold. Thus it will allow http://ÖBB.at, mapping it to http://öbb.at. |
Compatible | Map as in IDNA2003 & allow symbols | Same as Hybrid, except that it also allows IDNs like http://√.com. (See Subtractions.) |
Custom | Non-standard mapping | Arbitrary other mappings, as allowed in the current draft of IDNA2008. |
For more information on dealing with confusables, see UTR#36: Unicode Security Considerations [UTR36].
To allow applications to work around the incompatibilities between these two specifications, this document provides a standardized preprocessing that allows conformant implementations to minimize, to the extent possible, the problems caused by the differences between IDNA2003 and IDNA2008.
The requirements for conformance on implementations of the Unicode IDNA Compatible Preprocessing are as follows:
C1 | Given a version of Unicode, a Unicode String, a Mapping Mode, and a Validity Mode, a conformant implementation of Unicode IDNA Compatible Preprocessing (UICP) shall replicate the results given by applying the algorithm specified by Section 3, Preprocessing. |
C2 | Both conformant Hybrid IDNA and Compatible IDNA implementations first apply UICP with the Strict mapping mode and the Strict validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
C3 | If the first lookup fails, a Hybrid IDNA implementation applies UICP with the Lenient mapping mode and the Strict validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
C4 | If the first lookup fails, a Compatible IDNA implementation applies UICP with the Lenient mapping mode and the Lenient validity mode. If there is no error, the resulting string is transformed to Punycode (label-by-label) and a DNS lookup is performed. |
The UICP used in C2 (Strict mapping and validity) is quite close to IDNA2008. The difference is that there is a standardized mapping that is as compatible with IDNA2003 as possible (while preserving IDNA2008 label validity testing). The UICP used in C4 (Lenient mapping and validity) is quite close to IDNA2003 (but extended to be Unicode-version-independent). The UICP used in C3 is between those two. The transformation to Punycode is applied label-by-label, and only to labels that contain non-ASCII characters.
Importantly, neither the Hybrid nor Compatible implementations can prevent the security and interoperability problems caused by Deviations in IDNA2008. They do prevent the security and interoperability problems caused by the Unpredictables.
[Review Note: A possible alternative for preventing Deviation problems would be adding the following:
An implementation must map the Deviations according to Unicode NFKC_CaseFold unless the registry for the domain name is trusted. A trusted registry is one that complies with this specification and bundles all allowed Deviations with their mappings.
- For example, http://www.sparkasse-gießen.de (if the registry for "de" bundles Deviations) would be unaltered, but http://www.sparkasse-gießen.com would be mapped to http://www.sparkasse-giessen.com (if the "com" registry does not bundle Deviations) before any lookup. Note that this also applies to lower-level registries: the URL http://www.sparkasse-gießen.blogspot.de would be remapped to http://www.sparkasse-giessen.blogspot.de unless the registry for "blogspot.de" is trusted.
Incorporation of this policy would require other changes to the rest of this document.]
Note: To meet user expectations, it is recommended that when converting strings from Punycode back to Unicode, a U+03C3 GREEK SMALL LETTER SIGMA that is final (with a letter before and none after) be converted to U+03C2 GREEK SMALL LETTER FINAL SIGMA.
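This recommendation can be sketched with a regular expression; the character class used here for "letter" (`[^\W\d_]`, word characters minus digits and underscore) is an approximation, not the normative definition:

```python
import re

def restore_final_sigma(label):
    # Map U+03C3 (σ) to U+03C2 (ς) when it is final: a letter before
    # it and no letter after it. The letter test is approximate.
    return re.sub(r'(?<=[^\W\d_])\u03C3(?![^\W\d_])', '\u03C2', label)

print(restore_final_sigma('βόλοσ'))  # βόλος
```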
These algorithms are logical specifications, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as the result generated by the logical algorithm. For example, there is no need for a second lookup in C3 or C4 if the second transformed string is the same as the first. In fact, an optimized implementation can do a single pass, generating the C2 mapping and the C3/C4 mapping (if different) at the same time.
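The C2/C4 sequence for a Compatible implementation, including the skip-redundant-lookup optimization just described, can be sketched as follows. The helpers `uicp`, `to_punycode`, and `dns_lookup` are hypothetical stand-ins for the real preprocessing, encoding, and resolver steps:

```python
def compatible_lookup(domain, uicp, to_punycode, dns_lookup):
    # C2: Strict mapping and Strict validity first.
    # uicp() returns the mapped string, or None on error.
    first = uicp(domain, mapping='Strict', validity='Strict')
    if first is not None:
        found = dns_lookup(to_punycode(first))
        if found is not None:
            return found
    # C4: on failure, retry with Lenient mapping and validity.
    second = uicp(domain, mapping='Lenient', validity='Lenient')
    if second is not None and second != first:
        # No second lookup needed if the string is unchanged.
        return dns_lookup(to_punycode(second))
    return None
```

A Hybrid implementation (C3) differs only in keeping Strict validity on the second pass.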
The inputs to the preprocessing are
Preparation of the input domain_name string may have involved converting escapes in an original domain name string to Unicode code points as necessary, depending on the environment in which it is being used. For example, this can include converting:
U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341
U+002E ( . ) FULL STOP
The following series of steps, performed in order, transforms the input domain_name string. The input domain_name is successively altered during the application of these steps. The output of this preprocessing is also a Unicode string. The preprocessing is idempotent—applying the preprocessing again to the output will make no further changes. Where the preprocessing results in an "abort with error", the processing fails and the input string is invalid.
U+002E ( . ) FULL STOP as the label delimiter.
Note that the Split processing matches what is commonly done with label delimiters by browsers, whereby characters whose NFKC forms contain periods are transformed into NFKC format before labels are separated. Some of these characters are effectively forbidden, because they would result in a sequence of two periods, and thus empty labels. The exact list of characters can be seen with the Unicode utilities using a regular expression:
However, if the mapping mode is Strict, the only characters in the original string that represent label separators will in fact be the ASCII periods.
Note also that some browsers allow characters like "_" in domain names. Any such treatment is outside of the scope of this document.
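A sketch of the Split step for the non-Strict case, using the four characters that IDNA2003 treats as label delimiters (in Strict mode only the ASCII full stop would apply):

```python
# FULL STOP plus the three dots that IDNA2003 treats as equivalent:
# IDEOGRAPHIC FULL STOP, FULLWIDTH FULL STOP, HALFWIDTH IDEOGRAPHIC
# FULL STOP.
SEPARATORS = '\u3002\uFF0E\uFF61'

def split_labels(domain):
    # Fold the alternate separators to '.' and split into labels.
    for sep in SEPARATORS:
        domain = domain.replace(sep, '.')
    return domain.split('.')

print(split_labels('öbb。at'))  # ['öbb', 'at']
```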
The following characters are the only ones allowed in the respective Modes. The sets are defined by properties according to the syntax of UTS#18: Unicode Regular Expressions [UTS18] (with additional "+" signs added for clarity).
[Review Note: The sets should be made into table formats, with explanatory comments on each line. Before release, the formulations will be tested against IDNA2008 to assure that the characters match.]
The following defines the set of allowed characters in Strict mode. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters with rules defined by [IDNA2008-Tables].
[
[:^changes_under_nfkc_casefold:]
- [:c:] - [:z:] - [:s:] - [:p:] - [:nl:] - [:no:] - [:me:]
- [:HST=L:] - [:HST=V:] - [:HST=T:]
- [:block=Combining_Diacritical_Marks_For_Symbols:]
- [:block=Musical_Symbols:]
- [:block=Ancient_Greek_Musical_Notation:]
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [:JoinControl:]
+ [\u00DF \u03C2 \u06FD \u06FE \u0F0B \u3007]
+ [\u002D \u00B7 \u0375 \u05F3 \u05F4 \u30FB]
]
The following defines the set of allowed characters in Lenient mode. These correspond to the characters that can occur in the output of IDNA2003.
[
[:^changes_under_nfkc_casefold:]
- [:c:] - [:z:]
- [:Block=Ideographic_Description_Characters:]
- [:ascii:] - [\u1806 \uFFFC \uFFFD]
+ [A-Za-z0-9\-]
]
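As a very rough illustration of the Lenient set (not the normative definition: this sketch checks only the general-category exclusions and LDH ASCII, ignoring NFKC_CaseFold stability and the listed block and character exceptions):

```python
import unicodedata

def lenient_allowed(ch):
    # ASCII is restricted to LDH: letters, digits, hyphen.
    if ch.isascii():
        return ch.isalnum() or ch == '-'
    # Non-ASCII: exclude Control (C*) and Separator (Z*) categories.
    return unicodedata.category(ch)[0] not in ('C', 'Z')

print(lenient_allowed('ß'))       # True: valid in IDNA2003 output
print(lenient_allowed('_'))       # False: not LDH ASCII
print(lenient_allowed('\u200C'))  # False: ZWNJ is category Cf
```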
There are three versions of the IDNA Mapping Table, according to the desired mapping and validity modes.
Apply all and only those mappings from the Unicode Property [NFKC_CaseFold], where
These two conditions allow for the entire domain name to be mapped, and yet maintain compatibility with IDNA2008.
Apply all the mappings from the Unicode Property [NFKC_CaseFold], where
In addition, map the U+3002 ( 。 ) IDEOGRAPHIC FULL STOP and any compatibility equivalents of it to the character U+002E ( . ) FULL STOP.
For Unicode 5.2 and before, those additions consist of exactly two characters:
Using Lenient validity mode conditions provides for the most compatibility with IDNA2003, while the Strict validity mode excludes symbols.
While the above describes the generation of the mapping tables, the normative values are supplied in the linked data files. For each version of Unicode there will be an updated version of this table: implementations never need to actually use the above algorithm for generating the tables; they can just use the data from the tables in the Preprocessing algorithm. Future versions are guaranteed to be as compatible as possible (that is, subject to possible incompatible changes in the IETF definition of IDNA).
[Review Note: A full list of the mappings for each mode will be maintained and linked from this document.]
There are two versions of the Validity testing, according to the corresponding validity Mode.
Each of the following criteria is required for Strict Validity:
[Review Note: Once IDNA2008 is final, the exact specifications can be substituted for the last two bullets, making the above self-contained.]
These conditions, together with the mappings, are slightly stronger than the conditions required on lookup in [IDNA2008-Protocol].
[Review Note: Add other required validity checking for IDNA2003 and IDNA2008, and a recommendation to always apply the IDNA2008 BIDI restrictions.]
[Review Note: The intent is to add conformance test files linked from here, so that implementations can test their implementations against a set of data.]
The worst of all possible cases is an IDNA2008-Custom implementation. Unfortunately, there appears to be no good way to prevent security problems with IDNA2008 Custom implementations, because it is impossible to anticipate what such implementations would do. Such an implementation is not limited to just the above four Deviations for exploits—it could remap even characters like "A" or "B" to an arbitrary other character (or sequence). Because there is no way to predict what it will do, there are no effective countermeasures.
Clients such as search engines have another practical issue facing them. They will probably opt for Compatible, allowing all valid IDNA2003 characters so that they can access all of the web. Normally they also need to canonicalize URLs, so that they can determine when two URLs are actually the same. For IDNA2003 this was straightforward. For Hybrid/Compatible implementations, the canonicalization can result in two different possibilities (depending on the mapping), and two lookups have to be performed in order to resolve them. However, the success of those lookups may change over the time period in which the URL is stored, so this solution is not completely robust, and involves many complications in the search pipeline.
Whatever approach is taken, IDNA2008 does not make any appreciable difference in reducing problems with visually confusable characters (so-called homographs). Thus programmers still need to be aware of those issues as detailed in UTR#36: Unicode Security Considerations [UTR36]; mechanisms for detecting potentially visually confusable characters are found in the associated UTS#39: Unicode Security Mechanisms [UTS39].
Because of the indeterminacies it introduces, which can cause security problems, the Custom variant is strongly discouraged. To maintain compatibility, it is anticipated that few implementations will opt for the Strict variant; that is, most would implement either Hybrid or Compatible in the near term. Once sufficiently many high-level registries disallow symbols, the Compatible implementations could probably move towards Hybrid. It is unclear when, if ever, it would be reasonable for those implementations to move to being Strict.
[Review Note: This material is probably best moved to the Unicode FAQ, and just referenced from here. It is included for review in case any of the material should stay here.]
A. Here is a table that illustrates the differences, where 2003 is the current behavior.
 | 2003 | Compatible | Hybrid | Strict | Custom | Comments |
http://öbb.at | Yes | Yes | Yes | Yes | Yes | Simple characters |
http://ÖBB.at | Yes | Yes | Yes | No | ? | Case mapping |
http://√.com | Yes | Yes | No | No | ? | Symbol |
http://faß.de | Yes | Yes* | Yes* | Yes* | Yes* | Special (different IP address) |
http://ԛәлп.com | No | Yes | Yes | Yes | Yes | New in Unicode 5.1: U+051B (ԛ) CYRILLIC SMALL LETTER QA |
[Review Note: It is probably not worth listing the advantages and disadvantages of IDNA2008]
A. The main advantages are:
A. No. The exploits don't require unscrupulous registries; they only require that registries not police every URL they register for possible spoofing behavior.
The custom mappings matter to security, because entering the same URL on two different browsers may go to two different IP addresses (whenever the two browsers have different custom mappings). The same thing could happen within an emailer that is parsing for URLs, and then opening a browser. And for that matter, there is nothing in the spec that prevents two different browsers from applying those custom mappings to URLs within a page, eg to an href="...".
A. This is to provide full case insensitivity, following the Unicode Standard. These characters are anomalous: the uppercase of ς is Σ, the same as the uppercase of σ. Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς are not treated as case variants, there wouldn't be a match between ΒόλοΣ and Βόλος.
Similarly, the standard uppercase of ß is "SS", the same as the uppercase of "ss". Note, for example, that on http://www.uni-giessen.de, Gießen is spelled with ß, but in the top left corner it is spelled GIESSEN. The situation is even more complicated:
For full case insensitivity (with transitivity), {ss, ß, SS} and {σ, ς, Σ} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA2003.
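Python's `str.casefold` implements exactly this Unicode full case folding, so the equivalence classes and their chosen representatives can be checked directly:

```python
# Full case folding collapses each equivalence class to one
# representative: {ss, ß, SS} -> "ss" and {σ, ς, Σ} -> "σ".
assert 'ß'.casefold() == 'SS'.casefold() == 'ss'
assert 'ς'.casefold() == 'Σ'.casefold() == 'σ'

print('gießen'.casefold())  # giessen
print('βόλος'.casefold())   # βόλοσ
```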
ZWJ and ZWNJ are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in IDNA2003, but deleted so that they don't allow spoofs.
During the development of Unicode, ZWJ and ZWNJ were intended only for presentation; that is, they would make no difference in the semantics of a word. Thus the IDNA2003 mapping should, and does, delete them. That result, however, should never really be seen by users; it should be just a transient form used for comparison. Unfortunately, the way IDN works, this "comparison format" (with transformations of eszett and final sigma, and deleted ZWJ/ZWNJ) ends up being visible to the user.
There are words such as the name of the country of Sri Lanka, which require preservation of these joiners (in this case, ZWJ) in order to appear correct to the end users when the URL comes back from the DNS server.
A. No. The eszett and sigma are fundamentally different from I, l, and 1. With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:
goog1e.com
With the following, browsers that use IDNA2003 will go to a different location than browsers that use IDNA2008, unless the registry for xx puts into place a bundle strategy.
gießen.xx
The same goes for Greek sigma, which is a more common character in Greek than the eszett is in German.
A. It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.
It is a bit easier to maintain a bright line based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages (Japanese) require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have SONY日本.com with no problems at all—while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
The rough consensus among the working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA2008 is no different than IDNA2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to have a bright line. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.
Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country's languages. Application software also takes certain precautions: MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see UTR#36: Unicode Security Considerations [UTR36] on the Unicode web site.
A. Surprisingly, not really. It doesn't do anything about the most frequent sources of spoofing: look-alike characters that are both letters, like "paypal.com" with a Cyrillic "а". If a symbol that can be used to spoof a letter X is removed, but another letter that can spoof X is retained, there is no net benefit. Weighted by frequency, according to data at Google, the removal of symbols and punctuation in IDNA2008 reduces opportunities for spoofing by only about 0.000016%. In another Google study of 1B web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 MULTIPLICATION SIGN, by far the most common confusable symbol); but that page could be even better spoofed with х (U+0445 CYRILLIC SMALL LETTER HA), which normally has precisely the same displayed shape as "x".
There is a very significant security loophole in that IDNA2008 does not require a client to do any checks whatsoever on a Punycode version, such as "http://xn--iny-zx5a.com" (which contains a symbol). That is, a conformant browser "MAY" do those checks, but doesn't have to. Any such browser is completely dependent on the registry's being safe.
IDNA2003 was quite clear: it specified a standardized mapping and required checks on Punycode versions. IDNA2008 does neither, and introduces Deviations on top of that.
[TBD].
[TBD].
(http://tools.ietf.org/id/idnabis)
The following summarizes modifications from the previous revisions of this document.
Version 1
Copyright © 2009 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.