[Unicode]  Technical Reports
 

Draft Unicode Technical Standard #46

Unicode IDNA Compatibility Processing

Version 5.2.0 (draft 5)
Authors Mark Davis (markdavis@google.com), Michel Suignard
Date 2009-11-12
This Version http://www.unicode.org/reports/tr46/tr46-2.html
Previous Version http://www.unicode.org/reports/tr46/tr46-1.html
Latest Version http://www.unicode.org/reports/tr46/
Revision 2

Summary

This document provides a specification for processing that provides for compatibility between older and newer versions of internationalized domain names (IDN) for lookup in client software. It allows applications such as browsers and emailers to be able to handle both the original version of internationalized domain names (IDNA2003) and the newer version (IDNA2008) compatibly, avoiding possible interoperability and security problems.

[Review Note: At this point, IDNA2008 is still in development, so this draft may change as IDNA2008 changes.]

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium.  This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1. Introduction

One of the great strengths of domain names is universality. With http://Apple.com, you can get to Apple's website no matter where you are in the world, and no matter which browser you are using. With markdavis@google.com, you can send an email to an author of this specification, no matter which country you are in, and no matter which emailer you are using.

Initially, domain names were restricted to ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one could only use Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick those Greek letters that would correspond to the desired English letters. One would have to guess at the spelling of particular words, because there are not exact matches between scripts.

A large majority of the world’s population faced this situation until recently, because their languages use non-ASCII characters.

1.1 IDNA2003

A system was introduced in 2003 for internationalized domain names (IDN). This system is called Internationalizing Domain Names for Applications, or IDNA for short. It consists of a series of RFCs collectively known as IDNA2003 [IDNA2003]. This system allows non-ASCII Unicode characters, which includes not only the characters needed for Latin-script languages other than English (such as Å, Ħ, or Þ), but also different scripts, such as Greek, Cyrillic, Tamil, or Korean.

The IDNA mechanism for allowing non-ASCII Unicode characters in domain names involves applying the following steps to each label in the domain name that contains Unicode characters:

  1. Transforming (mapping) a Unicode string to remove case and other variant differences.
  2. Checking the resulting mapped string for validity, according to certain rules.
  3. Transforming the Unicode characters into a DNS-compatible ASCII string using a specialized encoding called Punycode [RFC3492].

For example, you can now type in http://Bücher.de into the address bar of any modern browser, and you will go to a corresponding site, even though the "ü" is not an ASCII character. This works because the IDN resolves to the Punycode string which is actually stored by the DNS for that site. Similarly, when a browser interprets a web page containing a link such as <a href="http://Bücher.de">, the appropriate site is reached. (In this document, when phrasing like "a browser interprets" is used, it refers both to domain names parsed out of URLs entered in an address bar and to those contained in links internal to HTML text.)

In this case, for the IDN Bücher.de, the Punycode value actually used for the domain names on the wire is http://xn--bcher-kva.de. The Punycode version is also typically transformed back into Unicode form for display. The resulting display string will be a string which has already been mapped according to the IDNA2003 rules. So in this example we end up with a display string that has been casefolded to lowercase:

http://Bücher.de → http://xn--bcher-kva.de → http://bücher.de
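
As a rough illustration of this round trip, the following sketch uses Python's built-in `idna` codec, which implements the IDNA2003 rules. It is not part of this specification, and the helper names `to_wire` and `to_display` are assumptions chosen only for this example.

```python
def to_wire(domain: str) -> str:
    """Map and Punycode-encode a domain name with Python's IDNA2003 codec."""
    return domain.encode("idna").decode("ascii")


def to_display(wire: str) -> str:
    """Convert the on-the-wire ASCII form back to Unicode for display."""
    return wire.encode("ascii").decode("idna")


wire = to_wire("Bücher.de")
print(wire)              # the ASCII form sent to the DNS
print(to_display(wire))  # the mapped (lowercased) display form
```

The display form differs from the input only in case, because the mapping step has already been applied before encoding.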

1.2 IDNA2008

There is a new version of IDNA under development. This version also consists of a collection of RFCs and is usually called IDNA2008 [IDNA2008]. The "2008" in that term does not reflect the actual date of approval, which is still pending and expected to occur in late 2009 or early 2010.

[Review note: reword above when 2008 is released.]

For the most common cases, the processing in IDNA2003 and IDNA2008 is identical. Both transform a Unicode domain name in a URL (like http://öbb.at) to the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain strict backwards compatibility with IDNA2003.

The main differences between the two are:

For more detail on the differences, see Section 8, IDNA Comparison.

1.3 Security Considerations

The cases of deviations and unpredictable changes introduced by the differences between IDNA2008 and IDNA2003 may cause both interoperability and security problems. They affect extremely common characters, such as all uppercase characters, all half-width or full-width characters (commonly used in Japan, China, and Korea), and certain other characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA).

IDNA2003 requires a mapping phase, which maps http://ÖBB.at to http://öbb.at, for example. Mapping typically involves mapping uppercase characters to their lowercase counterparts, but it also involves other types of mappings between equivalent characters, such as mapping half-width katakana characters to normal (full-width) katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both http://CNN.com and http://cnn.com work identically. They would not expect the addition of an accent to change the casing behavior: they expect that if http://Bruder.com is the same as http://bruder.com, then of course http://Brüder.com is the same as http://brüder.com. In other scripts there are variations in characters similar to case in this respect. The IDNA2003 mapping is based on data specified by Unicode; this mapping was later formalized as the Unicode property [NFKC_CaseFold].

IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping" or "Custom Mapping"), with no limitation on what the mapping can do to disallowed characters. Disallowed characters even include ASCII uppercase characters, if they occur in an IDN label. For more information on the permitted mappings, see the Protocol document of [IDNA2008], Section 4.2 Permitted Character and Label Validation and Section 5.2 Conversion to Unicode. An implementation of IDNA2008 which uses the option of Custom Mapping can, in principle, allow any particular mapping. Such mappings can have unpredictable results regarding the exact interpretation of the processed IDNs. For example, the following mappings show cases where IDNs are mapped to what would be considered completely different domain names by IDNA2003 rules:

[Review note: Fix the numbers/titles for the Protocol document if they change before 2008 is released.]

  1. Map http://ÖBB.at to http://öbb.at
  2. Map http://ÖBB.at to http://oebb.at
  3. Map http://TÜRKIYE.com to http://türkiye.com
  4. Map http://TÜRKIYE.com to http://türkıye.com

Note that there is a dotless i in the result of the mapping illustrated in #4. This has the consequence that the mapped IDN resolves to a different location than the mapped IDN in #3.
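
To make the consequence concrete, the sketch below applies two hypothetical custom mappings to the same label and compares the resulting Punycode forms. The function names and the simplified Turkish-style lowercasing rule are assumptions for illustration only; they are not defined by IDNA2008 or by this document.

```python
def punycode_label(label: str) -> str:
    """Encode one label with Python's built-in Punycode codec."""
    return "xn--" + label.encode("punycode").decode("ascii")


def default_lowercase(label: str) -> str:
    """Language-neutral lowercasing: I maps to i."""
    return label.lower()


def turkish_lowercase(label: str) -> str:
    """Simplified Turkish-style lowercasing (assumption): I maps to dotless i."""
    return label.replace("I", "\u0131").replace("\u0130", "i").lower()


mapped_default = default_lowercase("TÜRKIYE")
mapped_turkish = turkish_lowercase("TÜRKIYE")

# The two mapped labels differ, so their DNS forms differ as well.
print(mapped_default, punycode_label(mapped_default))
print(mapped_turkish, punycode_label(mapped_turkish))
```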

IDNA2008 does define a particular mapping. That mapping is not normative, and does not attempt to be compatible with IDNA2003. For more information, see the Mapping document in [IDNA2008].

1.3.1 Deviations

There are a few situations where the strict application of IDNA2008 will result in the resolution of IDNs to different IP addresses than in IDNA2003, unless the registry or registrant takes special action. This affects a relatively small number of characters, but these characters are common in particular languages. Because of this common occurrence, a significant number of strings for domain names are affected in those languages. This set of characters is referred to as "Deviations". There are four of these, as shown in Table 1, Deviation Characters.

Table 1. Deviation Characters

| Char | Example | IDNA2003 Result | IDNA2008 Result |
|---|---|---|---|
| ß (00DF) | href="http://faß.de" | http://fass.de = http://fass.de | http://faß.de = http://xn--fa-hia.de |
| ς (03C2) | href="http://βόλος.com" | http://βόλοσ.com = http://xn--nxasmq6b.com | http://βόλος.com = http://xn--nxasmm1c.com |
| ZWJ (200D) | href="http://ශ්ර‍ී.com" | http://ශ්රී.com = http://xn--10cl1a0b.com | http://ශ්ර‍ී.com = http://xn--10cl1a0b760p.com |
| ZWNJ (200C) | href="http://نامه‌ای.com" | http://نامهای.com = http://xn--mgba3gch31f.com | http://نامه‌ای.com = http://xn--mgba3gch31f060k.com |

For more information on the rationale for the occurrence of these Deviations in IDNA2008, see the [IDN FAQ].
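
The ß case can be reproduced with a short sketch. It assumes Python's built-in `idna` codec as a stand-in for IDNA2003 processing and uses the raw `punycode` codec to show what results when ß is left unmapped; neither is a full IDNA2008 implementation.

```python
label = "faß"

# IDNA2003-style processing: the mapping step turns ß into "ss",
# so the label becomes plain ASCII and needs no Punycode prefix.
idna2003_wire = (label + ".de").encode("idna").decode("ascii")

# Leaving ß unmapped (as IDNA2008 permits) and Punycode-encoding the label.
unmapped_wire = "xn--" + label.encode("punycode").decode("ascii") + ".de"

print(idna2003_wire)
print(unmapped_wire)
```

The two results are distinct DNS names, which is why the same link can resolve differently depending on the client.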

The differences in interpretation of Deviation characters result in the potential for security exploits. Consider a scenario involving http://www.sparkasse-gießen.de, a German IDN for "Gießen Savings and Loan".

  1. Alice's browser supports IDNA2003. Under those rules, http://www.sparkasse-gießen.de is mapped to http://www.sparkasse-giessen.de, which leads to a site with the IP address 01.23.45.67.
  2. She visits her friend Bob, and checks her bank statement on his browser. His browser supports IDNA2008. Under those rules, http://www.sparkasse-gießen.de is also valid, but converts to a different Punycode domain name in http://www.xn--sparkasse-gieen-2ib.de. This can lead to a different site with the IP address 101.123.145.167, a spoof site.

Alice ends up at the phishing site, supplies her bank password, and her money is stolen. While the .DE registrar (DENIC) might have a policy about bundling all of the variants of ß together (so that they all have the same owner), such a policy is not required of registries. It is quite unlikely that all registries will have or enforce such a bundling policy in all such cases.

There are two Deviations of particular concern. IDNA2008 allows ZWJ and ZWNJ characters in labels. By contrast these are removed by the mapping in IDNA2003. In addition to this difference in mapping, these characters represent a special security concern because they are normally invisible. That is, the sequence "a<ZWJ>b" looks just like "ab". IDNA2008 provides a special category called CONTEXTJ for ZWJ and ZWNJ, and only permits them to occur in certain contexts: certain sequences of Arabic or Indic characters. However, lookup applications are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, those context restrictions do not catch all cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.
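
One way a user interface might expose such invisible characters is sketched below. The function name `reveal_format_chars` and the `<U+XXXX>` notation are assumptions for this example; the sketch simply marks every code point whose General_Category is Cf (format), which includes ZWJ and ZWNJ.

```python
import unicodedata


def reveal_format_chars(text: str) -> str:
    """Replace invisible format characters with a visible <U+XXXX> marker."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            parts.append(f"<U+{ord(ch):04X}>")
        else:
            parts.append(ch)
    return "".join(parts)


plain = "ab"
joined = "a\u200Db"  # contains ZERO WIDTH JOINER

print(plain == joined)              # the strings differ even though they look alike
print(reveal_format_chars(joined))  # the joiner becomes visible
```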

2 Unicode IDNA Compatibility Processing

To allow client-side applications to work around the incompatibilities between IDNA2003 and IDNA2008 for lookup, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.1 and later. In so doing, it also incorporates the repertoire extensions provided by IDNA2008.

The Unicode IDNA Compatibility Processing uses the standard Unicode mapping, [NFKC_CaseFold], for the mapping described in this document. As a result, the domain name in http://ÖBB.at is valid, and maps to http://öbb.at. It also allows domain names such as the one in http://√.com (which has an associated web page); such names are allowed in IDNA2003. Based on security considerations, implementations may restrict or flag (in a UI) domain names that include symbols and punctuation. For more information, see UTR#36: Unicode Security Considerations [UTR36].

The result of this Compatibility Processing is a series of labels, each separated by U+002E ( . ) FULL STOP. For DNS lookup, the result of the Compatibility Processing is transformed by Punycoding each label that contains non-ASCII.
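
The final conversion for DNS lookup can be sketched as follows, assuming the input has already gone through the Compatibility Processing and uses U+002E as the only separator. The function name `to_dns_form` is hypothetical; the encoding itself relies on Python's built-in `punycode` codec.

```python
def to_dns_form(processed: str) -> str:
    """Punycode-encode each label that contains non-ASCII characters."""
    result = []
    for label in processed.split("."):
        if label.isascii():
            # ASCII-only labels are passed through unchanged.
            result.append(label)
        else:
            result.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(result)


print(to_dns_form("bücher.de"))
print(to_dns_form("öbb.at"))
```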

Using the Unicode IDNA Compatibility Processing to transform an IDN into a form suitable for DNS lookup is comparable to the tactic of "try IDNA2008 then try IDNA2003". However, this approach avoids a dual lookup, which can be very problematic. It allows browsers and other clients such as search engines to have a single processing step, without having to maintain two different implementations and multiple tables. It accounts for a number of edge cases that would cause problems, and provides a stable definition with predictable results that will remain absolutely backwards compatible in future versions of Unicode.

For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the [IDN_Demo].

This document provides a compatibility mechanism for dealing with IDNA domain name lookup and display. To this end, it specifies two types of processing: Lookup Processing and Display Processing. It does not deal with the registration of IDNs.

Note that neither the Unicode IDNA Compatibility Processing nor IDNA2008 addresses security problems associated with confusables (the so-called "paypal.com" problem). IDNA2008 does disallow certain symbols and punctuation characters that can be used for spoofing, such as spoofs of the slash character ("/"). These are, however, an extremely small fraction of the confusable characters used for spoofing. Moreover, confusable characters themselves account for a small proportion of phishing problems: most are cases like "secure-wellsfargo.com". For more information, see [Bortzmeyer].

It is strongly recommended that UTR#36: Unicode Security Considerations [UTR36] be consulted for information on dealing with confusables.

2.1 Display of Internationalized Domain Names

For IDNA2003 applications, it has been customary to display the processed string to the user. This is helpful for security, because it reduces the opportunity for visual confusability. Thus, for example, http://googIe.com (with a capital I in place of the L) is revealed as http://googie.com. However, for the case of the Deviations, the distinction between the original and processed form is especially important for users. Thus in displaying domain names, it is recommended that the Display Processing be applied. This is the same as Lookup Processing, except that it excludes the deviations: ß, ς, and joiners.

Labels presented to a browser may or may not be in the display form preferred by a target site; for more information see the [IDN FAQ]. This specification defines a default display algorithm in Section 4, Processing.

2.2 Registries

This specification is primarily targeted at applications doing Lookup Processing for IDNs. There is, however, one strong recommendation for registries: do not allow the registration of labels that are invalid according to Lookup Processing. The registration of such a label would not be found by browsers and search engines following Unicode IDNA Compatibility Processing.

The label that is actually registered and inserted into a registry is always a label that has been processed, for example http://xn--bcher-kva.de, which corresponds to http://bücher.de. However, it may be useful for a registry to also ask for "unprocessed" labels as part of the registration process, such as http://Bücher.de, so that it is aware of the registrant's intent. Such unprocessed labels must be handled carefully:

2.3 Notation

Sets of code points are defined using properties and the syntax of UTS#18: Unicode Regular Expressions [UTS18]. For example, the set of combining marks is represented by the syntax \p{gc=M}. An additional syntactic notation beyond the syntax of UTS#18 is used here: the "+" indicates the addition of elements to a set.

In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:

  1. U+002E ( . ) FULL STOP
  2. U+FF0E ( . ) FULLWIDTH FULL STOP
  3. U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
  4. U+FF61 ( 。 ) HALFWIDTH IDEOGRAPHIC FULL STOP
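
Splitting a domain name into labels on these four separators might look like the following minimal sketch. The names `LABEL_SEPARATORS` and `split_labels` are assumptions for illustration.

```python
import re

# The four label-separators listed above.
LABEL_SEPARATORS = "\u002E\uFF0E\u3002\uFF61"
_SEPARATOR_RE = re.compile("[" + re.escape(LABEL_SEPARATORS) + "]")


def split_labels(domain_name: str) -> list:
    """Split a domain name into labels at any of the label-separators."""
    return _SEPARATOR_RE.split(domain_name)


print(split_labels("www.example\u3002com"))
```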

3 Conformance

The requirements for conformance on implementations of the Unicode IDNA Compatibility Processing algorithm are as follows:

C1 Given a version of Unicode and a Unicode String, a conformant implementation of Lookup Processing shall replicate the results given by applying the Lookup Processing algorithm specified by Section 4, Processing.
C2 Given a version of Unicode and a Unicode String, a conformant implementation of Display Processing shall replicate the results given by applying the Display Processing algorithm specified by Section 4, Processing.

These specifications are logical ones, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as the result specified by the logical algorithm.

Any conformant implementation may also have tighter validity criteria than those imposed by Section 6, Validity Criteria. An application could, for example, disallow or warn of domain name labels with certain characteristics:

For more information, see UTR#36: Unicode Security Considerations [UTR36].

4 Processing

The input to Unicode IDNA Compatibility Processing is a prospective domain_name string expressed in Unicode. The domain name consists of a sequence of labels with dot separators, such as "Bücher.de".

Note: For more information about the composition of a URL, see Section 3.5 of [RFC1034].

The input domain_name string must have had all escaped Unicode code points converted to Unicode code points. For example, U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341 could have been escaped as any of the following:
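
As a hedged illustration of such unescaping, the sketch below handles two widely used conventions: percent-encoded UTF-8 and HTML numeric character references. These two forms are assumptions chosen for the example, not a list taken from this specification; the calls used are from the Python standard library.

```python
import html
from urllib.parse import unquote


def unescape_percent(text: str) -> str:
    """Decode percent-encoded UTF-8 bytes, for example %E5%8D%81."""
    return unquote(text)


def unescape_html(text: str) -> str:
    """Decode HTML character references, for example &#x5341; or &#21313;."""
    return html.unescape(text)


# Each call yields a string containing U+5341 as an actual code point.
print(unescape_percent("%E5%8D%81.com"))
print(unescape_html("&#x5341;.com"))
print(unescape_html("&#21313;.com"))
```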

The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string.

  1. For each code point in the domain_name string, look up the status value in Section 5, IDNA Mapping Table, and take the following actions:
    • disallowed: Abort with an error.
    • ignored: Remove the code point from the string. This is equivalent to mapping the code point to an empty string.
    • mapped: Replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
    • deviation:
      • For Lookup Processing, replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
      • For Display Processing, leave the code point unchanged in the string.
    • valid: Leave the code point unchanged in the string.
  2. Normalize the domain_name string to Unicode Normalization Form C.
  3. For each label in the domain_name string
    1. If the label starts with "xn--", attempt to convert the rest of the label to Unicode according to Punycode [RFC3492].
    2. If that conversion fails, abort with an error.
    3. If that conversion succeeds, replace the original label in the string by the results of the conversion.
  4. For each label in the domain_name string, verify that it meets the validity criteria in Section 6, Validity Criteria. If any of the validity criteria are not satisfied, abort with an error.

Any input domain_name string that does not abort with an error in the application of these steps is valid according to this specification. Conversely, if an input domain_name string causes an error, then that input domain_name string is not valid. The processing is idempotent: reapplying the processing to the output will make no further changes. For examples, see Table 2, Examples of Lookup Processing.

There are two types of processing: Lookup Processing and Display Processing. These differ only in how the deviation code points in the mapping table are handled. The result of Lookup processing can be converted to a domain name string containing Punycode labels ("asciified").
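
The four processing steps can be sketched as follows. This is a toy under stated assumptions, not a conformant implementation: `TOY_TABLE` is a hypothetical four-entry excerpt standing in for the IDNA Mapping Table, every other code point is classified by a crude NFKC-plus-lowercase approximation, and only some of the validity criteria are checked. The names `status_of`, `process`, and `IdnaError` are likewise invented for this example.

```python
import unicodedata

# Hypothetical excerpt standing in for the real IDNA Mapping Table.
TOY_TABLE = {
    "\u00DF": ("deviation", "ss"),   # LATIN SMALL LETTER SHARP S
    "\u00AD": ("ignored", ""),       # SOFT HYPHEN
    "\u2488": ("disallowed", None),  # DIGIT ONE FULL STOP
    "\u3002": ("mapped", "."),       # IDEOGRAPHIC FULL STOP
}


class IdnaError(ValueError):
    """Raised when processing aborts with an error."""


def status_of(ch):
    """Return (status, mapping) for one code point (toy approximation)."""
    if ch in TOY_TABLE:
        return TOY_TABLE[ch]
    folded = unicodedata.normalize("NFKC", ch).lower()
    if folded != ch:
        return ("mapped", folded)
    return ("valid", None)


def process(domain_name, display=False):
    # Step 1: apply the action for each code point's status.
    mapped = []
    for ch in domain_name:
        status, mapping = status_of(ch)
        if status == "disallowed":
            raise IdnaError(f"disallowed code point U+{ord(ch):04X}")
        if status == "ignored":
            continue
        if status == "mapped" or (status == "deviation" and not display):
            mapped.append(mapping)
        else:
            mapped.append(ch)
    # Step 2: normalize to NFC.
    text = unicodedata.normalize("NFC", "".join(mapped))
    # Step 3: convert "xn--" labels from Punycode to Unicode.
    labels = []
    for label in text.split("."):
        if label.startswith("xn--"):
            try:
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError as exc:
                raise IdnaError("invalid Punycode label") from exc
        labels.append(label)
    # Step 4: check a subset of the validity criteria.
    allowed = {"valid", "deviation"} if display else {"valid"}
    for label in labels:
        if not label:
            raise IdnaError("empty label")
        if label[0] == "-" or label[-1] == "-":
            raise IdnaError("label begins or ends with a hyphen")
        if unicodedata.normalize("NFC", label) != label:
            raise IdnaError("label is not in NFC")
        if unicodedata.category(label[0]).startswith("M"):
            raise IdnaError("label begins with a combining mark")
        if any(status_of(ch)[0] not in allowed for ch in label):
            raise IdnaError("label contains a code point with a disallowed status")
    return ".".join(labels)


print(process("Bloß.de"))                # Lookup Processing
print(process("Bloß.de", display=True))  # Display Processing
```

The only difference between the two modes is the `display` flag, which controls whether deviation code points are mapped and whether they count as valid afterwards.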

Note: Some browsers also allow characters such as underscore ("_") in domain names. Any such extension is outside of the scope of this document.

Implementations are advised to apply additional tests to these labels such as those described in UTR#36: Unicode Security Considerations [UTR36], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI.

Table 2. Examples of Lookup Processing

| Input | Step 1 | Step 2 | Step 3 | Step 4 | Comment |
|---|---|---|---|---|---|
| Bloß.de | bloss.de | = | = | valid | maps uppercase and eszett |
| u¨.com | = | ü.com | = | valid | normalizes u + umlaut |
| xn--tda.com | xn--tda.com | = | ü.com | valid | xn--tda = ü |
| xn--u-ccb.com | = | = | u¨.com | error | xn--u-ccb = u + umlaut |
| a⒈com | error | | | | ⒈ is not valid |
| xn--a-ecp.ru | xn--a-ecp.ru | = | a⒈.ru | error | xn--a-ecp = a⒈ |
| xn--a.pt | xn--a.pt | = | error | | invalid Punycode |
| 日本語。JP | 日本語.jp | = | = | valid | mapping full width characters |
| ☕.us | = | = | = | valid | post Unicode 3.2 character |

4.1 Implementation Notes

There are a number of optimizations that can be applied to this processing. These optimizations can improve performance, reduce table size, make use of existing NFKC transform mechanisms, and so on. For example:

5 IDNA Mapping Table

For each code point in Unicode, the IDNA Mapping Table provides a status value. If this status value is mapped or deviation, the table also supplies a mapping value for that code point. A table is provided for each version of Unicode starting with Unicode 5.1, in versioned directories under [IDNA-Table]. Each table for a version of the Unicode Standard will always be backwards compatible with previous versions of the table: only characters with the status value disallowed may change in status or mapping value.

A description of the derivation of these tables is provided in Section 7, Mapping Table Derivation. As for derived properties in the Unicode Character Database, the description of the derivation is informative. Only the data in IDNA Mapping Table is normative for the application of this specification.

The files use a semicolon-delimited format similar to those in the Unicode Character Database. The first field is the code point; the second field is the status value; and the third field is the mapping value. Code points are expressed in hexadecimal. The status values are one of the following five values: valid, disallowed, ignored, mapped, and deviation.

Example:

0000..002C ; disallowed # NULL..COMMA
002D ; valid # HYPHEN-MINUS
...
0041 ; mapped ; 0061 # LATIN CAPITAL LETTER A
...
00AD ; ignored # SOFT HYPHEN
...
00DF ; deviation ; 0073 0073 # LATIN SMALL LETTER SHARP S
...
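
A parser for lines in this format could be sketched as below. The function name `parse_table_line` and the shape of the returned tuple are assumptions for illustration; the handling of `..` ranges and `#` comments follows the example lines above.

```python
def parse_table_line(line):
    """Parse one table line into (first, last, status, mapping), or None.

    `first` and `last` are integer code points (equal for a single code
    point); `mapping` is a string, or None when no mapping value is given.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data or data == "...":
        return None                       # blank line or elision marker
    fields = [field.strip() for field in data.split(";")]
    code_points, status = fields[0], fields[1]
    if ".." in code_points:
        first, last = code_points.split("..")
    else:
        first = last = code_points
    mapping = None
    if len(fields) > 2 and fields[2]:
        # The mapping value is a space-separated list of hex code points.
        mapping = "".join(chr(int(cp, 16)) for cp in fields[2].split())
    return (int(first, 16), int(last, 16), status, mapping)


print(parse_table_line("00DF ; deviation ; 0073 0073 # LATIN SMALL LETTER SHARP S"))
```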

6 Validity Criteria

Each of the following criteria must be satisfied for a label to be valid:

  1. The label must contain at least one code point.
  2. The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions.
  3. The label must neither begin nor end with a U+002D HYPHEN-MINUS character.
  4. The label must be in Unicode Normalization Form NFC.
  5. The label must not contain a U+002E ( . ) FULL STOP.
  6. Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:
    1. For Lookup Processing, each value must be valid.
    2. For Display Processing, each value must be either valid or deviation.
  7. The label must not begin with a combining mark, that is: General_Category=Mark.

In addition, the label should meet the requirements for right-to-left characters specified in the Bidi document of [IDNA2008].

Any particular application may have tighter validity criteria, as discussed in Section 3, Conformance.
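
The seven numbered criteria could be checked with a function along these lines. It is a sketch only: `status_of` is a hypothetical callable that returns the status string for a code point (in practice backed by the IDNA Mapping Table), and the right-to-left requirements are not covered.

```python
import unicodedata


def label_problems(label, status_of, display=False):
    """Return a list of the validity criteria that the label fails."""
    if len(label) == 0:
        return ["empty label"]                                   # criterion 1
    problems = []
    if label[2:4] == "--":
        problems.append("hyphen in third and fourth positions")  # criterion 2
    if label.startswith("-") or label.endswith("-"):
        problems.append("leading or trailing hyphen")            # criterion 3
    if unicodedata.normalize("NFC", label) != label:
        problems.append("not in NFC")                            # criterion 4
    if "." in label:
        problems.append("contains FULL STOP")                    # criterion 5
    allowed = {"valid", "deviation"} if display else {"valid"}
    if any(status_of(ch) not in allowed for ch in label):
        problems.append("code point with disallowed status")     # criterion 6
    if unicodedata.category(label[0]).startswith("M"):
        problems.append("begins with a combining mark")          # criterion 7
    return problems


# Toy status function (assumption): only ß is a deviation, all else valid.
def toy_status(ch):
    return "deviation" if ch == "\u00DF" else "valid"


print(label_problems("ab--cd", toy_status))
```

Returning the list of failures rather than a single boolean makes it easier for an application to report why a label was rejected.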

7 Mapping Table Derivation

The following describes the derivation of the mapping table. Step 1 defines a base mapping value; Steps 2-4 define three sets of characters. These are all used in Step 5 to produce the mapping and status values for the table.

If a Unicode property were to change in a future version in a way that would affect backwards compatibility, a grandfathering clause will be added to maintain compatibility. For more information on compatibility, see Section 5, IDNA Mapping Table.

Step 1: Produce a base mapping value

  1. Map the following label separator characters to U+002E ( . ) FULL STOP
    1. U+FF0E ( . ) FULLWIDTH FULL STOP
    2. U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
    3. U+FF61 ( 。 ) HALFWIDTH IDEOGRAPHIC FULL STOP
  2. Map each other character to its NFKC_CaseFold value [NFKC_CaseFold].
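
Step 1 can be approximated in a few lines. Python's standard library does not expose the NFKC_CaseFold property directly, so this sketch substitutes NFKC normalization combined with `str.casefold()`; that substitution is an assumption and differs from the real property in some cases (for example, it does not remove default ignorable code points such as U+00AD).

```python
import unicodedata

# Label separator variants that are mapped to U+002E FULL STOP.
LABEL_SEPARATOR_VARIANTS = {"\uFF0E", "\u3002", "\uFF61"}


def base_mapping(ch: str) -> str:
    """Approximate the base mapping value for a single code point."""
    if ch in LABEL_SEPARATOR_VARIANTS:
        return "\u002E"
    folded = unicodedata.normalize("NFKC", ch).casefold()
    # Case folding can denormalize, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


print(base_mapping("\u3002"), base_mapping("Ö"), base_mapping("ß"))
```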

Step 2: Specify the base valid set

The base valid set is defined by the sequential list of additions and subtractions in Table 3, Base Valid Set. This definition is based on the principles of IDNA2003. When applied to the repertoire of Unicode 3.2 characters, this produces a set which is closely aligned with IDNA2003.

Table 3. Base Valid Set

| Formal Sets | Descriptions |
|---|---|
| [ \P{Changes_When_NFKC_Casefolded} | Start with characters that are NFKC Case folded (excluding uppercase, for example). Note that \P means the inverse of \p, so these are the characters that don't change when individually NFKC_CaseFolded. |
| - \p{c} - \p{z} | Remove Control Characters and Whitespace |
| - \p{Block=Ideographic_Description_Characters} | Remove ideographic description characters |
| - \p{ascii} | Remove ASCII |
| + [\u002D\u002Ea-zA-Z0-9] | Add back all the valid ASCII, plus U+002E ( . ) FULL STOP |
Step 3: Specify the base exclusion set

The exclusion set consists of characters that have a different mapping in IDNA2003 than the base mapping value specified in Step 1. For more information, see the [IDN FAQ]. For this version, the exclusion set consists of the following:

Step 4: Specify the deviation set

This is the set of characters that deviate between IDNA2003 and IDNA2008.

U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA

Step 5: Produce the status and mapping values for the table

For each code point:

  1. If the code point is in the deviation set
    • the status is deviation and the mapping value is the base mapping value for that code point
  2. Otherwise, if (a) the code point is in the base exclusion set, or if (b) any code point in its base mapping value is not in the base valid set
    • the status is disallowed and there is no mapping value in the table
  3. Otherwise, if the base mapping value is an empty string
    • the status is ignored and there is no mapping value in the table
  4. Otherwise, if the base mapping value is the same as the code point
    • the status is valid and there is no mapping value in the table
  5. Otherwise,
    • the status is mapped and the mapping value is the base mapping value for that code point

Note that characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP fall under (2a).
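
The decision cascade for each code point can be written directly from the rules above. In this sketch the function name `derive_entry` and the tiny example sets are hypothetical; real sets would come from Steps 1 through 4.

```python
def derive_entry(cp, base_mapping, base_valid, base_exclusion, deviation):
    """Return (status, mapping) for one code point.

    `cp` is a one-character string, `base_mapping` is its base mapping
    value (a string, possibly empty), and the remaining arguments are
    sets of one-character strings.
    """
    if cp in deviation:
        return ("deviation", base_mapping)
    if cp in base_exclusion or any(c not in base_valid for c in base_mapping):
        return ("disallowed", None)
    if base_mapping == "":
        return ("ignored", None)
    if base_mapping == cp:
        return ("valid", None)
    return ("mapped", base_mapping)


# Toy inputs (assumptions for illustration only).
toy_valid = set("abcdefghijklmnopqrstuvwxyz0123456789-.") | {"\u00f6"}
toy_exclusion = {"\u2488"}
toy_deviation = {"\u00DF", "\u03C2", "\u200C", "\u200D"}

print(derive_entry("\u00d6", "\u00f6", toy_valid, toy_exclusion, toy_deviation))
```

Note that an empty base mapping value passes the validity test in the second rule trivially, which is what lets it reach the ignored case.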

8 IDNA Comparison

Table 4, IDNA Comparisons illustrates the differences between the three specifications in terms of valid character repertoire. It omits all code points unassigned in Unicode 5.2, as well as the ASCII-repertoire code points, because the specifications treat all of those identically. The table has separate groupings for Unicode 3.2 (the only characters valid in IDNA2003) and beyond. It also separates buckets where UTS46 and IDNA2008 behave the same from those where they behave differently.

Each row in the table defines a bucket of code points that share a pattern of behavior across the three specifications. The columns provide the following information:

Table 4. IDNA Comparisons

| Count | IDNA2003 | UTS46 | IDNA2008 | Comments and Samples |
|---|---|---|---|---|
| Unicode v3.2 (UTS46 = IDNA2008) | | | | |
| 86,676 | Valid | Valid | Valid | Valid in all three systems: U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE |
| 432 | Disallowed | Disallowed | Disallowed | Disallowed in all three systems: U+FF01 ( ! ) FULLWIDTH EXCLAMATION MARK |
| 52 | Valid / Mapped | Disallowed | Disallowed | Mappings changed after v3.2: U+2132 ( Ⅎ ) TURNED CAPITAL F |
| Unicode v3.2 (UTS46 ≠ IDNA2008) | | | | |
| 4,639 | Mapped / Ignored | Mapped / Ignored | Disallowed | Case and compatibility variants, default ignorables: U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE |
| 3,258 | Valid | Valid | Disallowed | Punctuation, Symbols, etc.: U+2665 ( ♥ ) BLACK HEART SUIT |
| 4 | Mapped / Ignored | Mapped / Ignored | Valid | Deviations: U+200C ( ) ZERO WIDTH NON-JOINER; U+200D ( ) ZERO WIDTH JOINER; U+00DF ( ß ) LATIN SMALL LETTER SHARP S; U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA |
| Unicode v4.0-5.2 (UTS46 = IDNA2008) | | | | |
| 9,705 | Valid* | Valid | Valid | U+0221 ( ȡ ) LATIN SMALL LETTER D WITH CURL |
| 53 | Valid* | Disallowed | Disallowed | U+0602 ( ؂ ) ARABIC FOOTNOTE MARKER |
| Unicode v4.0-5.2 (UTS46 ≠ IDNA2008) | | | | |
| 1,592 | Valid* | Valid | Disallowed | U+2615 ( ☕ ) HOT BEVERAGE |
| 790 | Valid* | Mapped / Ignored | Disallowed | U+023A ( Ⱥ ) LATIN CAPITAL LETTER A WITH STROKE |

[Review Note: The table will need to be regenerated for Unicode 5.2 and for any changes that are made in IDNA2008 before it goes final. It may also need some tweaking for label separators.]

Acknowledgements

For their contributions of ideas or text to this specification, thanks to Matitiahu Allouche, Peter Constable, Craig Cummings, Martin Dürst, Peter Edberg, Deborah Goldsmith, Laurentiu Iancu, Gervase Markham, Simon Montagu, Lisa Moore, Eric Muller, Murray Sargent, Markus Scherer, Jungshik Shin, Shawn Steele, Erik van der Poel, Chris Weber, and Ken Whistler. The specification builds upon [IDNA2008], developed in the IETF Idnabis working group, especially contributions from Matitiahu Allouche, Harald Alvestrand, Vint Cerf, Martin J. Dürst, Lisa Dusseault, Patrik Fältström, Paul Hoffman, Cary Karp, John Klensin, and Peter Resnick, and also upon [IDNA2003], authored by Marc Blanchet, Adam Costello, Patrik Fältström, and Paul Hoffman.

References

References not listed here may be found in http://www.unicode.org/reports/tr41/#UAX41.

[Bortzmeyer] http://www.bortzmeyer.org/idn-et-phishing.html (machine translated at http://translate.google.com/translate?u=http%3A%2F%2Fwww.bortzmeyer.org%2Fidn-et-phishing.html)
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[IDNA2003] The IDNA2003 specification is defined by a cluster of IETF RFCs:
[IDNA2008] The draft IDNA2008 specification is defined by a cluster of IETF RFCs:

For more information, see http://tools.ietf.org/id/idnabis.

[Review Note: Fix the references once IDNA2008 is final, and use the formal titles references in the text.]

[IDNA-Table] http://www.unicode.org/Public/idna
[IDN_Demo] http://unicode.org/cldr/utility/idna.jsp
[IDN FAQ] http://www.unicode.org/faq/idn.html
[NFKC_CaseFold] The Unicode property specified in [UAX44], and defined by the data in DerivedNormalizationProps.txt (search for "NFKC_CaseFold").
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[RFC1034] P. Mockapetris. "DOMAIN NAMES - CONCEPTS AND FACILITIES", RFC1034, November 1987
http://tools.ietf.org/html/rfc1034
[RFC3454] P. Hoffman, M. Blanchet. "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.
http://ietf.org/rfc/rfc3454.txt
[RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.
http://ietf.org/rfc/rfc3490.txt
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.
http://ietf.org/rfc/rfc3491.txt
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.
http://ietf.org/rfc/rfc3492.txt
[SafeBrowsing] http://code.google.com/apis/safebrowsing/
[Unicode] The Unicode Standard
For the latest version see:
http://www.unicode.org/versions/latest/.
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

The following summarizes modifications from the previous revisions of this document.

Revision 2

Revision 1