

From: Mark Davis

Date: Nov 4, 2010

Subject: Proposal for enhancements to UTS#46

Proposal for enhancements to UTS#46 for the next Unicode version.


1. We've received feedback that it would be useful to indicate whether a character is allowed in IDNA2008, and likewise whether a test case is. The proposal is to issue a 6.0.1 version of UTS #46 with an additional optional field in the data files http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt and http://www.unicode.org/Public/idna/latest/IdnaTest.txt.


We define an acronym "NV8" to indicate that at least one code point in a string is DISALLOWED under all versions of IDNA2008 (at or after Unicode 5.2).


The optional fields with this acronym will appear in the following cases:


For the IdnaTest.txt file, the extra field appears if the toUnicode value (field 3) contains any character that is NV8. The NV8 field does not otherwise appear.



B;        fass.de;

B;        xn--53h;        ☕;        xn--53h ; NV8
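The IdnaTest.txt rule above can be sketched in a few lines of Python. This is only an illustration: the name `idnatest_needs_nv8` and the set `SAMPLE_DISALLOWED` are hypothetical, and the set is a tiny hand-picked stand-in for the real IDNA2008 DISALLOWED data, which would have to be derived separately.

```python
# Hypothetical stand-in for the full set of code points that are DISALLOWED
# under IDNA2008. Only two entries are listed so the sketch runs on its own.
SAMPLE_DISALLOWED = {0x2615, 0x00B6}  # HOT BEVERAGE, PILCROW SIGN


def idnatest_needs_nv8(to_unicode, disallowed=SAMPLE_DISALLOWED):
    """Return True if the toUnicode value (field 3) contains any NV8 character."""
    return any(ord(ch) in disallowed for ch in to_unicode)


if __name__ == "__main__":
    for value in ("fass.de", "\u2615"):
        flag = "NV8" if idnatest_needs_nv8(value) else ""
        print(f"{value!r:12} {flag}")
```

A generator for the test file would append the extra field only when this check returns True, and leave the line unchanged otherwise.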


For the IdnaMappingTable.txt file, the extra field appears:

  1. if the status is "valid" or "disallowed_STD3_valid" (field 2) and the source (field 1) contains any character that is NV8.
  2. if the status is "mapped" or "disallowed_STD3_mapped" (field 2) and the mapped value (field 3) contains any character that is NV8.
  3. The NV8 field does not otherwise appear.



0030..0039    ; valid                                 # 1.1         DIGIT ZERO..DIGIT NINE

00B6..00B7    ; valid     ; NV8              # 1.1         PILCROW SIGN..MIDDLE DOT

2474          ; disallowed_STD3_mapped ; 0028 0031 0029 ; NV8 # 1.1         PARENTHESIZED DIGIT ONE

3260          ; mapped       ; 1100   ; NV8              # 1.1         CIRCLED HANGUL KIYEOK
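As a rough sketch, the three IdnaMappingTable.txt cases above could be expressed as follows. The function name `mapping_needs_nv8` and the set `SAMPLE_DISALLOWED` are hypothetical; the set contains only a handful of code points chosen to match the sample lines, not the real IDNA2008 data.

```python
# Hypothetical stand-in for the IDNA2008 DISALLOWED set, limited to the
# code points needed by the sample lines.
SAMPLE_DISALLOWED = {0x00B6, 0x0028, 0x0029, 0x1100}

VALID_STATUSES = {"valid", "disallowed_STD3_valid"}
MAPPED_STATUSES = {"mapped", "disallowed_STD3_mapped"}


def mapping_needs_nv8(status, source, mapped, disallowed=SAMPLE_DISALLOWED):
    """Decide whether the optional NV8 field appears on a mapping-table line.

    status: field 2 of the line
    source: code points covered by field 1 (as integers)
    mapped: code points in field 3 (as integers; empty if there is no mapping)
    """
    if status in VALID_STATUSES:
        # Case 1: check the source code points.
        return any(cp in disallowed for cp in source)
    if status in MAPPED_STATUSES:
        # Case 2: check the mapped value.
        return any(cp in disallowed for cp in mapped)
    # Case 3: the field does not otherwise appear.
    return False


if __name__ == "__main__":
    print(mapping_needs_nv8("valid", range(0x30, 0x3A), []))
    print(mapping_needs_nv8("disallowed_STD3_mapped", [0x2474], [0x28, 0x31, 0x29]))
    print(mapping_needs_nv8("mapped", [0x3260], [0x1100]))
```

Note that for the mapped statuses only the mapped value is inspected; the source code point itself does not affect the result.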


We will note that the NV8 determination does not apply the BIDI, CONTEXTO, or CONTEXTJ tests, since those need to be applied to the complete context of a label.



2. It would be useful to generate a more comprehensive set of test cases, like we do for collation. It could include one sample for:

  1. each assigned character (a sample in the case of large scripts)
  2. each multi-character NFD and NFKD decomposition
     1. + an interior combining mark
     2. + a trailing combining mark
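One possible reading of the decomposition cases is sketched below using Python's standard `unicodedata` module. It assumes the "+ combining mark" items are variants of each multi-character decomposition. The function name `decomposition_samples` and the choice of U+0308 COMBINING DIAERESIS as the added mark are arbitrary assumptions for illustration.

```python
import unicodedata

# Arbitrary combining mark used for the "+ combining mark" variants.
EXTRA_MARK = "\u0308"  # COMBINING DIAERESIS


def decomposition_samples(ch, mark=EXTRA_MARK):
    """Return sample strings for one character.

    For each multi-character NFD or NFKD decomposition this yields the
    decomposition itself, a variant with the mark inserted after the first
    code point (interior), and a variant with the mark appended (trailing).
    Duplicates are dropped while preserving order.
    """
    samples = []
    for form in ("NFD", "NFKD"):
        decomp = unicodedata.normalize(form, ch)
        if len(decomp) < 2:
            continue  # not a multi-character decomposition
        for candidate in (decomp, decomp[0] + mark + decomp[1:], decomp + mark):
            if candidate not in samples:
                samples.append(candidate)
    return samples


if __name__ == "__main__":
    for ch in ("a", "\u00E9", "\uFB01"):
        print(f"U+{ord(ch):04X}", [s.encode("unicode_escape").decode() for s in decomposition_samples(ch)])
```

A real generator would loop over all assigned characters (or a sample for large scripts) and feed each resulting string through the UTS #46 processing to produce the expected fields.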