L2/10-446R2

 

From: Mark Davis

Date: Nov 4, 2010

Subject: Proposal for enhancements to UTS#46

Proposal for enhancements to UTS#46 for next Unicode version.

 

1. We've gotten feedback that it would be useful to indicate whether a character is allowed in IDNA2008, and likewise whether a test case is allowed. The proposal is to issue a 6.0.1 version of UTS #46 with an additional optional field in the data files http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt and http://www.unicode.org/Public/idna/latest/IdnaTest.txt.

 

We define the acronym "NV8" to indicate that at least one code point in a string is DISALLOWED under all versions of IDNA2008 (at or after Unicode 5.2).

 

The optional fields with this acronym will appear in the following cases:

 

For the IdnaTest.txt file, the extra field appears if the toUnicode value (field 3) contains any character that is NV8. The NV8 field does not otherwise appear.

 

Example:

B;        fass.de;

B;        xn--53h;        ☕;        xn--53h ; NV8
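As an illustration of how a consumer of IdnaTest.txt might pick up the optional field, here is a minimal sketch. It assumes the semicolon-separated layout shown in the example above (type; source; toUnicode; toASCII; optional NV8) and that "#" starts a comment; the names IdnaTestLine and parse_idna_test_line are hypothetical and not part of UTS #46.

```python
from typing import NamedTuple, Optional


class IdnaTestLine(NamedTuple):
    """One data line of IdnaTest.txt (hypothetical container for this sketch)."""
    test_type: str
    source: str
    to_unicode: str
    to_ascii: str
    nv8: bool


def parse_idna_test_line(line: str) -> Optional[IdnaTestLine]:
    """Parse a single line; return None for blank or comment-only lines.

    Assumption: fields are separated by ';' and '#' starts a comment.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # Trailing fields may be omitted, so pad out to five entries.
    fields += [""] * (5 - len(fields))
    return IdnaTestLine(fields[0], fields[1], fields[2], fields[3],
                        fields[4] == "NV8")
```

A parser written this way keeps working on files that lack the field, since a missing fifth field simply yields nv8 = False.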

 

For the IdnaMappingTable.txt file, the extra field appears:

  1. if the status is "valid" or "disallowed_STD3_valid" (field 2) and the source (field 1) contains any character that is NV8.
  2. if the status is "mapped" or "disallowed_STD3_mapped" (field 2) and the mapped value (field 3) contains any character that is NV8.
  3. The NV8 field does not otherwise appear.
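The three rules above can be sketched as a small decision function. This is only an illustration: the set of code points that are DISALLOWED under all versions of IDNA2008 is passed in by the caller (deriving it is outside the scope of this sketch), and the function names are hypothetical.

```python
from typing import Iterable, Set


def contains_nv8(code_points: Iterable[int], disallowed_2008: Set[int]) -> bool:
    """True if at least one code point is in the caller-supplied DISALLOWED set."""
    return any(cp in disallowed_2008 for cp in code_points)


def mapping_entry_gets_nv8(status: str,
                           source: Iterable[int],
                           mapping: Iterable[int],
                           disallowed_2008: Set[int]) -> bool:
    """Decide whether a mapping table entry carries the NV8 field."""
    # Rule 1: valid entries are judged by the source code points (field 1).
    if status in ("valid", "disallowed_STD3_valid"):
        return contains_nv8(source, disallowed_2008)
    # Rule 2: mapped entries are judged by the mapped value (field 3).
    if status in ("mapped", "disallowed_STD3_mapped"):
        return contains_nv8(mapping, disallowed_2008)
    # Rule 3: the field does not otherwise appear.
    return False
```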

 

Examples:

0030..0039    ; valid                                 # 1.1         DIGIT ZERO..DIGIT NINE

00B6..00B7    ; valid     ; NV8              # 1.1         PILCROW SIGN..MIDDLE DOT

2474          ; disallowed_STD3_mapped ; 0028 0031 0029 ; NV8 # 1.1         PARENTHESIZED DIGIT ONE

3260          ; mapped       ; 1100   ; NV8              # 1.1         CIRCLED HANGUL KIYEOK
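A reader of IdnaMappingTable.txt could handle lines like the examples above roughly as follows. This sketch assumes the layout shown in those examples (NV8, when present, is the last field before the "#" comment, and the mapping field may be absent); MappingEntry and parse_mapping_line are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class MappingEntry(NamedTuple):
    """One data line of IdnaMappingTable.txt (hypothetical container)."""
    first: int          # first code point of the range
    last: int           # last code point of the range (same as first if single)
    status: str
    mapping: List[int]  # empty when the line has no mapping field
    nv8: bool


def parse_mapping_line(line: str) -> Optional[MappingEntry]:
    """Parse a single line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # Assumption: NV8, when present, is always the final field.
    nv8 = fields[-1] == "NV8"
    if nv8:
        fields = fields[:-1]
    first, _, last = fields[0].partition("..")
    mapping_field = fields[2] if len(fields) > 2 else ""
    mapping = [int(cp, 16) for cp in mapping_field.split()]
    return MappingEntry(int(first, 16), int(last or first, 16),
                        fields[1], mapping, nv8)
```

Checking for NV8 in the last position, rather than at a fixed index, lets the same code accept both the "valid ; NV8" and the "mapped ; 1100 ; NV8" shapes.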

 

We will note that NV8 does not take the BIDI, CONTEXTO, or CONTEXTJ tests into account, since those need to be applied to the complete context of a label.

 

 

2. It would be useful to generate a more comprehensive set of test cases, like we do for collation. It could include one sample for:

  1. each assigned character (a sample in the case of large scripts)
  2. each multi-character NFD and NFKD decomposition
     1. + an interior combining mark
     2. + a trailing combining mark
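One possible way to generate the decomposition-based samples is sketched below using Python's standard unicodedata module. Several choices here are assumptions rather than part of the proposal: U+0300 COMBINING GRAVE ACCENT as the added mark, placing the "interior" mark directly after the first code point of the decomposition, and the function name decomposition_samples.

```python
import unicodedata
from typing import List

# Assumption: any combining mark would do; U+0300 is an arbitrary choice.
COMBINING_MARK = "\u0300"


def decomposition_samples(ch: str, mark: str = COMBINING_MARK) -> List[str]:
    """Return sample strings for one character.

    For each multi-character NFD or NFKD decomposition, emit the
    decomposition itself, a variant with a mark inserted after its first
    code point, and a variant with a mark appended. Duplicates are dropped.
    """
    samples: List[str] = []
    for form in ("NFD", "NFKD"):
        decomposed = unicodedata.normalize(form, ch)
        if len(decomposed) < 2:
            continue  # not a multi-character decomposition
        interior = decomposed[0] + mark + decomposed[1:]
        trailing = decomposed + mark
        for sample in (decomposed, interior, trailing):
            if sample not in samples:
                samples.append(sample)
    return samples
```

Calling this for every assigned code point, and sampling within large scripts, would give a starting corpus that could then be run through the UTS #46 processing to produce expected results.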