Normalization Issue

L2/04-094

Re:	Normalization Issue
From:	Mark Davis, Martin Duerst
Date:	2003-02-04

Issue

We just became aware of a problem with the language in definition D2 in the specification of UAX #15 Normalization Forms. Definition D2 defines what it means for a character to be blocked with the following text:

D2. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same combining class as C.

The original design, the implementations that were used to develop normalization, and the sample code in UAX #15 and charlint all actually follow the definition below instead (the missing wording was a glitch that escaped review):

D2'. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same or higher combining class as C.

The following table shows the difference between D2 and D2', where k and i are nonzero canonical combining classes (ccc).

Table 1: Differences
ccc(B)	ccc(C)	B blocks C?	Comments
0	0	irrelevant	impossible: B becomes S as soon as it is seen
0	k	irrelevant	impossible: B becomes S as soon as it is seen
i	0	yes	was no in D2
i	k = i	yes	avoids affecting ordering in both cases
i	k > i	no	allows combination in both cases
i	k < i	irrelevant	impossible after canonical reordering
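
The two definitions can be compared directly as predicates on combining classes. The following is a minimal sketch in Python; the function names are illustrative only and are not taken from UAX #15 or from any surveyed implementation. The sample values 220 and 230 are simply two nonzero combining classes chosen for the demonstration.

```python
def blocks_d2(ccc_b: int, ccc_c: int) -> bool:
    """Old D2: B blocks C if B is a starter or has the same class as C."""
    return ccc_b == 0 or ccc_b == ccc_c


def blocks_d2_prime(ccc_b: int, ccc_c: int) -> bool:
    """Corrected D2': B blocks C if B is a starter or ccc(B) >= ccc(C)."""
    return ccc_b == 0 or ccc_b >= ccc_c


# The rows of Table 1 that can actually occur, with i = 220.
i = 220
rows = {
    "ccc(C) = 0": (i, 0),
    "k = i": (i, 220),
    "k > i": (i, 230),
}
for label, (ccc_b, ccc_c) in rows.items():
    print(f"{label}: D2={blocks_d2(ccc_b, ccc_c)}, "
          f"D2'={blocks_d2_prime(ccc_b, ccc_c)}")
```

Only the row where C is a starter gives different answers under the two predicates, which is the single behavioral change described in the table.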

Frequency of Occurrence

Fortunately, the difference between these is limited to sequences that should never occur in real data. The only situation where there is a difference is where two starters would primary-combine, but happen to be separated by a non-starter. The two starters can only be those in the table Double Starters (at the end of this document). An important feature of these combinations is that in well-formed text in any language, no non-starters will occur between the possible pairs of starter characters that primary-combine. For example, one never sees the sequence:

U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+1161 (ᅡ) HANGUL JUNGSEONG A

According to the old D2, the NFC form of this would be U+AC00 (가) HANGUL SYLLABLE GA + U+0300 (◌̀) COMBINING GRAVE ACCENT; that is, the second starter is not blocked from the first, and combines with it. D2' prevents this: the second starter is blocked by the accent, so the NFC form is identical to the original sequence.

Although one would never encounter these sequences in real data, it is formally important to correct the definition, for two reasons:

Unless the definition is fixed, NFC and NFKC do not obey canonical equivalence for these sequences.
- i.e., there are some x and y where toNFD(x) ≠ toNFD(y), but toNFC(x) = toNFC(y)
Unless the definition is fixed, NFC and NFKC are not idempotent, and thus not a folding.
- i.e., there is some x such that toNFC(toNFC(x)) ≠ toNFC(x)

An example of D2 breaking idempotency: let x be the Hangul sequence above (U+1100, U+0300, U+1161) followed by U+0323 (◌̣) COMBINING DOT BELOW. The first application of NFC produces

U+AC00 (가) HANGUL SYLLABLE GA + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+0323 (◌̣) COMBINING DOT BELOW.

The second NFC reverses the order of the accents, producing

U+AC00 (가) HANGUL SYLLABLE GA + U+0323 (◌̣) COMBINING DOT BELOW + U+0300 (◌̀) COMBINING GRAVE ACCENT.

Idempotency is important for consistent lookup. Requesting A and getting B, and then requesting B and getting C, is clearly something to be avoided. Idempotency is also required so that isNFC(toNFC(x)) always holds.
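
The failure can be reproduced with a deliberately small composer. The sketch below is an illustration under stated assumptions, not the UAX #15 sample code: it relies on Python's unicodedata module for decomposition and canonical reordering, composes only Hangul leading consonant + vowel pairs, and switches between D2 and D2' with a hypothetical `corrected` flag. All names are made up for this example.

```python
import unicodedata

L_BASE, V_BASE, S_BASE = 0x1100, 0x1161, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def hangul_lv(first: str, second: str):
    """Return the LV syllable for a leading consonant + vowel, else None."""
    l, v = ord(first) - L_BASE, ord(second) - V_BASE
    if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT)
    return None


def toy_nfc(text: str, corrected: bool) -> str:
    """Toy NFC: library NFD, then Hangul L+V composition only.

    Uses D2' when corrected is true and the old D2 otherwise.
    """
    out = []
    starter = None  # index in out of the most recent starter
    for ch in unicodedata.normalize("NFD", text):
        ccc = unicodedata.combining(ch)
        if starter is not None:
            # Combining classes of the characters between S and C.
            between = [unicodedata.combining(b) for b in out[starter + 1:]]
            if corrected:
                blocked = any(b == 0 or b >= ccc for b in between)
            else:
                blocked = any(b == 0 or b == ccc for b in between)
            composite = None if blocked else hangul_lv(out[starter], ch)
            if composite is not None:
                out[starter] = composite
                continue
        if ccc == 0:
            starter = len(out)
        out.append(ch)
    return "".join(out)


x = "\u1100\u0300\u1161\u0323"
for corrected in (False, True):
    once = toy_nfc(x, corrected)
    twice = toy_nfc(once, corrected)
    print("D2'" if corrected else "D2", "idempotent:", once == twice)
```

With the old D2 the second pass reorders the two accents, so the check fails; with D2' the input is left unchanged on both passes.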

Recommendations

Correct D2 to D2'.
Issue a corrigendum for previous versions of Unicode, so that people running on those versions of Unicode who want to correct their implementations can do so.
In the NormalizationTest data file, add test cases for each of the rows in Double Starters.
Provide a mechanism for protocols that want to keep the old behavior and yet update to new versions of Unicode (with new characters). See below.
Before making any changes, however, raise this as a Public Review Issue to give important groups like the IETF a chance to comment.

Implementations Surveyed

For all implementations that we have been able to review, either the code follows the sample code and needs no changes, or the required change is quite small, typically converting a not-equals (!=) to a less-than (<) on one line of code. As discussed above, in practice no data should be affected.

Test Cases

To see whether an implementation has the problem, check the following cases to make sure that they do not change after applying NFC.

<U+0B47; U+0300; U+0B3E>
<U+1100; U+0300; U+1161>
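
A check along these lines can be scripted. The sketch below uses Python's standard unicodedata.normalize as a stand-in for the implementation under test; the `normalize` parameter is a hypothetical hook for substituting another implementation with the same calling convention.

```python
import unicodedata

TEST_CASES = [
    # ORIYA VOWEL SIGN E, COMBINING GRAVE ACCENT, ORIYA VOWEL SIGN AA
    "\u0B47\u0300\u0B3E",
    # HANGUL CHOSEONG KIYEOK, COMBINING GRAVE ACCENT, HANGUL JUNGSEONG A
    "\u1100\u0300\u1161",
]


def unchanged_by_nfc(text, normalize=unicodedata.normalize):
    """Return True if applying NFC leaves the text exactly as it was."""
    return normalize("NFC", text) == text


for case in TEST_CASES:
    codes = " ".join(f"U+{ord(c):04X}" for c in case)
    print(codes, "ok" if unchanged_by_nfc(case) else "CHANGED")
```

An implementation that still follows the old D2 would report CHANGED for both cases.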

Implementations Surveyed

Python unicodedata module ( http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/dist/src/Modules/unicodedata.c ): needs examination
libidn ( http://josefsson.org/libidn/ ): needs examination
ICU: needs change
charlint, perl (http://dev.w3.org/cvsweb/charlint/charlint.pl): no change necessary
idnkit ( http://www.nic.ad.jp/ja/idn/idnkit/download/ ): needs change
xml1.1test normalization checking code ( http://dev.w3.org/cvsweb/charlint/xml1.1test/ ): needs change

Other Approaches

If a format (such as for identifiers) uses NFC or NFKC and wants to remain stable across the change, one possible approach is to forbid sequences of the form <non-starter, second-starter>, where the second-starters are defined in the second column of the table Double Starters below. The second-starters are also defined in ComposingChars in the W3C Character Model for the World Wide Web 1.0. (The public Working Draft 22 August 2003 version of the table mistakenly includes 0FB5 and 0FB7. Those are to be fixed in the next version.) These sequences should never appear in real data, so the restriction does not cause any lack of expressiveness.
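
As a rough illustration of such a restriction, the sketch below rejects any second-starter that immediately follows a non-starter. The code point set is transcribed from the second column of the table Double Starters; the function name and the choice to return an index are assumptions made for this example, and unicodedata.combining reflects whatever Unicode version the Python runtime ships with.

```python
import unicodedata

# Second-starters from the second column of the table Double Starters.
SECOND_STARTERS = (
    {0x09BE, 0x09D7, 0x0B3E, 0x0B56, 0x0B57, 0x0BBE, 0x0BD7,
     0x0CC2, 0x0CD5, 0x0CD6, 0x0D3E, 0x0D57, 0x102E, 0x0DCF, 0x0DDF}
    | set(range(0x1161, 0x1176))   # HANGUL JUNGSEONG A..I
    | set(range(0x11A8, 0x11C3))   # HANGUL JONGSEONG KIYEOK..HIEUH
)


def find_forbidden(text: str) -> int:
    """Return the index of the first second-starter that directly follows
    a non-starter, or -1 if the text contains no such sequence."""
    for i in range(1, len(text)):
        if (ord(text[i]) in SECOND_STARTERS
                and unicodedata.combining(text[i - 1]) != 0):
            return i
    return -1


print(find_forbidden("\u1100\u0300\u1161"))  # accent before the vowel
print(find_forbidden("\u1100\u1161\u0300"))  # accent after the vowel
```

A format adopting this approach would reject the first string and accept the second.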

Alternatively, the format could remain conservative, and continue to use the old version of D2 even after updating to Unicode 4.0.1 or later. In that case, when updating, the format will claim conformance to Unicode Normalization as modified by the use of Unicode 4.0.0 UAX #15 D2. (We may provide a better way to phrase this.)

Table 2: Double Starters
First Starter	Second Starter
09C7 BENGALI VOWEL SIGN E	09BE BENGALI VOWEL SIGN AA or 09D7 BENGALI AU LENGTH MARK
0B47 ORIYA VOWEL SIGN E	0B3E ORIYA VOWEL SIGN AA or 0B56 ORIYA AI LENGTH MARK or 0B57 ORIYA AU LENGTH MARK
0BC6 TAMIL VOWEL SIGN E	0BBE TAMIL VOWEL SIGN AA or 0BD7 TAMIL AU LENGTH MARK
0BC7 TAMIL VOWEL SIGN EE	0BBE TAMIL VOWEL SIGN AA
0B92 TAMIL LETTER O	0BD7 TAMIL AU LENGTH MARK
0CC6 KANNADA VOWEL SIGN E	0CC2 KANNADA VOWEL SIGN UU or 0CD5 KANNADA LENGTH MARK or 0CD6 KANNADA AI LENGTH MARK
0CBF KANNADA VOWEL SIGN I or 0CCA KANNADA VOWEL SIGN O	0CD5 KANNADA LENGTH MARK
0D47 MALAYALAM VOWEL SIGN EE	0D3E MALAYALAM VOWEL SIGN AA
0D46 MALAYALAM VOWEL SIGN E	0D3E MALAYALAM VOWEL SIGN AA or 0D57 MALAYALAM AU LENGTH MARK
1025 MYANMAR LETTER U	102E MYANMAR VOWEL SIGN II
0DD9 SINHALA VOWEL SIGN KOMBUVA	0DCF SINHALA VOWEL SIGN AELA-PILLA or 0DDF SINHALA VOWEL SIGN GAYANUKITTA
1100..1112 HANGUL CHOSEONG KIYEOK..HIEUH [19 instances]	1161..1175 HANGUL JUNGSEONG A..I [21 instances]
[:HangulSyllableType=LV:]	11A8..11C2 HANGUL JONGSEONG KIYEOK..HIEUH [27 instances]
