L2/04-094
Re: | Normalization Issue |
From: | Mark Davis, Martin Duerst |
Date: | 2004-02-04 |
We just became aware of a problem with the language in definition D2 in the specification of UAX #15 Normalization Forms. Definition D2 defines what it means for a character to be blocked with the following text:
D2. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same combining class as C.
The original design, the implementations that were used to develop normalization, and the sample code in UAX #15 and charlint all actually follow the definition below (the missing wording was a glitch that escaped examination):
D2'. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same or higher combining class as C.
The following table shows the difference between D2 and D2', where k and i are nonzero canonical combining classes (ccc).
ccc(S) | ccc(B) | ccc(C) | B blocks C? | Comments |
---|---|---|---|---|
0 | 0 | 0 | irrelevant | impossible: B becomes S as soon as it is seen |
0 | 0 | k | irrelevant | impossible: B becomes S as soon as it is seen |
0 | i | 0 | yes | was no in D2 |
0 | i | k = i | yes | avoids affecting ordering in both cases |
0 | i | k > i | no | allows combination in both cases |
0 | i | i > k | irrelevant | impossible after canonical reordering |
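The two definitions can be compared directly in code. The following is a minimal sketch, not the UAX #15 sample code: the function name `is_blocked` and its `corrected` flag are hypothetical, and combining classes are taken from Python's standard `unicodedata` module.

```python
import unicodedata


def is_blocked(seq: str, c_index: int, corrected: bool = True) -> bool:
    """Report whether seq[c_index] is blocked from the starter seq[0].

    corrected=False applies the original D2 (same combining class);
    corrected=True applies D2' (same or higher combining class).
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:          # every character B between S and C
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:                # B is a starter
            return True
        if corrected and ccc_b >= ccc_c:
            return True
        if not corrected and ccc_b == ccc_c:
            return True
    return False


# Row "0 | i | 0" of the table: a starter C behind a non-starter B.
seq = "\u1100\u0300\u1161"
print(is_blocked(seq, 2, corrected=False))  # False under D2
print(is_blocked(seq, 2, corrected=True))   # True under D2'
```

For the other possible rows (k = i and k > i) the two settings of `corrected` give the same answer, which matches the table.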
Fortunately, the difference between these is limited to sequences that should never occur in real data. The only situations where there is a difference are those where two starters that would primary-combine happen to be separated by a non-starter. The two starters can only be those in the table Double Starters (at the end of this document). An important feature of these combinations is that, in well-formed text in any language, no non-starters will occur between the possible pairs of starter characters that primary-combine. For example, one never sees the sequence:
(A) U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+1161 (ᅡ) HANGUL JUNGSEONG A
According to the old D2, the NFC form of this would be U+AC00 (가) HANGUL SYLLABLE GA + U+0300 (◌̀) COMBINING GRAVE ACCENT; that is, the second starter is not blocked from the first, and combines with it. D2' prevents this, and the NFC form stays the same.
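The D2' behavior can be checked against an existing normalizer. The sketch below assumes that the sequence in question is U+1100 + U+0300 + U+1161 (inferred from the stated NFC result U+AC00 + U+0300) and that the Python build's `unicodedata` module follows the corrected definition; the variable names are illustrative only.

```python
import unicodedata

# Assumed reconstruction of the sequence: leading consonant, grave accent, vowel.
seq_a = "\u1100\u0300\u1161"

# Under D2' the intervening non-starter blocks the vowel, so NFC leaves the
# sequence as it is.
nfc_a = unicodedata.normalize("NFC", seq_a)
unchanged = nfc_a == seq_a

# For contrast, the adjacent pair composes into U+AC00 HANGUL SYLLABLE GA.
adjacent = unicodedata.normalize("NFC", "\u1100\u1161")

print(unchanged, adjacent == "\uac00")
```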
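Although one would never hit these sequences in real data, formally it is important to correct the definition for two reasons. Under the old D2, there are strings x and y for which:
1. toNFD(x) ≠ toNFD(y), but toNFC(x) = toNFC(y); that is, NFC maps strings that are not canonically equivalent to the same result.
2. toNFC(toNFC(x)) ≠ toNFC(x); that is, NFC is not idempotent.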
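An example of where D2 causes failure of idempotency is where x is sequence (A) above followed by U+0323 (◌̣) COMBINING DOT BELOW. The first NFC produces U+AC00 (가) HANGUL SYLLABLE GA + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+0323 (◌̣) COMBINING DOT BELOW.
The second NFC reverses the order of the accents, producing U+AC00 (가) HANGUL SYLLABLE GA + U+0323 (◌̣) COMBINING DOT BELOW + U+0300 (◌̀) COMBINING GRAVE ACCENT.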
Idempotency is important for consistent lookup. Requesting A and getting B, and then requesting B and getting C is clearly something to be avoided. Idempotency is also required so that isNFC(toNFC(x)) is always true.
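The idempotency failure can be reproduced with a toy composer. This is a sketch under stated assumptions, not a conformant normalizer: `toy_nfc` and `PRIMARY` are hypothetical names, the composition table holds a single Hangul pair, decomposition and canonical reordering are delegated to Python's `unicodedata`, and x is taken to be U+1100 + U+0300 + U+1161 followed by U+0323.

```python
import unicodedata

# Toy composition table holding one primary composite (hypothetical helper;
# a real implementation derives this from the Unicode Character Database).
PRIMARY = {("\u1100", "\u1161"): "\uac00"}


def toy_nfc(text: str, corrected: bool) -> str:
    """Decompose with unicodedata, then recompose using D2 or D2' blocking."""
    out = []
    starter = -1                       # index in out of the last starter
    for ch in unicodedata.normalize("NFD", text):
        ccc = unicodedata.combining(ch)
        if starter >= 0:
            # Combining classes of the characters between the starter and ch.
            between = [unicodedata.combining(b) for b in out[starter + 1:]]
            if corrected:
                blocked = any(b >= ccc for b in between)   # D2'
            else:
                blocked = any(b == ccc for b in between)   # old D2
            composite = PRIMARY.get((out[starter], ch))
            if composite and not blocked:
                out[starter] = composite
                continue
        if ccc == 0:
            starter = len(out)
        out.append(ch)
    return "".join(out)


x = "\u1100\u0300\u1161\u0323"
for corrected in (False, True):
    once = toy_nfc(x, corrected)
    twice = toy_nfc(once, corrected)
    print(corrected, once == twice)
```

With `corrected=False` the second pass reorders the two accents, so the two results differ; with `corrected=True` the string is left alone on both passes.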
For all implementations that we have been able to review, either the code follows the example of the sample code and needs no changes, or the required code change is quite small, typically converting a not-equals (!=) to a less-than (<) on one line of code. As discussed above, in practice no data should be affected.
Test Cases
To see whether an implementation has the problem, check the following cases to make sure that they do not change after applying NFC.
Implementations Surveyed
Other Approaches
If a format (such as for identifiers) uses NFC or NFKC and wants to remain stable across the change, one possible approach is to forbid sequences of the form <non-starter, second-starter>, where the second-starters are defined in the second column of the table Double Starters below. The second-starters are also defined in ComposingChars in the W3C Character Model for the World Wide Web 1.0. (The public Working Draft 22 August 2003 version of the table mistakenly includes 0FB5 and 0FB7. Those are to be fixed in the next version.) These sequences should never appear in real data, so the restriction does not cause any lack of expressiveness.
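A format that takes this approach needs a check for the forbidden pairs. The following is a minimal sketch: the name `has_forbidden_pair` is hypothetical, the set of second-starters is transcribed by hand from the second column of the Double Starters table, and "non-starter" is taken to mean a nonzero canonical combining class as reported by Python's `unicodedata`.

```python
import unicodedata

# Second starters transcribed from the Double Starters table.
SECOND_STARTERS = {
    0x09BE, 0x09D7, 0x0B3E, 0x0B56, 0x0B57, 0x0BBE, 0x0BD7,
    0x0CC2, 0x0CD5, 0x0CD6, 0x0D3E, 0x0D57, 0x102E, 0x0DCF, 0x0DDF,
}
SECOND_STARTERS.update(range(0x1161, 0x1176))  # HANGUL JUNGSEONG A..I
SECOND_STARTERS.update(range(0x11A8, 0x11C3))  # HANGUL JONGSEONG KIYEOK..HIEUH


def has_forbidden_pair(text: str) -> bool:
    """Return True if a non-starter is immediately followed by a second starter."""
    for prev, cur in zip(text, text[1:]):
        if unicodedata.combining(prev) != 0 and ord(cur) in SECOND_STARTERS:
            return True
    return False


print(has_forbidden_pair("\u1100\u0300\u1161"))  # accent between the two jamo
print(has_forbidden_pair("\u1100\u1161"))        # ordinary adjacent jamo
```

An identifier validator could reject any input for which this function returns true, before normalization is applied.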
Alternatively, the format could remain conservative, and continue to use the old version of D2 even after updating to Unicode 4.0.1 or later. In that case, when updating, the format will claim conformance to Unicode Normalization as modified by the use of Unicode 4.0.0 UAX #15 D2. (We may provide a better way to phrase this.)
Double Starters
First Starter | Second Starter |
---|---|
09C7 BENGALI VOWEL SIGN E | 09BE BENGALI VOWEL SIGN AA or 09D7 BENGALI AU LENGTH MARK |
0B47 ORIYA VOWEL SIGN E | 0B3E ORIYA VOWEL SIGN AA or 0B56 ORIYA AI LENGTH MARK or 0B57 ORIYA AU LENGTH MARK |
0BC6 TAMIL VOWEL SIGN E | 0BBE TAMIL VOWEL SIGN AA or 0BD7 TAMIL AU LENGTH MARK |
0BC7 TAMIL VOWEL SIGN EE | 0BBE TAMIL VOWEL SIGN AA |
0B92 TAMIL LETTER O | 0BD7 TAMIL AU LENGTH MARK |
0CC6 KANNADA VOWEL SIGN E | 0CC2 KANNADA VOWEL SIGN UU or 0CD5 KANNADA LENGTH MARK or 0CD6 KANNADA AI LENGTH MARK |
0CBF KANNADA VOWEL SIGN I or 0CCA KANNADA VOWEL SIGN O | 0CD5 KANNADA LENGTH MARK |
0D47 MALAYALAM VOWEL SIGN EE | 0D3E MALAYALAM VOWEL SIGN AA |
0D46 MALAYALAM VOWEL SIGN E | 0D3E MALAYALAM VOWEL SIGN AA or 0D57 MALAYALAM AU LENGTH MARK |
1025 MYANMAR LETTER U | 102E MYANMAR VOWEL SIGN II |
0DD9 SINHALA VOWEL SIGN KOMBUVA | 0DCF SINHALA VOWEL SIGN AELA-PILLA or 0DDF SINHALA VOWEL SIGN GAYANUKITTA |
1100..1112 HANGUL CHOSEONG KIYEOK..HIEUH [19 instances] | 1161..1175 HANGUL JUNGSEONG A..I [21 instances] |
[:HangulSyllableType=LV:] | 11A8..11C2 HANGUL JONGSEONG KIYEOK..HIEUH [27 instances] |