The stability of Unicode normalization has been the subject of a number of misunderstandings. In particular, implementers are often unclear about the meaning of the stability guarantees for normalization and how they impact the handling of normalization of Unicode strings across different versions of the Unicode Standard.
This background document introduces new terms that writers of other specifications may find useful, and proposes specifying a "Stable Normalization Process". The key concept is that once a Unicode string has been successfully normalized via the Stable Normalization Process, it will never change if subsequently normalized again, in any version of Unicode, past or future. (That guarantee is already provided by the existing normalization stability policies, but with this new definition it can be stated more clearly and succinctly.)
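As a minimal sketch of the idea, assume that Python's standard `unicodedata` module stands in for one version of the Unicode Character Database. The hypothetical helper `stable_normalize` below (its name and its use of `ValueError` are illustrative, not taken from UAX #15) refuses any string containing a code point that is unassigned in that version, as the required-behavior table later in this document calls for, and otherwise applies ordinary normalization. It does not detect problematic sequences, which a full implementation would also have to reject.

```python
import unicodedata


def stable_normalize(text: str, form: str = "NFC") -> str:
    """Normalize text, or raise ValueError on an unassigned code point.

    Sketch only: detection of problematic sequences is omitted.
    """
    for index, ch in enumerate(text):
        # General Category "Cn" marks code points that have no assignment
        # in the Unicode version shipped with this Python build.
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                f"unassigned code point U+{ord(ch):04X} at index {index}"
            )
    return unicodedata.normalize(form, text)


once = stable_normalize("e\u0301")  # e + COMBINING ACUTE ACCENT
twice = stable_normalize(once)      # normalizing the result again
assert once == twice == "\u00e9"    # the string no longer changes
```

Because the input is rejected unless every code point is already assigned, the result does not depend on properties that a later version might give to a currently unassigned code point.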
The changes to UAX #15 to specify the Stable Normalization Process could be rather small: new definitions and conformance clauses would be added without materially affecting the definition of any existing normalization form.
It is anticipated, however, that UAX #15 will also have further explanatory information added, and that a more thorough reorganization of the text of UAX #15 will be undertaken to make the concepts and implications of Unicode normalization more accessible to implementers.
(The links below are to final proposed update versions of UAX #15, since the final approved versions are not yet posted.)
To the section:
http://www.unicode.org/reports/tr15/tr15-26.html#Conformance
add:
UAX15-C5. A process that purports to transform text according to the Stable Normalization Process must do so in accordance with the specifications in this document.
To the section:
http://www.unicode.org/reports/tr15/tr15-26.html#Specification
add:
Version | Examples | Required Behavior |
---|---|---|
Unicode 3.2 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | must abort with an error if it encounters the character |
Unicode 3.2 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | must abort with an error if it encounters the character |
Unicode 3.2 | U+04CF (ӏ) CYRILLIC SMALL LETTER PALOCHKA (defined in Unicode 5.0) | must abort with an error if it encounters the character |
Unicode 4.0 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character |
Unicode 4.0 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | must abort with an error if it encounters the character |
Unicode 4.0 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | must abort with an error if it encounters the character |
Unicode 4.1 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character |
Unicode 4.1 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | will accept the character |
Unicode 4.1 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | must abort with an error if it encounters the character |
Unicode 5.0 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character |
Unicode 5.0 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | will accept the character |
Unicode 5.0 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | will accept the character |
All Versions | U+09C7 (ে) BENGALI VOWEL SIGN E + U+0300 ( ̀) COMBINING GRAVE ACCENT + U+09BE (া) BENGALI VOWEL SIGN AA | must abort with an error if it encounters the sequence (an example from Table 10) |
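The version dependence shown in the table can be observed with Python's standard `unicodedata` module, which exposes its current database alongside a frozen Unicode 3.2 database as `unicodedata.ucd_3_2_0`. The sketch below uses a hypothetical helper, `unassigned_in`, to list the code points that would force an abort under a given database. It treats General Category `Cn` as "unassigned" and does not cover the problematic-sequence row.

```python
import unicodedata


def unassigned_in(text, ucd):
    """Return the code points of text that are unassigned (Cn) in ucd.

    ucd is either the unicodedata module itself or a versioned database
    object such as unicodedata.ucd_3_2_0; both provide category().
    """
    return [f"U+{ord(ch):04X}" for ch in text if ucd.category(ch) == "Cn"]


# Characters first defined in Unicode 4.0, 4.1, and 5.0 respectively.
sample = "\u0234\u0237\u04cf"

# Checked against the Unicode 3.2 database.
print(unassigned_in(sample, unicodedata.ucd_3_2_0))

# Checked against the database of the running Python.
print(unassigned_in(sample, unicodedata))
```

On a Python whose database is Unicode 5.0 or later, the first call should list all three code points (a Unicode 3.2 process must abort) and the second should return an empty list (the characters are accepted).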