L2/03-058
Re: Variant Normalization Forms
From: Mark Davis
Date: 2003-02-19
This contains a draft proposal, as per Action 93-A61.
[93-A61] Action Item for Mark Davis, Editorial Committee: Prepare proposed update text for addition to Unicode Standard Annex #15 Unicode Normalization Forms to address the need for tailorings of normalization, for the next UTC meeting.
This turned out to be fairly tricky, when it came to following out all the ramifications in production software. I have to thank Markus Scherer, who did much of the heavy lifting: prototyping and testing both functional consistency and performance, and finding odd edge cases.
I am still uneasy about whether the whole concept of variant normalization forms is a good idea or not, but at least we have a coherent proposal to discuss at the meeting.
Proposal
Variant Normalization Forms
The Unicode Technical Committee recognizes that some implementations may need to use variant normalization forms, ones that do not match the standard forms in some way. However, there is a significant danger that inconsistent normalization forms will lead to processing incompatibilities and security flaws. Thus only a small number of such variant normalization forms are defined, and their definition is carefully constrained. The two defined Variant Normalization Forms (VNF) in this version of the Unicode Standard are:
Name | Description |
---|---|
VNFC-CI | Identical to NFC, except excluding the decomposition of the CJK COMPATIBILITY IDEOGRAPH characters: F900..FA6A, 2F800..2FA1D |
VNFD-CI | Identical to NFD, except excluding the decomposition of the CJK COMPATIBILITY IDEOGRAPH characters: F900..FA6A, 2F800..2FA1D |
While the above are variants of NFC and NFD, there may in the future be variants of NFKC and NFKD.
Note: The range of compatibility characters above includes some characters that do not have decomposition mappings. This is simply to make the ranges more comprehensible; including such characters has no effect since they are already automatically excluded.
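The note above can be checked with a short sketch that lists which code points in the two ranges carry no canonical decomposition mapping. It assumes Python's standard `unicodedata` module, whose data reflects the Unicode version bundled with the interpreter rather than the version this proposal targets; the helper names `has_canonical_decomposition` and `without_decomposition` are illustrative only.

```python
import unicodedata

# Ranges copied from the table above (inclusive bounds).
CI_RANGES = ((0xF900, 0xFA6A), (0x2F800, 0x2FA1D))


def has_canonical_decomposition(cp):
    """True if the code point has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(chr(cp))
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    return bool(mapping) and not mapping.startswith("<")


def without_decomposition(ranges=CI_RANGES):
    """Code points in the ranges that normalization already leaves untouched."""
    return [
        cp
        for lo, hi in ranges
        for cp in range(lo, hi + 1)
        if not has_canonical_decomposition(cp)
    ]


if __name__ == "__main__":
    # The exact list depends on the Unicode version of the Python build.
    print(" ".join(f"U+{cp:04X}" for cp in without_decomposition()))
```

Listing such code points in the exclusion set is harmless: with no mapping to apply, both the standard and the variant forms already leave them unchanged.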
Constraints. The constraints on these (and any possible future) VNFs are that they are formed by excluding a standard, consistent, specified set of characters from the decomposition and composition steps specified in Section 5, Specification, of UAX #15 and in Chapter 3 of The Unicode Standard. This set is called the exclusion set. The following are the conditions on any Variant Normalization Form:
Exclusion. A Variant Normalization Form V is defined by the combination of a Normalization Form NF and an exclusion set ES:
For every character c in ES, V(c) = c
For every string s not containing any characters in ES, V(s) = NF(s)
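The two exclusion conditions can be sketched as follows. This is not the mechanism proposed here (which excludes characters inside the decomposition and composition steps themselves); it is a simplified illustration that splits the input at excluded characters and applies the standard form to the pieces in between. That shortcut is assumed to be adequate only because the CJK compatibility ideographs are starters that never take part in composition. The names `in_exclusion_set` and `variant_normalize` are hypothetical.

```python
import unicodedata

# Exclusion set ES for VNFC-CI / VNFD-CI, as given in the table (inclusive).
CI_RANGES = ((0xF900, 0xFA6A), (0x2F800, 0x2FA1D))


def in_exclusion_set(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CI_RANGES)


def variant_normalize(form, s):
    """Apply NF (form is "NFC" or "NFD") to s, leaving characters in ES untouched."""
    out, run = [], []
    for ch in s:
        if in_exclusion_set(ch):
            # Flush the pending run through the standard form: V(s) = NF(s).
            out.append(unicodedata.normalize(form, "".join(run)))
            run = []
            # Excluded character passes through unchanged: V(c) = c.
            out.append(ch)
        else:
            run.append(ch)
    out.append(unicodedata.normalize(form, "".join(run)))
    return "".join(out)
```

For example, `variant_normalize("NFD", "\uF900")` returns U+F900 unchanged, whereas standard NFD maps it to U+8C48; a string with no excluded characters gets exactly the standard result.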
Consistency. Variant Normalization Forms are defined in pairs, one for composition and one for decomposition. Each pair is consistent with the other in the following way. If VC is a composition Variant Normalization Form and VD is the corresponding decomposition Variant Normalization Form, then for all strings x and y:
VC(x) = VC(y) if and only if VD(x) = VD(y)
This implies that if a character A is excluded from being decomposed, then it is also excluded from being the result of any composition (and vice versa). The exclusion set is the same for both composition and decomposition.
Thus if U+00C4 (Ä) LATIN CAPITAL LETTER A WITH DIAERESIS is excluded, then U+01DE (Ǟ) LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON, whose canonical decomposition is U+00C4 followed by U+0304 COMBINING MACRON, must also be excluded.
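The consistency condition can be spot-checked over a handful of sample strings. The sketch below is only a test harness, not a proof: it assumes the split-and-normalize approximation of VNFC-CI and VNFD-CI (valid only for an exclusion set of starters that never compose), and the names `variant_normalize`, `is_consistent`, and `SAMPLES` are hypothetical.

```python
import re
import unicodedata
from functools import partial
from itertools import combinations

# One capturing group, so re.split keeps the excluded characters at odd indices.
CI = re.compile("([\uF900-\uFA6A\U0002F800-\U0002FA1D])")


def variant_normalize(form, s):
    """Standard form on the text between excluded characters; those stay as-is."""
    return "".join(
        part if i % 2 else unicodedata.normalize(form, part)
        for i, part in enumerate(CI.split(s))
    )


def is_consistent(compose, decompose, samples):
    """Check VC(x) = VC(y) if and only if VD(x) = VD(y) for all sample pairs."""
    return all(
        (compose(x) == compose(y)) == (decompose(x) == decompose(y))
        for x, y in combinations(samples, 2)
    )


SAMPLES = ["\uF900", "\u8C48", "\u00C4", "A\u0308", "\uF900\u0301", "\u8C48\u0301"]

vnfc = partial(variant_normalize, "NFC")
vnfd = partial(variant_normalize, "NFD")
nfd = partial(unicodedata.normalize, "NFD")

if __name__ == "__main__":
    print(is_consistent(vnfc, vnfd, SAMPLES))  # matched pair
    print(is_consistent(vnfc, nfd, SAMPLES))   # variant paired with standard NFD
```

On these samples the matched pair passes, while pairing the variant composition form with standard NFD fails: standard NFD maps both U+F900 and U+8C48 to U+8C48, but the variant composition form keeps them distinct. This illustrates why the forms are defined in pairs sharing one exclusion set.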