From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Mar 15 2007 - 22:42:29 CST
Note that for NFD (and NFKD) it's not enough to decompose the characters
recursively; you also need to reorder them according to their combining
class (except for those whose combining class is 0 that are limiting the
span of characters that can be reordered). You also have to decompose the
Hangul syllables algorithmically (they are absent from the UCD tables)
For NFC (and NFKC), it's also not enough to only recompose the pairs,
because some decomposable characters followed by a combining characters will
have to be recomposed in another way, by permuting the combining characters
if they have different non-zero combining classes, and then recompose the
first one.
Note also that when composing pairs, the NFC or NFKC forms will sometime
have to compose non-adjacent characters, where there may be an intermediate
combining character with a non-zero combining class between the base
character and the combining character selected for combination: this occurs
when there's no composition possible with the first combining character, but
the composition is possible with the next one, provided that it has a
*different* non-zero combining class.
Suppose that the combining classes of characters are like this in a string
in NFD form:
[0, 220, 230, 0]
If composition is not possible with the first two characters, it may be
possible with the (0, 230) pair and the result will be the string whose
character combining classes will be [0, 220, 0] where the first zero results
of the composition of the first and third character. This still preserves
the canonical equivalence because this composed string will still be
reordered when decomposed...
There are also a few combining characters that are precomposed pairs but
must be decomposed first because one of them will recompose with a prior
base character. This occurs for example with the Greek character block)
When computing NFC and NFKC, you still need to compute the reordering of
combining characters even if there's no way to assemble them in pairs,
because not all of them are composable in pairs (or triples for algorithmic
Hangul syllables but these do not require reordering).
Finally when composing characters, be warned that some precomposed
characters are excluded from recomposition (so they must be decomposed in
NFD or NFKD, but not be recomposed in NFC or NFKC; these characters are then
necessarily absent from strings in all four normalized forms, and they are
present in Unicode only for round-trip compatibility with past encodings, or
sometimes excluded of recompositions for stability of the normalized forms
across Unicode versions). You may eventually compose them without breaking
the canonical equivalence, but the resulting string will not be in the
stable normalized form (this means that a string in NFC form is not
necessarily the shorted one within the set of canonically equivalent
strings: NFC is not a compression algorithm).
> -----Message d'origine-----
> De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
> part de Daniel Ehrenberg
> Envoyé : vendredi 16 mars 2007 03:49
> À : unicode@unicode.org
> Objet : Implementing NFC
>
> Hi,
> I'm working on adding Unicode support (possibly eventually conformace)
> to an obscure programming language called Factor, which is sort of a
> cross between Forth and Lisp (see factorcode.org for more
> information). One thing that I'm doing is that all strings will always
> be kept in Normalization Form D (as defined in UAX #15: Normalization
> Forms) for processing. That way all canonically equivalent strings
> return true when tested for equality. It wasn't difficult to implement
> NFD (or NFKD); I just needed to read the transformations from
> UnicodeData.txt and apply them recursively to get a hash table of
> characters to canonical/compatability-decomposed strings. But for most
> I/O purposes, I need to use NFC, re-composing all decomposed
> characters. I have no idea how to do this efficiently. In many cases,
> it's more complicated than just turning two adjacent characters into
> one character.
>
> I looked at both the Glib source (which defines basic unicode
> operations) and the Normalizer demo that UAX 15 links to (which, btw,
> only works properly for the BMP, which is bad). They both appear to
> use generally the same strategy: perform as many pairwise compositions
> on adjacent characters as possible. I wonder if I'm reading it wrong,
> because if that's how it operates, then one of the examples in the UAX
> wouldn't work properly: NFC(U+017F U+0323 U+0307) = U+1E9B U+0323.
> This composes two non-adjacent characters. Is there any efficient way
> to do this composition without messing up canonical ordering while
> making sure to compose non-adjacent characters like this? It's an edge
> case, I know, but I want my implementation to be correct.
>
> In many places, the Unicode standard provides clues for
> implementation, but I see none for NFC (or NFKC) and how to compose
> characters. Can anyone help me?
This archive was generated by hypermail 2.1.5 : Thu Mar 15 2007 - 22:46:00 CST