From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Tue Dec 25 2007 - 06:45:57 CST
Benjamin M Scarborough wrote:
> [...] I'm unclear as to whether the NFC form would be
> <U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW, U+0328
> COMBINING OGONEK> (which is the shortest form) or <U+0104 LATIN
> CAPITAL LETTER A WITH OGONEK, U+0323 COMBINING DOT BELOW, U+0302
> COMBINING CIRCUMFLEX ACCENT>.
The latter. In the canonical decomposition phase, the nonspacing marks
are reordered to a fixed order according to the ccc = Canonical
Combining Class property. (In this case, this happens to coincide with
their original order in the data.) Then, in the canonical composition
phase, characters are combined starting from a starter character like
"A" and first using the _next_ combining mark. Here the ogonek gets
combined with "A", giving U+0104, and after this, no further
compositions are possible, since there is no precomposed character for
U+0104 combined with a dot below or a circumflex.
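
The two phases described above can be illustrated with a minimal sketch,
assuming Python's standard unicodedata module (its data follows whatever
Unicode version the Python release ships with). The variable names are
only illustrative, and the marks are deliberately listed out of
canonical order so that the reordering is visible:

```python
import unicodedata

# The three marks from the question, deliberately out of canonical order.
marks = ["\u0302", "\u0323", "\u0328"]  # circumflex, dot below, ogonek

# Canonical Combining Class (ccc) of each mark.
ccc = {f"U+{ord(m):04X}": unicodedata.combining(m) for m in marks}

# Canonical ordering amounts to a stable sort of adjacent marks by ccc.
ordered = sorted(marks, key=unicodedata.combining)

# Let the library run the full decomposition and composition.
nfd = unicodedata.normalize("NFD", "A" + "".join(marks))
nfc = unicodedata.normalize("NFC", "A" + "".join(marks))

print(ccc)                                # ccc value of each mark
print([f"U+{ord(c):04X}" for c in nfd])   # marks reordered by ccc
print([f"U+{ord(c):04X}" for c in nfc])   # ogonek composed with "A"
```

The ogonek has the lowest ccc of the three (202, against 220 for the dot
below and 230 for the circumflex), so it ends up directly after "A" and
is the mark that gets composed.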
One way to check things quickly is to use the BabelPad editor, which
lets you input character data in different ways and then select a string
and use the Convert command first to convert to NFC and then the
characters to U+nnnnn notation.
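
A similar check can be scripted. The following is a rough sketch, again
assuming Python's standard unicodedata module; the helper name
nfc_codepoints is made up for this example and is not part of any
library:

```python
import unicodedata

def nfc_codepoints(text: str) -> str:
    """Hypothetical helper: NFC form of text as U+nnnn values."""
    normalized = unicodedata.normalize("NFC", text)
    return " ".join(f"U+{ord(ch):04X}" for ch in normalized)

# The two candidate sequences from the question.
candidate_short = "\u1EAC\u0328"        # U+1EAC, U+0328
candidate_long = "\u0104\u0323\u0302"   # U+0104, U+0323, U+0302

print(nfc_codepoints(candidate_short))
print(nfc_codepoints(candidate_long))
```

Both calls should print the same sequence, U+0104 U+0323 U+0302, since
the two candidates are canonically equivalent and only the second one is
already in NFC.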
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/