From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 29 2006 - 14:16:55 CDT
Theodore H. Smith wrote on Sunday, May 28, 2006 at 3:18 PM
> I'm wondering, what limitations would it have for being useful for doing
> decomposition? And for doing composition?
The limitation of what you currently have is that it may not work for
letters carrying two or more combining marks.
> Is it true, that if I perform a proper combining character reordering (As
> described by UTR15) upon some Unicode text, and then did my "parallel
> string replacement based composer" upon the text, that I'd generate
> correct NFC?
No. Consider <U+006D LATIN SMALL LETTER M, U+0325 COMBINING RING BELOW,
U+0301 COMBINING ACUTE ACCENT>, which is in NFD. (It is the last letter in
one spelling of the reconstructed Proto-Indo-European word for 'seven' -
septḿ̥) The canonically equivalent NFC form is <U+1E3F LATIN SMALL LETTER
M WITH ACUTE, U+0325 COMBINING RING BELOW>. The acute is the mark that fuses
because there is no precomposed 'LATIN SMALL LETTER M WITH RING BELOW'. On
the other hand the
more traditional spelling, septṃ́, ends in what is expressed in NFD as
<U+006D LATIN SMALL LETTER M, U+0323 COMBINING DOT BELOW, U+0301 COMBINING
ACUTE ACCENT>. The canonically equivalent NFC form is <U+1E43 LATIN SMALL
LETTER M WITH DOT BELOW, U+0301 COMBINING ACUTE ACCENT>.
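The two examples above can be checked against a library normalizer. The following is a minimal sketch using Python's standard unicodedata module; the helper name codepoints and the variable names are only illustrative.

```python
import unicodedata

# The two NFD sequences discussed above.
ring_nfd = "m\u0325\u0301"  # m, COMBINING RING BELOW, COMBINING ACUTE ACCENT
dot_nfd = "m\u0323\u0301"   # m, COMBINING DOT BELOW, COMBINING ACUTE ACCENT


def codepoints(text):
    """Render a string as a list of U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Both inputs are already in NFD, so NFD normalization leaves them unchanged.
ring_is_nfd = unicodedata.normalize("NFD", ring_nfd) == ring_nfd
dot_is_nfd = unicodedata.normalize("NFD", dot_nfd) == dot_nfd

ring_nfc = unicodedata.normalize("NFC", ring_nfd)
dot_nfc = unicodedata.normalize("NFC", dot_nfd)

print(codepoints(ring_nfc))  # ['U+1E3F', 'U+0325']
print(codepoints(dot_nfc))   # ['U+1E43', 'U+0301']
```

Note that in the first case the mark that fuses is not the one adjacent to the base letter, which is what a purely left-to-right string replacement would miss.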
The complication in forming NFC is choosing which following character of
non-zero combining class to combine with. The candidates are all of those
following characters that could come immediately after what has been
combined so far in some canonically equivalent sequence. Which candidate
fuses is in a sense arbitrary, but so as to have a *canonical* form, the one
that comes first in NFD order is chosen.
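The selection rule just described can be sketched in code. This is a simplified illustration, not a full UTR15 implementation: it assumes the input is already in NFD, it omits the algorithmic Hangul compositions, and the names build_pairs, PAIRS and compose are hypothetical. The pair table is derived from unicodedata.decomposition, and a composite is skipped if NFC itself would not produce it (which covers the composition exclusions).

```python
import sys
import unicodedata


def build_pairs():
    """Map (first, second) -> composite for two-character canonical decompositions."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # excluded from composition
        pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs


PAIRS = build_pairs()


def compose(nfd):
    """Compose a string that is assumed to be in NFD already."""
    out = []
    starter = None  # index in out of the most recent starter
    last_ccc = -1   # class of the last uncombined mark after it; -1 if none
    for ch in nfd:
        ccc = unicodedata.combining(ch)
        # ch may reach the starter only if no uncombined character of equal
        # or higher combining class stands between them.
        if starter is not None and last_ccc < ccc:
            fused = PAIRS.get((out[starter], ch))
            if fused is not None:
                out[starter] = fused  # fuse, then keep scanning in NFD order
                continue
        if ccc == 0:
            starter = len(out)
            last_ccc = -1
        else:
            last_ccc = ccc
        out.append(ch)
    return "".join(out)


print(compose("m\u0325\u0301") == "\u1e3f\u0325")  # True
print(compose("m\u0323\u0301") == "\u1e43\u0301")  # True
```

Because the scan runs left to right over NFD text, the first mark that can fuse is the one that comes first in NFD order, which is the choice described above.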
Thus, for the first sequence above, <U+006D, U+0325, U+0301> and <U+006D,
U+0301, U+0325> are equivalent. However, only U+0301 and U+006D combine, so
one combines to yield <U+1E3F, U+0325>. Further combination is not
possible.
For the second sequence, <U+006D, U+0323, U+0301> and <U+006D, U+0301,
U+0323> are equivalent. U+006D could combine with either of the following
combining characters. U+0323 is of combining class 220 and U+0301 is of
combining class 230, so for definiteness U+006D is combined with U+0323,
yielding <U+1E43, U+0301>. Further combination is not possible.
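The combining classes and the equivalence of the reordered sequences can also be confirmed directly. Again this is only a sketch with Python's unicodedata; the variable names are illustrative.

```python
import unicodedata

# Canonical combining classes of the three marks involved.
classes = {
    f"U+{ord(mark):04X}": unicodedata.combining(mark)
    for mark in ("\u0325", "\u0323", "\u0301")
}
print(classes)  # {'U+0325': 220, 'U+0323': 220, 'U+0301': 230}

# The acute-first orderings are canonically equivalent to the NFD ones:
# canonical reordering sorts the marks by combining class.
swapped_ring = "m\u0301\u0325"
swapped_dot = "m\u0301\u0323"
ring_reordered = unicodedata.normalize("NFD", swapped_ring)
dot_reordered = unicodedata.normalize("NFD", swapped_dot)

# NFC therefore gives the same result whichever order the marks arrived in.
ring_composed = unicodedata.normalize("NFC", swapped_ring)
dot_composed = unicodedata.normalize("NFC", swapped_dot)
```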
Richard.