RE: sara am ordering (was RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo?)

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu Apr 01 2004 - 06:45:38 EST


    Peter Constable wrote:

    > Your doc says,
    >
    > <quote, emphasis added>
    > And ำ should be ordered as า followed by ํ (**which is the
    > logical sequence, despite the Unicode compatibility decomposition**).
    > </quote>
    >
    > What do you mean here by "logical sequence"? That that's how
    > it should be interpreted phonologically and for sorting
    > purposes,

    Yes.

    > or that that is the correct encoded sequence for
    > decomposed representations?

    Well, it appears that sara am is rarely decomposed in practice
    (unless one applies NFKD or NFKC, as is done for IDNs).
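    The decomposition under discussion can be inspected with a minimal Python
    sketch, assuming only the standard unicodedata module (the helper name
    code_points is made up for illustration). It shows that the canonical
    form NFD leaves sara am alone, while the compatibility form NFKD yields
    nikhahit followed by sara aa, for both Thai and Lao:

```python
import unicodedata

THAI_SARA_AM = "\u0E33"  # THAI CHARACTER SARA AM
LAO_AM = "\u0EB3"        # LAO VOWEL SIGN AM


def code_points(text):
    """Illustrative helper: render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Canonical decomposition does not touch sara am.
thai_nfd = unicodedata.normalize("NFD", THAI_SARA_AM)

# Compatibility decomposition puts the nikhahit first, then sara aa.
thai_nfkd = unicodedata.normalize("NFKD", THAI_SARA_AM)
lao_nfkd = unicodedata.normalize("NFKD", LAO_AM)

print(code_points(thai_nfd))   # ['U+0E33']
print(code_points(thai_nfkd))  # ['U+0E4D', 'U+0E32']
print(code_points(lao_nfkd))   # ['U+0ECD', 'U+0EB2']
```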

    However, the spelling convention in Khmer, where the nikhahit
    looks much like it does for Thai and Lao, appears to be to have
    the nikhahit after the vowel mark (and there are no compatibility
    precomposed forms). Ideally the <C, dep. vowel, nikhahit> sequence
    should be interpreted the same as <C, nikhahit, dep. vowel> for Thai,
    Lao, and Khmer (for their respective nikhahits). But all of the nikhahits
    have combining class 0, so that will not follow from Unicode equivalences.
    For collation, at least, my suggestion (in the referenced documents) is
    to treat them as equivalent for the orthographically used combinations
    in Thai, Lao, and Khmer.
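    A rough Python sketch of the point about combining classes follows. It
    checks that the three nikhahits have combining class 0, shows that
    normalization therefore does not unify the two orders, and then applies
    a hypothetical pre-collation rewrite. The names REWRITES and
    unify_for_collation are invented here, and the table covers only the
    Thai and Lao sara aa pairs; it is not the full set of orthographically
    used combinations, and Khmer would need its own entries:

```python
import unicodedata

# Nikhahit characters for Thai, Lao, and Khmer.
NIKHAHITS = {"thai": "\u0E4D", "lao": "\u0ECD", "khmer": "\u17C6"}

# Canonical combining class of each; 0 means normalization never reorders it.
combining_classes = {k: unicodedata.combining(v) for k, v in NIKHAHITS.items()}

KO_KAI = "\u0E01"   # a Thai consonant, used as the base C
SARA_AA = "\u0E32"
vowel_first = KO_KAI + SARA_AA + NIKHAHITS["thai"]
nikhahit_first = KO_KAI + NIKHAHITS["thai"] + SARA_AA

# The two orders stay distinct under canonical normalization.
same_under_nfd = (unicodedata.normalize("NFD", vowel_first)
                  == unicodedata.normalize("NFD", nikhahit_first))

# Hypothetical rewrite table (an assumption, not from any standard):
# map every spelling to <sara aa, nikhahit> before computing sort keys.
REWRITES = [
    ("\u0E33", "\u0E32\u0E4D"),        # Thai sara am
    ("\u0E4D\u0E32", "\u0E32\u0E4D"),  # Thai nikhahit, sara aa
    ("\u0EB3", "\u0EB2\u0ECD"),        # Lao am
    ("\u0ECD\u0EB2", "\u0EB2\u0ECD"),  # Lao nikhahit, aa
]


def unify_for_collation(text):
    """Sketch: bring the listed spellings to one order for sorting."""
    for old, new in REWRITES:
        text = text.replace(old, new)
    return text


print(combining_classes)
print(same_under_nfd)
print(unify_for_collation(vowel_first) == unify_for_collation(nikhahit_first))
```

    Explicit string rewrites are used rather than NFKD over the whole text,
    since NFKD would also fold unrelated compatibility characters.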

    > If the latter, that seems to me to be quite wrong: I would
    > not expect *any* data that includes a decomposed
    > representation of sara am to have the sequence < C, sara aa,
    > nikkahit >: it would always be the other way around: < C,
    > nikkahit, sara aa >.

    Perhaps, for Thai and Lao (just because the Unicode decompositions
    are like that). But the expected sequence for the closely related Khmer
    script appears to have the nikhahit after the dependent vowel...
    Likewise for other Indic scripts, where the nikhahit-related characters
    are typographically clearly after the dependent vowel. However, the
    CTT/DUCET currently give only level 2 weights to visargas and
    anusvaras, ignoring them at level 1. I don't know if they should also be
    given level 1 weights for the other Indic scripts (as they should be
    for Lao/Thai/Khmer). (See http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2716.doc.)
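    The difference between level 2 only and level 1 weighting can be
    sketched with a toy multi-level sort key in Python. All weights and
    names below (TABLE_L2_ONLY, TABLE_WITH_L1, sort_key, the unit labels)
    are invented for illustration and are not taken from the CTT or DUCET;
    the key simply lists the non-zero primary weights, a 0 separator, and
    then the non-zero secondary weights:

```python
# Toy collation elements as (primary, secondary) pairs. Invented weights.
TABLE_L2_ONLY = {
    "KA": (100, 1),
    "AA": (200, 1),
    "SIGN": (0, 50),   # anusvara-like sign: ignored at level 1
}
# Same table, but the sign now carries a level 1 weight of its own.
TABLE_WITH_L1 = dict(TABLE_L2_ONLY, SIGN=(150, 1))


def sort_key(units, table):
    """Build a two-level key: primaries, a 0 separator, then secondaries."""
    elements = [table[u] for u in units]
    primary = tuple(p for p, _ in elements if p)
    secondary = tuple(s for _, s in elements if s)
    return primary + (0,) + secondary


plain = ["KA", "AA"]
marked = ["KA", "SIGN", "AA"]

# With a level 2 weight only, the two strings tie at level 1.
print(sort_key(plain, TABLE_L2_ONLY))   # (100, 200, 0, 1, 1)
print(sort_key(marked, TABLE_L2_ONLY))  # (100, 200, 0, 1, 50, 1)

# With a level 1 weight, the sign takes part in the primary comparison.
print(sort_key(marked, TABLE_WITH_L1))  # (100, 150, 200, 0, 1, 1, 1)
```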

                    /kent k

    PS
    While not related to Indic scripts (though it involves similar grouping, with a
    similar solution), I also submitted this contribution on Hangul collation to WG2:
     http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2715.doc

    > Of course, if the former, I would agree.
    >
    >
    >
    > Peter
    >
    > Peter Constable
    > Globalization Infrastructure and Font Technologies
    > Microsoft Windows Division
    >
    >
    >




