Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun May 11 2003 - 14:04:23 EDT

    Here is your question, reformatted to always include real characters
    and names.*

    > Specifically, U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, U+1103 (ᄃ) HANGUL
    CHOSEONG TIKEUT and U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK are given
    the primary weight of 1832, 1833 and 1844, respectively. With these,
    U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
    (ᄃ) HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
    haven't read UTS #10 through, yet)?

    > The order is different from the way (South) Koreans (at least, most
    Korean dictionary editors) expect them to be sorted. We expect U+1113
    (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK (and the other cluster consonants
    whose first component is U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, namely
    U+1114 (ᄔ) HANGUL CHOSEONG SSANGNIEUN, U+1115 (ᄕ) HANGUL CHOSEONG
    NIEUN-TIKEUT and U+1116 (ᄖ) HANGUL CHOSEONG NIEUN-PIEUP) to be put
    after U+1102 (ᄂ) HANGUL CHOSEONG NIEUN but before U+1103 (ᄃ) HANGUL
    CHOSEONG TIKEUT. The same is true of any cluster Jamos.

    > Is it UTC's intention to leave the task of making Hangul Jamos
    collate in accordance with (South) Koreans' expectation to (South)
    Korean specific tailoring?

    We know that there are problems with Korean collation, particularly
    with non-modern Korean characters, and that the fixes will most likely
    involve a reordering of the Jamo characters as well as other changes.
    We have been trying to work with the WG20 committee to resolve them,
    due to a desire to maintain synchrony with ISO 14651 in weights.
    Progress in that committee, unfortunately, has been exceedingly slow.
    At the last committee meeting early this year, we agreed to work out
    details of a requirements document by email, but there has been as yet
    simply no response to the draft suggested by the UTC. So I am less
    than sanguine about the prospects for any kind of timely resolution.

    In the meantime, the work-around is to tailor the Jamo characters to
    interleave the characters properly, and follow one of the approaches
    in UCA 7.1.4 at
    http://www.unicode.org/reports/tr10/tr10-10.html#Trailing_Weights.
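
    As a rough illustration of what "interleaving" means here, the
    following Python sketch sorts single Jamo characters first by the
    three primary weights quoted in the question, then by a hypothetical
    tailored order that places the NIEUN clusters between NIEUN and
    TIKEUT. The tailored order, the rank numbers and the function names
    are assumptions made for the example only. It does not implement
    the UCA, allkeys.txt, or the trailing-weight approaches of section
    7.1.4.

```python
# Primary weights exactly as quoted in the question; only these three
# values appear in the message, so no others are listed here.
DEFAULT_PRIMARY = {
    "\u1102": 1832,  # HANGUL CHOSEONG NIEUN
    "\u1103": 1833,  # HANGUL CHOSEONG TIKEUT
    "\u1113": 1844,  # HANGUL CHOSEONG NIEUN-KIYEOK
}

# Hypothetical tailoring (an assumption for illustration): clusters whose
# first component is NIEUN are interleaved after NIEUN and before TIKEUT.
TAILORED_ORDER = [
    "\u1102",  # NIEUN
    "\u1113",  # NIEUN-KIYEOK
    "\u1114",  # SSANGNIEUN
    "\u1115",  # NIEUN-TIKEUT
    "\u1116",  # NIEUN-PIEUP
    "\u1103",  # TIKEUT
]
# The rank in the tailored list stands in for a tailored primary weight.
TAILORED_PRIMARY = {ch: rank for rank, ch in enumerate(TAILORED_ORDER)}


def sort_default(jamos):
    """Sort single Jamo by the primary weights quoted in the message."""
    return sorted(jamos, key=DEFAULT_PRIMARY.__getitem__)


def sort_tailored(jamos):
    """Sort single Jamo by the hypothetical interleaved order."""
    return sorted(jamos, key=TAILORED_PRIMARY.__getitem__)


sample = ["\u1103", "\u1113", "\u1102"]
default_result = sort_default(sample)    # NIEUN, TIKEUT, NIEUN-KIYEOK
tailored_result = sort_tailored(sample)  # NIEUN, NIEUN-KIYEOK, TIKEUT
```

    With the quoted weights, NIEUN-KIYEOK lands after TIKEUT; with the
    interleaved order it lands between NIEUN and TIKEUT, which is the
    order the question asks for.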

    Thanks for bringing this interleaving issue up; we should add a
    description to section 7.1.4.

    Mark

    * Using http://oss.software.ibm.com/cgi-bin/icu/tr with the following
    transform in "Compound 1" will change all instances of U+XXXX to add
    the real character and the hex name; much easier to see what is being
    described.

                  [:^ASCII:] hexandname
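
    For readers without access to that page, the effect described
    above can be approximated with Python's standard library. This is
    only a sketch of the visible result, not the ICU transform itself;
    the function name and the regular expression are assumptions made
    for the example.

```python
import re
import unicodedata


def add_char_and_name(text):
    """Expand each U+XXXX reference to 'U+XXXX (char) NAME'."""

    def expand(match):
        code = int(match.group(1), 16)
        if code > 0x10FFFF:
            # Not a valid code point: leave the reference untouched.
            return match.group(0)
        ch = chr(code)
        # Fall back to a placeholder when the character has no name.
        name = unicodedata.name(ch, "<unnamed>")
        return f"U+{match.group(1)} ({ch}) {name}"

    return re.sub(r"U\+([0-9A-Fa-f]{4,6})", expand, text)


result = add_char_and_name("U+1102 sorts before U+1103")
```

    The names come from the Unicode data bundled with the Python
    interpreter, so they may differ for characters newer than that
    data.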

    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Jungshik Shin" <jshin@mailaps.org>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Saturday, May 10, 2003 19:22
    Subject: Re: Proposed Update of UTS #10: Unicode Collation Algorithm

    >
    >
    >
    > On Fri, 9 May 2003, Mark Davis wrote:
    >
    > > There is a new Proposed Update of UTS #10: Unicode Collation
    > > Algorithm, on:
    > >
    > > http://www.unicode.org/reports/tr10/tr10-10.html
    >
    > Just a quick question before reading it through and comment on it.
    > Will allkeys.txt for 4.0 keep weights given to Hangul Jamos? The
    > following is written under the assumption that it will.
    >
    > Specifically, U+1102 (Nieun), U+1103 (Tikeut) and U+1113
    > (Nieun-Kiyeok) are given the primary weight of 1832, 1833 and 1844,
    > respectively. With these, U+1113 will be sorted after U+1103, right?
    > Or am I missing something (I haven't read UTS #10 through, yet)? The
    > order is different from the way (South) Koreans (at least, most
    > Korean dictionary editors) expect them to be sorted. We expect U+1113
    > (and other cluster consonants whose first component is U+1102.
    > They're U+1114, U+1115, U+1116) to be put after U+1102 but before
    > U+1103. The same is true of any cluster Jamos.
    > Is it UTC's intention to leave the task of making Hangul Jamos
    > collate in accordance with (South) Koreans' expectation to (South)
    > Korean specific tailoring?
    >
    > Thanks,
    >
    > Jungshik
    >
    >
    >


