Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri May 16 2003 - 18:33:31 EDT


    On Mon, 12 May 2003, Mark Davis wrote:

      Thank you for your detailed reply.

    > > are not listed in 7.1.4). However, it seems to me that some of these
    > > customizations/tailoring in 7.1.4 are not necessary if an additional step
    > > of preprocessing (in which cluster jamos are decomposed into sequences of
    > > basic jamos) is taken as was proposed by Kent in his paper in 2001-2002.
    >
    > It is also a question of cost. Rearranging the weights so that T < V <
    > L doesn't cost anything in implementations of the algorithm.

      Yes, I'm also thinking in terms of cost and flexibility. I have
    no objection to rearranging the weights so that T < V < L and didn't
    express any in my previous message. That's a very good idea.

      What I don't like is the inflexibility of having to collect all the
    known occurrences of cluster Jamos and give each of them a primary
    weight in such a way (interleaving) that they get collated the way
    (South) Koreans expect. When a new cluster jamo is added to the
    repertoire, it's likely that the tailoring has to be redone. It wouldn't
    cost anything at run time, but it costs something to retailor
    them. Because it's rare that we have to add new clusters, this may not
    be a realistic concern. Still, I find it rather inelegant and not in line
    with the basic principles of the Korean script that its inventors had in
    mind.

    > Terminating each of the subclusters would.

      With T < V < L, why would we need to terminate L+, V+ and T+ separately
    instead of just 'L+V+T*' as a whole? Or am I missing something obvious?

    > > As for condition B.2 in 7.1.4, an alternative to that is just adding
    > > a terminator primary weight to only Hangul syllables without optional
    > > T('s). This terminator primary weight should be less than the primary
    > > weight for any Ts (and that of any V's and Ls by condition A.)

      I was wrong. Any syllable, with or without optional T('s), has
    to be terminated.
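
    To make the terminator point concrete, here is a minimal Python sketch.
    The weights and the names TERM, W and key are made up for illustration
    (they only satisfy TERM < T < V < L), and the "expected" order assumes
    that a syllable which is a prefix of another syllable sorts first.

```python
# Illustrative primary weights only, not taken from any real collation table.
# They merely satisfy TERM < T < V < L.
TERM = 1
W = {"T1": 101, "T2": 102, "V1": 201, "L1": 301}

def key(syllables, terminate):
    """Concatenate the primary weights of each syllable.

    If `terminate` is true, append the terminator weight after every
    syllable, whether or not it ends in a trailing consonant.
    """
    out = []
    for syl in syllables:
        out.extend(W[j] for j in syl)
        if terminate:
            out.append(TERM)
    return out

a = [["L1", "V1"], ["L1", "V1"]]        # LV     followed by LV
b = [["L1", "V1", "T1"], ["L1", "V1"]]  # LVT1   followed by LV
d = [["L1", "V1", "T1", "T2"]]          # LVT1T2 (cluster of trailing consonants)

# Without a terminator, the L of the next syllable outweighs a T,
# so the shorter first syllable wrongly sorts later.
print(key(a, False) < key(b, False))  # False
print(key(b, False) < key(d, False))  # False

# With every syllable terminated, the shorter syllable sorts first,
# including the case where the syllable already ends in a T.
print(key(a, True) < key(b, True))    # True
print(key(b, True) < key(d, True))    # True
```

    The b/d pair is the case that needs a terminator even after a syllable
    with a T: LVT1 followed by another syllable has to stay ahead of LVT1T2.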

    > > As for condition B.1.a, I'm wondering why only L's are mentioned.
    > > The same (contraction) should be applied to multiple V's and T's as well.
    > > In addition, in the paragraph that begins with
    >
    > 2. The same goes for:
    >
    > L V T
    > L V V
    >
    > With all V's greater than all T's, then any sequences that are equal
    > up to the T/V comparison will take the right ordering.

      Well, I might not have been very clear that I wasn't so much
    concerned with the handling of inter-syllable (or Hangul syllable followed
    by non-Hangul) issues as with intra-syllable (or more precisely,
    'inter-vowel', 'inter-leading consonants', and 'inter-trailing
    consonants') issues, because the former are already well taken care of
    by one prescription or another suggested in the draft.

      What I was questioning was why the *contraction* (that should be
    applied to sequences of V's and T's as well as sequences of L's) is
    mentioned only for sequences of L's. For instance, suppose that
    we have three sequences LV1, LV2, and LV1V4 where V2 is a cluster of
    V1 and V3 and that the desired collation among them is LV1 < LV2 (=
    LV1V3) < LV1V4. Without contracting V1V4 and giving it an independent
    primary weight larger than that of V2, they'd be sorted LV1 < LV1V4 <
    LV2, instead. As a real example, consider the following three sequences.

      S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
          U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
      S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
          U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
      S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
          U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)

    With the primary weights of each Jamo given as follows,

      U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
      U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
      U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
      U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
      U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
      U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101

    their primary weight sequences will be [301,251,101], [301,255,101] and
    [301,251,231,101], respectively and they'll be sorted S1 < S3 < S2 instead
    of the correct S1 < S2 < S3 if there's no contraction applied to 'U+1169
    (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL JUNGSEONG YA)' sequence.
    By contracting 'U+1169 (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL
    JUNGSEONG YA)' and giving it an independent primary weight larger than
    that of U+116A (ㅘ:HANGUL JUNGSEONG WA) 255 (say, 257), they will be
    sorted S1 < S2 < S3.
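
    The effect of the contraction can be checked with a short Python sketch.
    It uses only the example weights above; the function name primary_key,
    the greedy two-character lookup and the CONTRACTIONS table are
    assumptions made for illustration, not how any particular UCA
    implementation works.

```python
# Example primary weights from the message (illustrative, not real UCA/DUCET values).
WEIGHTS = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
    "\u116A": 255,  # HANGUL JUNGSEONG WA
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}

# Hypothetical contraction: O + YA gets its own weight, larger than that of WA.
CONTRACTIONS = {"\u1169\u1163": 257}

def primary_key(s, contractions=None):
    """Map a jamo string to a list of primary weights.

    Two-character contractions, if supplied, are matched greedily
    before falling back to single-character weights.
    """
    contractions = contractions or {}
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in contractions:
            key.append(contractions[pair])
            i += 2
        else:
            key.append(WEIGHTS[s[i]])
            i += 1
    return key

S1 = "\u1100\u1169\u11A8"
S2 = "\u1100\u116A\u11A8"
S3 = "\u1100\u1169\u1163\u11A8"
names = {S1: "S1", S2: "S2", S3: "S3"}

plain = sorted([S3, S2, S1], key=primary_key)
tailored = sorted([S3, S2, S1], key=lambda s: primary_key(s, CONTRACTIONS))

print([names[s] for s in plain])     # ['S1', 'S3', 'S2']
print([names[s] for s in tailored])  # ['S1', 'S2', 'S3']
```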

    However, we can avoid this entirely if we just decompose the cluster vowel
    'U+116A (ㅘ:HANGUL JUNGSEONG WA)' to 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    U+1161 (ㅏ:HANGUL JUNGSEONG A)' and do not give it a primary weight of
    its own. Then we have [301,251,101], [301,251,201,101] and
    [301,251,231,101], which leads them to collate S1 < S2 < S3 as desired.
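
    The decomposition alternative can be sketched in Python as follows.
    The DECOMPOSE table and the name decomposed_key are hypothetical; the
    mapping is a collation-time preprocessing step, not a Unicode
    normalization form, and the weights are the example values used above.

```python
# Primary weights for basic jamos only (example values, not real UCA data).
BASIC_WEIGHTS = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}

# Hypothetical preprocessing table: cluster jamo -> sequence of basic jamos.
DECOMPOSE = {
    "\u116A": "\u1169\u1161",  # WA -> O + A
}

def decomposed_key(s):
    """Decompose cluster jamos, then look up weights of basic jamos only."""
    basic = "".join(DECOMPOSE.get(ch, ch) for ch in s)
    return [BASIC_WEIGHTS[ch] for ch in basic]

S1 = "\u1100\u1169\u11A8"
S2 = "\u1100\u116A\u11A8"
S3 = "\u1100\u1169\u1163\u11A8"
names = {S1: "S1", S2: "S2", S3: "S3"}

ordered = sorted([S3, S2, S1], key=decomposed_key)
print([names[s] for s in ordered])  # ['S1', 'S2', 'S3']
```

    No contraction and no weight for U+116A is needed here; a new cluster
    jamo would only require a new entry in the decomposition table.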

    < a good explanation about *inter-syllable* issues snipped >

    > > For condition B.1.a, this means that if L1 has a primary......
    > >
    > > I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and
    > > LiLk
    > > where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
    > > be applied to cases like the one involving U+1105 (ᄅ : HANGUL
    > > CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
    > > and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
    > > CHOSEONG RIEUL-HIEUH).
    >
    > L1, L2 are simply variables standing for particular L's; the only
    > reason for that is to stress where they are equal in two different
    > cases. So it is just a terminology difference from Li, Lj.

       'L1L1' would be interpreted as two identical Ls in a row (doublet of
    L1). My point is that they can be different as well (see the example
    given above). Using 'LiLk'(or L1L3 if you prefer) instead of 'L1L1'
    makes it clear, doesn't it?

    > > Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
    > > HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
    > > CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
    > > should be treated identically, but UTS 10(draft) is rather silent on
    > > that perhaps deferring to tailorings.
    >
    > I agree that longer sequences should expand in weights to be
    > equivalent, and that this should be done in the UCA. As I said, it is
    > just taking a while working with WG20*, and in the meantime people
    > need to tailor it.

       Thanks again for your effort to put things into order in cooperation
    with WG20 and I hope WG20 will be able to work together with the UTC
    on this issue soon.

       IMHO, the most elegant (not necessarily the most efficient and
    sound from the engineering point of view [1]) way to do it is not
    enumerating all equivalent sequences but just giving primary weights to
    only 'basic' Jamos and requiring a preprocessing in which cluster jamos
    are decomposed into sequences of basic Jamos. As mentioned above, in
    addition to this, primary weights are assigned to satisfy the condition
    that L > V > T > [syl_terminator], which is already listed in the draft.

      In a sense, this preprocessing ( which is not a part of any Unicode
    normalization) is similar to Thai/Lao reordering. Anyway, I'm hoping that
    the normalization tailoring currently under review will be approved so
    that we'll be able to represent/deal with Korean script in Unicode in
    a way that is more in line with what inventors of the script envisioned
    in the 15th century than we can now.

      Jungshik

    [1] UTS #10 can mention that if the repertoire of Hangul cluster jamos
    is known a priori, the preprocessing can be avoided by a tailoring in
    which all cluster jamos in the repertoire are contracted and assigned
    independent primary weights that interleave with those of basic Jamos.
    This is rather similar to what it mentions about a possible shortcut
    that can be taken for Hangul precomposed syllables when no Hangul Jamo
    is present in the repertoire.
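
    As a rough illustration of the shortcut in [1], the interleaved weights
    for a known repertoire could be derived mechanically from the
    decompositions. Everything below is an assumption for illustration: the
    small vowel-only REPERTOIRE, the name interleaved_weights and the
    numbering scheme are hypothetical, and the resulting numbers are not
    real collation weights.

```python
# Example weights for basic vowels only (illustrative values).
BASIC_V = {
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
}

# Hypothetical known repertoire: each vowel unit (single code point or
# contracted sequence) mapped to its sequence of basic vowels.
REPERTOIRE = {
    "\u1161": "\u1161",
    "\u1163": "\u1163",
    "\u1169": "\u1169",
    "\u116A": "\u1169\u1161",        # WA as O + A
    "\u1169\u1163": "\u1169\u1163",  # the sequence O + YA, to be contracted
}

def interleaved_weights(repertoire, basic, start=200):
    """Rank every unit by its decomposed weight sequence.

    Units are sorted by the weights of their basic jamos (a prefix sorts
    before its extensions), then numbered consecutively from start + 1.
    """
    def decomposed(unit):
        return [basic[ch] for ch in repertoire[unit]]

    ordered = sorted(repertoire, key=decomposed)
    return {unit: start + rank for rank, unit in enumerate(ordered, 1)}

tailored = interleaved_weights(REPERTOIRE, BASIC_V)
for unit, weight in sorted(tailored.items(), key=lambda kv: kv[1]):
    print(" ".join(f"U+{ord(ch):04X}" for ch in unit), weight)
```

    With these inputs O, WA and the contracted O + YA receive increasing
    weights in that order, which is the interleaving the contraction
    approach needs; adding a cluster to the repertoire means rerunning the
    derivation.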


