Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon May 12 2003 - 04:08:27 EDT

In reply to: Mark Davis: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"

On Sun, 11 May 2003, Mark Davis wrote:

> Here is your question, reformatted to always include real characters
> and names.*

Thank you for reformatting. I have no problem adding real characters
(naturally, it's a lot easier for me to type real characters than code
points), but some people have trouble with real characters in UTF-8 even
on this list, so I just took the safest route :-) (In particular,
I hate receiving responses that mislabel UTF-8 as ISO-8859-1 or
other MIME charsets.) Well, this cannot be an excuse for not including
the character names. (Perhaps I should write a simple Perl script to
convert any Unicode character in a given range (the default would be any
character above U+007F) to 'U+xxxx (real character) Unicode Name'.)
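
A minimal sketch of such a converter, written in Python rather than Perl
and relying only on the standard unicodedata module. The function name
annotate and the placeholder "<unnamed>" are made up for illustration.

```python
import unicodedata

def annotate(text, threshold=0x7F):
    """Replace each character above `threshold` with
    'U+xxxx (real character) Unicode Name'; leave the rest untouched."""
    out = []
    for ch in text:
        if ord(ch) > threshold:
            # Fall back to a placeholder for characters without a name.
            name = unicodedata.name(ch, "<unnamed>")
            out.append(f"U+{ord(ch):04X} ({ch}) {name}")
        else:
            out.append(ch)
    return "".join(out)

# "\u1102" is HANGUL CHOSEONG NIEUN.
print(annotate("a\u1102"))  # prints: aU+1102 (ᄂ) HANGUL CHOSEONG NIEUN
```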

> > Specifically, U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, U+1103 (ᄃ) HANGUL
> > CHOSEONG TIKEUT and U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK are given
> > the primary weight of 1832, 1833 and 1844, respectively. With these,
> > U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
> > (ᄃ) HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
> > haven't read UTS #10 through, yet)?

> > The order is different from the way (South) Koreans (at least, most
> > Korean dictionary editors) expect them to be sorted. We expect U+1113
> > (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK (and other cluster consonants whose
> > first component is U+1102 (ᄂ) HANGUL CHOSEONG NIEUN. They're U+1114
> > (ᄔ) HANGUL CHOSEONG SSANGNIEUN, U+1115 (ᄕ) HANGUL CHOSEONG
> > NIEUN-TIKEUT, U+1116 (ᄖ) HANGUL CHOSEONG NIEUN-PIEUP) to be put after
> > U+1102 (ᄂ) HANGUL CHOSEONG NIEUN but before U+1103 (ᄃ) HANGUL CHOSEONG
> > TIKEUT. The same is true of any cluster Jamos.

> > Is it UTC's intention to leave the task of making Hangul Jamos
> > collate in accordance with (South) Koreans' expectation to (South)
> > Korean specific tailoring?

> We have been trying to work with the WG20 committee to resolve them,
> due to a desire to maintain synchrony with ISO 14651 in weights.

Thank you for your effort in this regard.

> In the meantime, the work-around is to tailor the Jamo characters to
> interleave the characters properly,

Another way is to decompose all cluster Jamos into sequences of
basic Jamos and assign weights to _only_ basic Jamos, which you don't
seem to be very fond of, apparently because their decomposition is not
included even in the compatibility decomposition in Unicode 3.0 and up
(although it was in Unicode 2.0). The difference between the two
approaches is as follows.

In the first approach, the treatment of cluster Jamos depends on
whether they're assigned separate code points or not. For instance,
U+1113 (ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) is treated differently
from a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) whose only
possible representation is the sequence of U+1102 (ᄂ : HANGUL CHOSEONG
NIEUN) and U+1109 (ᄉ : HANGUL CHOSEONG SIOS) [1]. Moreover, depending on
the implementation, U+1113 (ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) and the
sequence of U+1102 (ᄂ : HANGUL CHOSEONG NIEUN) and U+1100 (ᄀ : HANGUL
CHOSEONG KIYEOK) can be treated differently. This is in contrast
to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
For them, neither whether precomposed letters (base + diacritic marks)
are separately encoded nor whether they're represented by precomposed
characters or by base + diacritics affects their collation.

If we have the full/exhaustive list of all possible combinations
of Jamo sequences (or we deal with a limited repertoire, as seems to be
assumed), it's possible to assign weights in such a way that the two
kinds of differences mentioned above become nil. Even if we don't
(as is allowed in Unicode), you may have a clever method or two (that
are not listed in 7.1.4). However, it seems to me that some of the
customizations/tailorings in 7.1.4 are not necessary if an additional
preprocessing step (in which cluster jamos are decomposed into sequences
of basic jamos) is taken, as was proposed by Kent in his paper in 2001-2002.
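
To illustrate the preprocessing idea, here is a rough Python sketch. The
decomposition table covers only the cluster Jamos named in this message,
and the primary weights are invented small integers (they are not the UCA
or ISO 14651 values); the names DECOMP, BASIC_WEIGHT and primary_key are
hypothetical. It compares leading consonants in isolation only. How the
keys interact with following vowels and trailing consonants is what the
conditions in 7.1.4 address, and that is not modelled here.

```python
# Hypothetical decomposition table: cluster jamo -> sequence of basic jamos.
DECOMP = {
    "\u1113": "\u1102\u1100",  # NIEUN-KIYEOK -> NIEUN + KIYEOK
    "\u1114": "\u1102\u1102",  # SSANGNIEUN   -> NIEUN + NIEUN
    "\u1115": "\u1102\u1103",  # NIEUN-TIKEUT -> NIEUN + TIKEUT
    "\u1116": "\u1102\u1107",  # NIEUN-PIEUP  -> NIEUN + PIEUP
    "\u111A": "\u1105\u1112",  # RIEUL-HIEUH  -> RIEUL + HIEUH
}

# Made-up primary weights, assigned to basic jamos only.
BASIC_WEIGHT = {
    "\u1100": 1,  # KIYEOK
    "\u1102": 2,  # NIEUN
    "\u1103": 3,  # TIKEUT
    "\u1105": 4,  # RIEUL
    "\u1106": 5,  # MIEUM
    "\u1107": 6,  # PIEUP
    "\u1109": 7,  # SIOS
    "\u1112": 8,  # HIEUH
}

def primary_key(text):
    """Decompose cluster jamos, then map each basic jamo to its weight."""
    decomposed = "".join(DECOMP.get(ch, ch) for ch in text)
    return tuple(BASIC_WEIGHT[ch] for ch in decomposed)

# A precomposed cluster and its spelled-out sequence get the same key.
assert primary_key("\u1113") == primary_key("\u1102\u1100")

# NIEUN < NIEUN-KIYEOK < NIEUN + SIOS (no code point of its own) < TIKEUT
items = ["\u1103", "\u1102\u1109", "\u1113", "\u1102"]
ordered = sorted(items, key=primary_key)
```

With these assumptions, a cluster that has its own code point and one
that can only be written as a sequence are handled by the same rule, and
both fall between U+1102 and U+1103.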

> and follow one of the approaches
> in UCA 7.1.4 at
> http://www.unicode.org/reports/tr10/tr10-10.html#Trailing_Weights.

Actually, I read that part before writing my message, but I didn't
mention it (deciding to write about the details of that part later),
partly because, as you recognized, I don't see how that part _alone_
solves the issue I raised.

As for condition B.2 in 7.1.4, an alternative is to add
a terminator primary weight only to Hangul syllables without the optional
T('s). This terminator primary weight should be less than the primary
weight of any T (and, by condition A, less than that of any V or L).
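
The terminator idea can be sketched as follows in Python. The weight
bands (terminator = 1, T = 10 and up, V = 100 and up, L = 1000 and up) are
invented purely so that terminator < T < V < L holds, and the input is
assumed to consist of precomposed modern Hangul syllables only, split
with the standard arithmetic decomposition. The name syllable_key and
its use_terminator flag are hypothetical.

```python
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

TERMINATOR = 1                           # below every T weight
T_BASE, V_BASE, L_BASE = 10, 100, 1000   # made-up weight bands

def syllable_key(text, use_terminator=True):
    """Primary key for a string of precomposed Hangul syllables."""
    key = []
    for ch in text:
        index = ord(ch) - S_BASE
        if not 0 <= index < 19 * N_COUNT:
            raise ValueError("only precomposed Hangul syllables are handled")
        l, rest = divmod(index, N_COUNT)
        v, t = divmod(rest, T_COUNT)
        key += [L_BASE + l, V_BASE + v]
        if t:                    # syllable has a trailing consonant
            key.append(T_BASE + t)
        elif use_terminator:     # no T: close the syllable explicitly
            key.append(TERMINATOR)
    return tuple(key)

words = ["\uAC01", "\uAC00\uAC00"]  # "각" and "가가"

# With the terminator, 가가 sorts before 각, syllable by syllable.
with_term = sorted(words, key=syllable_key)

# Without it, the T of 각 is compared against the L of the second 가
# and, being lighter, wrongly pulls 각 to the front.
without_term = sorted(words, key=lambda s: syllable_key(s, use_terminator=False))
```

Only syllables lacking a T carry the extra weight, so keys for
syllables that do have a T are no longer than before.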

As for condition B.1.a, I'm wondering why only L's are mentioned.
The same (contraction) should be applied to multiple V's and T's as well.
In addition, in the paragraph that begins with

For condition B.1.a, this means that if L1 has a primary......

I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and LiLk
where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
be applied to cases like the one involving U+1105 (ᄅ : HANGUL
CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
CHOSEONG RIEUL-HIEUH).

Another part I find missing is how to deal with U+111A (ᄚ :
HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105 (ᄅ : HANGUL
CHOSEONG RIEUL) and U+1112 (ᄒ : HANGUL CHOSEONG HIEUH). IMO, they
should be treated identically, but UTS #10 (draft) is rather silent on
that, perhaps deferring to tailorings.

> Thanks for bringing this interleaving issue up; we should add a
> description to section 7.1.4.

That will be nice.

[1] I'm not making up these sequences. MS Office XP and Uniscribe support
them (see
http://www.microsoft.com/typography/otfntdev/hangulot/appen.htm).
PARK Won-kyu, with my help, has also developed a GPL'd OpenType font
that supports them along with many others (and will release
a few more). There's a Mozilla patch to support them across platforms,
and a Pango patch is being made.


This archive was generated by hypermail 2.1.5 : Mon May 12 2003 - 05:04:44 EDT