L2/02-010
To: | UTC |
From: | Mark Davis |
Date: | 2001-01-10 |
Re: | UCA Hangul Problem |
The UCA currently has a flaw regarding Hangul sorting. This paper discusses the flaw, and proposes two alternative approaches that would fix it. As a by-product of either fix, it would make it easier to handle syllabic-based sorting such as Khmer.
The UCA currently sorts Hangul as follows:
1 | 가 | {HANGUL SYLLABLE GA} |
2 | 각 | {HANGUL SYLLABLE GAG} |
Notice that GAG comes after GA in Case 1. But in Case 2, it comes before. That is, the order of these two Hangul syllables is reversed when each is followed by a CJK character.
2 | 각一 | {HANGUL SYLLABLE GAG}{U+4E00} |
1 | 가一 | {HANGUL SYLLABLE GA}{U+4E00} |
This is not acceptable: when two characters A and B have different primary order, appending another independent primary-weighted character C to each should not affect the ordering. (Independent means that AC and BC do not form contractions, interact in normalization, or are subject to Thai rearrangement).
Why does this happen? All characters are decomposed when sorting, to preserve canonical equivalence. (This is the logical procedure -- optimizations can be used as long as they have the same order). This results in the following comparisons being made:
|
|
Look at column 3 in Case 1 and 2.
This problem is not limited to Hangul. It happens with any characters where a grapheme cluster can contain multiple primary weights, and yet is sorted as a unit.
There are two fairly straightforward alternatives solutions that would fix this problem. First the two alternatives are presented, then an assessment of them is made. I'll refer to the document L2/01-446, since that is the latest draft of TR10 presented to the committee.
The strategy here is to terminate all necessary grapheme clusters with a special weight value, here called LOW. This is a non-zero weight that is lower than any other non-zero weight (including variable primary weights). This will solve the problem for Hangul, since the new ordering will be:
Source | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
1 | 가一 | {K} | {a} | {L} | {U} | |
2 | 각一 | {K} | {a} | {k} | {L} | {U} |
Since the {k} is compared against an {L} in column three, the correct ordering results.
The changes required in TR10 to implement this are:
1. In a new section 3.2.3, describe the value LOW, and the fact that certain characters can be marked as requiring termination. By default, all of the Hangul initial characters (Choseung) in the default collation table are so marked.
2. In a new section 3.2.1, add syntax is added so that other characters can require termination -- by analogy with the current syntax, it would look like:
@terminate 1100-11FF,... ;
3. To the main algorithm, add a new step 2.4a
2.4a If the last character of the sequence terminates a grapheme cluster, and the first character of that cluster is marked as requiring termination, then add the extra collation weight <LOW, 20, 2>
The strategy here is, instead of terminating specific grapheme clusters with an extra LOW value, all of the trailing characters in the cluster (those that have primary weights) are assigned primary weights in a high range, higher than any other characters. Thus we end up with
Source | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
1 | 가一 | {K} | {a} | {U} | |
2 | 각一 | {K} | {a} | {k} | {U} |
Because we always assign {a} and {k} (and other trailing Hangul characters) to have high weights, higher than anything that might follow them, we end up with the right order. This is applicable to other languages, wherever the trailing characters in the grapheme cluster are distinct from the leading character.
What happens where this latter condition is not true? The UCA has mechanisms that can be called into play in this case, described in Section 3.1.1 Multiple Mappings. For example, suppose that the Hangul Syllable is of the form LLVT instead of LVT (this happens with archaic Hangul). If the LL is to be sorted as a unit, then it would require the addition of a contraction, so that the LL mapped to a single primary. If the second L is to be sorted as if it were trailing, then this would require a contraction-expansion, as described in 3.1.1.
The changes required to TR10 are:
1. In 7.1.3 Implicit Weights, reserve an area of 1024 high primary weights, by changing the BASE weights from:
FFC0 CJK Ideograph FF80 CJK Ideograph Extension A/B FF40 Any other code point
to
FBC0 CJK Ideograph FB80 CJK Ideograph Extension A/B FB40 Any other code point
* 1024 is a reasonable value, given what we know -- and multiple primaries can always be used if necessary, as in 6.2 Large Weight Values).
2. In the Default Unicode Collation Element Table, change the trailing Hangul characters to have primary weights in the Fxxx range, e.g. FCE0..FD7E. These include:
1161 ; [.16E0.0020.0002.1161] # HANGUL JUNGSEONG A ... 11F9 ; [.1773.0020.0002.11F9] # HANGUL JONGSEONG YEORINHIEUH
3. Since the assignment of CJK Ideographs has changed, modify the dependent characters, such as
U+3280 CIRCLED IDEOGRAPH ONE
Option A is quite general. The downside is that it will produce longer sort keys for Hangul (and any other scripts that require termination), and require some extra processing.
For processing, what you can do is turn on a flag whenever a character requires termination. If the flag is on, then each character will require an extra check to see if it is before a grapheme boundary (normally this just requires looking at the character before and after the prospective boundary). This could also be done with a state machine. While this can be fairly fast, collation is very performance sensitive, so this will probably add significantly to comparison of Hangul, and have a small effect on other characters.
An unoptimized implementation will have longer sort keys for Hangul, by about 40%. An optimized implementation can reduce the length considerably, although it will probably still be somewhat longer. It is not yet clear what the performance cost of such optimization would be.
Option B is slightly more limited. It does place some constraints on the values of trailing primary collation weights, and requires that they be distinct from the leading primary weights for the grapheme clusters in question. The contraction and expansion mechanisms must be used whenever there are characters that are sometimes leading and sometimes trailing. However, there is normally no performance or sort-length impact on the UCA. It also has the advantage of essentially no impact on the standard implementations, since it only changes three constants (the UTC had already approved a change to 7.1.2, so implementations would have to change slightly for the UCA Version 9 anyway.)
My recommendation is to implement option B in the next version of the UCA.