L2/01-130
US Comments on FPDAM 1 to ISO/IEC 14651
- International string ordering, Amendment #1
March 27, 2002
The US votes NO with comments on FPDAM 1 to ISO/IEC 14651 (SC22/WG20 N890; L2/01-470). Its vote will be changed to YES if the following two problems are addressed. (Other than the changes needed to address these two problems, the US prefers the weights that are currently present in the table.)
Due to production problems in generating the data tables, the following item TC4 from L2/01-330 was not implemented in the current data table, although it was accepted (see the disposition of comments on the PDAM, SC22/WG20 N882.) It needs to be implemented to prevent formal ordering problems and maintain synchronization with UCA.
TC4. Modify handling of secondaries for Numerics. These are to be weighted consistent with the approach used in other constructed secondaries (not involving an accent), such as in:
<U16AA> <S16A8>;"<BASE><VRNT1>";"<COMPAT><MIN>";<U16AA> % RUNIC LETTER AC A
Thus, the following example for a Mongolian digit
<U1811> <S0031>;<MONGL>;<MIN>;<U1811> % MONGOLIAN DIGIT ONE
will become
<U1811> <S0031>;"<BASE><MONGL>";"<MIN><MIN>";<U1811> % MONGOLIAN DIGIT ONE
The list of numeric script secondary symbols to which this should be applied is the following:
<NEGATIVE> <SANSSERIF> <NEGSANSSERIF> <ARABIC> <EXTARABIC> <ETHPC> <NAGAR> <BENGL> <BENGALINUMERATOR> <GURMU> <GUJAR> <ORIYA> <TAMIL> <TELGU> <KNNDA> <MALAY> <THAII> <LAAOO> <BODKA> <MYANM> <KHMER> <MONGL> <CJKVS>
Background. Look at the following example, with:
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
<U00E1> <S0061>;"<BASE><AIGUT>";"<MIN><MIN>";<U00E1> % LATIN SMALL LETTER A WITH ACUTE
<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
<U0968> <S0032>;<NAGAR>;<MIN>;<U0968> % DEVANAGARI DIGIT TWO
The following shows how combinations of the first two and second two sort:
Letters | Sort Key |
---|---|
a2 | <S0061><S0032><BASE><BASE>... |
a२ | <S0061><S0032><BASE><NAGAR>... |
á२ | <S0061><S0032><BASE><NAGAR>... |
á2 | <S0061><S0032><BASE><AIGUT><BASE>... |
Notice that in the first two cases we get 2, then Devanagari 2; while in the second two cases we get the reverse. This is clearly wrong; the wrong secondary weights are being compared to one another. To prevent these cases, UCA is adding the following invariant:
For all collation elements,
In general, all Level N weights in Level N-1 ignorables must be strictly greater than those in Level N-2 ignorables.
- All secondaries in non-ignorables must be strictly less than those in primary ignorables.
- All tertiaries in primary ignorables must be strictly less than those in secondary ignorables.
The accent in a-acute is a primary ignorable, and must thus have a secondary weight greater than the secondary weight in Devanagari digit two. While there are different ways to achieve this, the easiest is to expand the Devanagari weight into:
<U0968> <S0032>;"<BASE><NAGAR>";<MIN>;<U0968> % DEVANAGARI DIGIT TWO
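The misalignment can be sketched in a few lines of Python. This is only an illustration: the numeric values given to BASE, AIGUT and NAGAR are hypothetical (the real values follow from the symbol order in the 14651 table), and the helper `secondary_key` is a name introduced here. It simply concatenates the secondary weights of each character, which is how the level-2 part of a key is formed.

```python
# Hypothetical numeric values for the secondary symbols (an assumption).
# The only property relied on after the change is that BASE is the lowest.
BASE, AIGUT, NAGAR = 1, 2, 3

def secondary_key(chars, table):
    """Concatenate the secondary weights of each character, in order."""
    key = []
    for ch in chars:
        key.extend(table[ch])
    return key

# Before the change: DEVANAGARI DIGIT TWO carries <NAGAR> alone.
before = {
    "a": [BASE],
    "\u00e1": [BASE, AIGUT],  # LATIN SMALL LETTER A WITH ACUTE
    "2": [BASE],
    "\u0968": [NAGAR],        # DEVANAGARI DIGIT TWO
}

# After the change: it carries "<BASE><NAGAR>".
after = dict(before)
after["\u0968"] = [BASE, NAGAR]

a_deva2 = "a\u0968"    # a + Devanagari digit two
aacute_2 = "\u00e12"   # a-acute + ASCII digit two

# Before: the second position pairs NAGAR (a script marker) with AIGUT (an accent).
print(secondary_key(a_deva2, before), secondary_key(aacute_2, before))
# After: the second position pairs BASE with AIGUT, so the accent decides.
print(secondary_key(a_deva2, after), secondary_key(aacute_2, after))
```

With the expanded weight, the accent is compared against BASE, so the outcome no longer depends on how the script symbol happens to be ordered relative to the accent symbols.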
To maintain the synchronization between ISO/IEC 14651 and the Unicode Collation Algorithm, the US requests that the primary values for JUNGSEONG and JONGSEONG characters be made higher than any other weights in the Default table. In no case will this result in worse sorting results, and it does preserve synchronization.
<U20000>..<U2A6D6> <S20000>..<S2A6D6>;<BASE>;<MIN>;<U20000>..<U2A6D6> % Han Extension B
<U1160> <S1160>;<BASE>;<MIN>;<U1160> % HANGUL JUNGSEONG FILLER
....
<U11F9> <S11F9>;<BASE>;<MIN>;<U11F9> % HANGUL JONGSEONG YEORINHIEUH
<PLAIN> % Maximal level 4 weight
This change does not preclude adding descriptions of possible preprocessing steps with similar objectives, as some other national bodies may request.
Background. The UCA currently sorts Hangul as follows. ISO/IEC 14651 does the same, whenever NFD (decomposed) data is used, or when archaic Hangul syllables (requiring the use of Jamo) are used.
Case 1:
No. | Source | Characters |
---|---|---|
1 | 가 | {HANGUL SYLLABLE GA} |
2 | 각 | {HANGUL SYLLABLE GAG} |
Notice that GAG comes after GA in Case 1. But in Case 2, it comes before. That is, the order of these two Hangul syllables is reversed when each is followed by a CJK character.
Case 2:
No. | Source | Characters |
---|---|---|
2 | 각一 | {HANGUL SYLLABLE GAG}{U+4E00} |
1 | 가一 | {HANGUL SYLLABLE GA}{U+4E00} |
This is not acceptable: when two characters A and B have different primary order, appending another independent primary-weighted character C to each should not affect the ordering. (Independent means that AC and BC do not form contractions, do not interact in normalization, and are not subject to Thai rearrangement.)
Why does this happen? All characters are decomposed when sorting in UCA, to preserve canonical equivalence. (This is the logical procedure -- optimizations can be used as long as they have the same order). This results in the following comparisons being made:
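Case 1:
No. | Source | 1 | 2 | 3 | |
---|---|---|---|---|---|
1 | 가 | {K} | {a} | | |
2 | 각 | {K} | {a} | {k} | |

Case 2:
No. | Source | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
2 | 각一 | {K} | {a} | {k} | {U} |
1 | 가一 | {K} | {a} | {U} | |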
Look at column 3 in Case 1 and 2.
The Unicode Technical Committee has considered this issue, and for a number of reasons has approved the following solution. In particular, this solution normally has no performance or sort-length impact on the UCA. Collation implementations are extremely sensitive as to both performance and sort-key length, so this is a very important feature. It also has the advantage of essentially no impact on the standard implementations, since it only changes three constants used in the UCA algorithm. The changes that have been approved for UTR #10: Unicode Collation Algorithm are:
1. In 7.1.3 Implicit Weights, an area of 1024* high primary weights is reserved, by changing the BASE weights from:
FFC0 CJK Ideograph
FF80 CJK Ideograph Extension A/B
FF40 Any other code point
to
FBC0 CJK Ideograph
FB80 CJK Ideograph Extension A/B
FB40 Any other code point
* 1024 is sufficient room, given that multiple primaries can always be used if necessary (as in 6.2 Large Weight Values).
2. In the Default Unicode Collation Element Table, the trailing Hangul characters are changed to have primary weights in the Fxxx range, e.g. FCE0..FD7E. These include:
1161 ; [.16E0.0020.0002.1161] # HANGUL JUNGSEONG A
....
11F9 ; [.1773.0020.0002.11F9] # HANGUL JONGSEONG YEORINHIEUH
3. Since the assignment of CJK Ideographs has changed, the dependent characters are modified, such as
U+3280 CIRCLED IDEOGRAPH ONE
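The room created by change 1 can be checked with a short Python sketch. It assumes the derivation of implicit weights given in UTR #10, 7.1.3, which is not restated in this document: AAAA = BASE + (CP >> 15) and BBBB = (CP & 0x7FFF) | 0x8000. The function name `implicit_primaries` and the upper bounds chosen for each class (0x9FFF, 0x2FFFF, 0x10FFFF) are illustrative assumptions, picked to be generous rather than exact.

```python
def implicit_primaries(cp, base):
    """Return the pair of primary weights (AAAA, BBBB) derived for code point cp.

    Assumes the UTR #10, 7.1.3 derivation; not restated in this document.
    """
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

OLD_BASES = {"cjk": 0xFFC0, "cjk_ext": 0xFF80, "other": 0xFF40}
NEW_BASES = {"cjk": 0xFBC0, "cjk_ext": 0xFB80, "other": 0xFB40}

# Each base moves down by the same amount, which is the reserved area.
reserved = OLD_BASES["cjk"] - NEW_BASES["cjk"]

# Largest lead weight any class can produce under the new bases,
# using generous (assumed) upper bounds for each class.
highest_new_lead = max(
    implicit_primaries(0x9FFF, NEW_BASES["cjk"])[0],
    implicit_primaries(0x2FFFF, NEW_BASES["cjk_ext"])[0],
    implicit_primaries(0x10FFFF, NEW_BASES["other"])[0],
)

print(reserved)                                         # size of the reserved area
print([hex(w) for w in implicit_primaries(0x4E00, OLD_BASES["cjk"])])
print([hex(w) for w in implicit_primaries(0x4E00, NEW_BASES["cjk"])])
print(hex(highest_new_lead))
```

Under these assumptions every implicit lead weight stays below FCE0, the start of the example range given in change 2 for the trailing Hangul characters.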
Because of these changes, the JUNGSEONG and JONGSEONG characters are assigned primary weights in a high range, higher than any other characters. Thus the above Case 2 changes to:
No. | Source | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
1 | 가一 | {K} | {a} | {U} | |
2 | 각一 | {K} | {a} | {k} | {U} |
Because {a} and {k} now have high weights, higher than anything (e.g., {U}) that might follow them, the right order results. The only further issue is the case of multiple lead characters. The UCA and 14651 have mechanisms that can be called into play in this case, described in Section 3.1.1 Multiple Mappings. For example, suppose that the Hangul Syllable is of the form LLVT instead of LVT (this happens with archaic Hangul). If the LL is to be sorted as a unit, then it would require the addition of a contraction, so that the LL mapped to a single primary. If the second L is to be sorted as if it were trailing, then this would require a contraction-expansion, as described in 3.1.1. There are a small number of LL cases -- these can be easily tailored for environments requiring the sorting of archaic Hangul.
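The effect on Case 1 and Case 2 can be reproduced with a minimal Python sketch that compares primary keys as lists of weights. Only two values come from this document: 16E0 as the old weight of JUNGSEONG A, and FFC0 / FBC0 as the old and new lead weights for a CJK Ideograph such as U+4E00. All other values, and the helper name `primary_key`, are hypothetical placeholders chosen to fall in the stated ranges. For simplicity only the lead implicit weight of {U} is used.

```python
# Primary weights. Values marked "hypothetical" are placeholders, not table data.
OLD = {
    "K": 0x1600,  # leading consonant (hypothetical value)
    "a": 0x16E0,  # JUNGSEONG A, old weight as listed in the table excerpt
    "k": 0x1700,  # trailing consonant (hypothetical, between 16E0 and 1773)
    "U": 0xFFC0,  # lead implicit weight of U+4E00 under the old CJK base
}
NEW = {
    "K": 0x1600,  # unchanged (hypothetical value)
    "a": 0xFCE0,  # hypothetical, inside the example range FCE0..FD7E
    "k": 0xFD00,  # hypothetical, inside the example range FCE0..FD7E
    "U": 0xFBC0,  # lead implicit weight of U+4E00 under the new CJK base
}

def primary_key(units, weights):
    """Map a decomposed sequence such as ['K', 'a', 'k'] to its primary weights."""
    return [weights[u] for u in units]

GA, GAG = ["K", "a"], ["K", "a", "k"]

for label, weights in (("old", OLD), ("new", NEW)):
    # Case 1: the bare syllables.
    case1_ok = primary_key(GA, weights) < primary_key(GAG, weights)
    # Case 2: each syllable followed by the ideograph {U}.
    case2_ok = primary_key(GA + ["U"], weights) < primary_key(GAG + ["U"], weights)
    print(label, "GA < GAG:", case1_ok, "| GA+U < GAG+U:", case2_ok)
```

With the old weights the third position compares {U} against {k} and {U} wins, which reverses the order in Case 2. With the new weights {k} is higher than {U}, so both cases agree.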
Note: Such a strategy can also be used for other languages. For any case where trailing characters in a sequence (grapheme cluster, conjunct, etc) are given primary weights above any other characters, tailoring to high weights can produce the right results.