Assessment of Collation Proposals

L2/04-277

Subject:	Assessment of Collation Proposals
Status:	Personal Contribution for SC2 ad hoc on ISO 14651
Author:	Mark Davis
Date:	2004-06-23

We've considered at some length a number of issues raised by Kent in recent documents regarding collation. This document contains an assessment of those proposals, and provides background information.

Assessment of Proposals

Stability of 4.0. There should be no changes should be made in the version of 14651 that is to be in sync with Unicode 4.0. We should preserve that as is, and any new changes should be aimed at Unicode/UCA Version 4.1 and the following version of ISO 14651. (It is unclear whether Kent is asking for changes in this balloted version, but we need to be clear.)
Logical-Order Exceptions. We have looked over the proposal for the logical-order exception characters, changing Thai / Lao to use contractions instead of logical order, as proposed by Kent. While this does make Thai / Lao data larger, it of acceptable size and is much simpler to implement than what is currently in UCA. A corresponding change in UCA would be required, retracting [S1.2].
- This would mean adding a set of contractions for Thai and Lao: for each there are 5 logical-order exception characters, which would form with roughly 30 consonants, adding about 150 contractions to the default table for UCA 4.1 and the corresponding version of ISO 14651.
- The current S1.2 in UCA may seem very simple, but is quite complicated for implementations to get it right, when one considers all three types of normal operations: sort key generation, strcoll (direct string comparison) and searching (requiring both forwards and backwards collation element access). (Note: while ICU already handles it, S1.2 does make it more difficult for other implementations, so it would be better to replace it by just the list of contractions.)
Jamo Decomposition. We do not see the necessity of a radical decomposition of Hangul (decomposing the Jamo). This has never been a customer complaint, and would produce bigger sort keys and be slower in comparison. We would need to verify any proposed change across a broad spectrum of Korean users — users who are made aware of the performance implications for sorting speed and data size.
Trailing Hangul Weights. We agree with using Trailing Weights to fix Hangul cluster collation, but in tailorings — not in the default table. (For a discussion of Trailing Weights vs. Terminators, see Cluster Collation, below.) This is for a few reasons:
- The vast majority of people will not need to have the extra contractions needed, which will significantly burden the whole default table for non-Hangul users.
- A different type of implementation can solve the Hangul issue more generally, without the extra contractions.
- Any real Hangul tailoring will have the CJK Ideographs interleaved, which cannot be done in the default table in any event. Since a tailoring is required for Korean anyway, there is no need to add a burden to the default table that does not fully solve the problem.
Other Languages. For Thai, Lao, and especially Indic, we have not seen sufficient evidence that these languages must be sorted syllabically. And there is a very substantial cost in terms of performance and data size to using Trailing Weights when they are not needed (see Cluster Collation, below). Thus this should not be applied unless there is compelling evidence that it is required for the language, and there is substantial opportunity for review by government and industry.
- One possibility is to request the changes first in CLDR, allow period for review, then fold into the default table.
Default Table Criteria. It has become clear that the UTC and SC2 need to spend some time outlining the factors that we should use when assessing how to add characters to the default table (or whether to change them in a future release). Some initial suggestions are in Annex B.
Small Thai Changes. For Thai, we had received information from our Thai contacts that the following changes should be made to make the UCA in accordance with the Royal Dictionary, and are sufficient to do so*. That should be reflected in Annex C.2 and in the default table. The other changes suggested by Kent have not been confirmed, and should not be made unless more evidence is forthcoming.
1. After U+0E24 ฤ THAI CHARACTER RU
  Insert the sequence: U+0E24 ฤ {THAI CHARACTER RU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
2. After U+0E26 ฦ THAI CHARACTER LU
  Insert the sequence: U+0E26 ฦ THAI CHARACTER LU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
3. After U+0E44 ไ THAI CHARACTER SARA AI MAIMALAI
  Insert the character U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
4. Before U+0E47 ็ THAI CHARACTER MAITAIKHU
  Insert the character U+0E4E ๎ THAI CHARACTER YAMAKKAN
5. After U+0E4B ๋ THAI CHARACTER MAI CHATTAWA
  Insert the character U+0E4C ์ THAI CHARACTER THANTHAKHAT
  Then the character U+0E4D ํ THAI CHARACTER NIKHAHIT
6. After the last secondary ignorable
  Insert the character U+0E2F ฯ THAI CHARACTER PAIYANNOI
  Then the character U+0E46 ๆ THAI CHARACTER MAIYAMOK
  Then the character U+0E4F ๏ THAI CHARACTER FONGMAN
  Then the character U+0E5A ๚ THAI CHARACTER ANGKHANKHU
  Then the character U+0E5B ๛ THAI CHARACTER KHOMUT
7. * Actually, full conformance requires changing space and a couple of other characters to be secondary ignorables, but that cannot be done without a tailoring since it would disturb too many other languages.

Cluster Collation

With cluster collation, certain sequences (clusters) of characters sort as primary units. A special case of cluster collation is called syllabic collation: that is where the clusters correspond to syllables in the language. For background information, see [UCA]. What sorting as primary units means is that if a cluster C₁ sorts as primary-less than a cluster C₂, then that ordering cannot change based on following characters. In other words, we can never have an X and Y such that both of the following are true:

C₁ < C₂
C₁X > C₂Y.

If all clusters in the language have the same number of characters, collation is not a problem. The normal character-by-character comparison works. If they are not all the same length, then something has to be done, otherwise we get the following problem. Suppose C₂ above is really composed of C₁ + C₃. Then any X that sorts after C₃ will cause the above conditions to be violated violated:

Case 1

Case 2

1	C₁
2	C₁	C₃

2	C₁	C₃	Y
1	C₁	X

Where there are a small number of clusters that are of different lengths, the longer clusters can just be treated as contractions. For example, Hungarian has the following clusters: [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} {ly} {ny} {sz} {ty} {zs} {ccs} {ddz} {ddzs} {ggy} {lly} {nny} {ssz} {tty} {zzs}]. Most are single letters, but some are multiple letters (those between {...}), and are handled as contractions in collation. See, for example, the Hungarian CLDR data [Hun].

Where cluster collation has a significant impact is where the clusters have different lengths, and are generative (meaning that there are predictable patterns, but generate a very large number of combinations of clusters). If all possible combinations had to be expressed as contractions, that could lead to tens of thousands of contractions — resulting in poor performance and excessive data size.

Speed. Contractions are slow since they require look-ahead. So defining a contraction slows down any use of the first character of that contraction. That is, if you define a cluster {ch}, then every occurrance of "c" — with or without a following "h" — will cause slower collation. With a small number of short contractions, the impact on performance is small, but if a large number of contractions are defined, that can seriously impact performance.
Size. Contractions require special structure for speed, so their storage in collation tables is many times, in practice, that of single characters. This extra storage has an impact on the use of collation on small memory machines (e.g. PDAs or phones). Even on larger machines, the larger data size means that data keeps getting pushed out of the cache, reducing performance.

Both of these factors are, of course, perceived as strong negatives by users.

Solutions

With the UCA / ISO 14651, much of this generative cluster collation can be accomplished by reordering weights (possibly supplemented by forming contractions) so that a character-by-character comparison produces the right answer, and reducing the need for huge numbers of contractions.

Let's take an example. Suppose that collation clusters in a particular language are of the form of a Base character plus zero to n Trailing characters: symbolically expressed as B_n T_n* (where x* means zero or more x's). If B₁ is primary-less than B₁T₁, then B₁X < B₁T₁, for all X that can start a cluster. The correct ordering can be achieved in UCA by ensuring that X < T₁ for all X that can start a cluster. That means that each T_n have to be higher than each B_n, but also higher than each character outside of the alphabet, such as Latin, Greek, CJK-Ideographs, etc. There is a special area in UCA, the Trailing Weights area, for this purpose, so that the trailing characters can be tailored to that range, and be higher than all single characters. (See [7.1.4].)

This method does have some limitations. Where multiple initial letters need to act internally as a cluster, one has to fall back on contractions. Thus if B₁B₁ acts as a cluster, then a character-by-character comparison would end up with B₁ > T₁. The way to do this is to define the sequence <B₁,B₂> as a contraction that is primary-greater than B₁.

Note that if you have two different languages that both need trailing weights, the above mechanism can still work. As long as the trailing weights cannot start a cluster, there is no conflict among them for different languages, in well formed text. (As always, we don't much care about the ordering of ill-formed clusters, as long as the fundamental consistency requirements of collation are still valid, e.g. transitivity).

The other major alternative for getting the right answer with a case-by-case comparison would be to use Terminators. This would involve adding some syntax to UCA / ISO 14651 that would insert a Terminator after every cluster. The Terminator would be a special weight which is less than all other primary weights in the UCA. For example, we could have syntax like:

@terminate between [B₁...B_n T₁...T_n] and [^T₁...T_n]

This syntax would have the effect of adding a Terminator between any two characters that matched the above, where [^x..y] means all characters that are not in [x...y]. The following example shows how Terminators (shown here as Ⓣ) would be solve the problem cited above. We would never be comparing any X against a trailing part of a cluster; we would always compare a Ⓣ first, which is always less than any character. So the correct order always results.

Case 1

Case 2

1	C₁	Ⓣ
2	C₁	C₃	Ⓣ

1	C₁	Ⓣ	X
2	C₁	C₃	Ⓣ	Y

Here is a table of comparisons of these methods:

	Trailing Weights		Terminator
1	+	Simple; only takes data changes	-	Requires new syntax and code
2	+	Uncompressed Sort Keys shorter	+	Compressed Sort Keys shorter
3	-	Requires use of contractions to work around restrictions	+	More general
4	-	Contraction list might be quite large	+	Doesn't need contractions
5	+	Faster (in general)	-	Slower (in general)

Notes

Trailing Weights can be implemented now, either in the default table for UCA / ISO 14651, or in tailorings. Terminators would require new syntax, and all implementations would need to be upgraded to support that syntax.
Compression of primary weights can produce significantly smaller sort keys, which affects both database size and speed. Compression depends, however, on contiguity; moving trailing characters to a different range breaks that contiguity. For example, by our figures, for n Hangul syllables, given an average syllable length of 2.5 jamo, we get approximately the following:
- uncompressed with Terminators: 7n bytes; with Trailing Weights: 5n bytes
- compressed with Terminators: 2 + 3.5n bytes; with Trailing Weights: 6.5n bytes
Trailing Weights do require the use of contractions to handle initial sequences. And the approach is not as general as Termination. For example, suppose that we had a language with clusters of the form either (a) or (b) below:
1. B_n T_n*
2. T_n*
That is, a Trailing character on its own constitutes a cluster. The Trailing Weights solution does not allow a (b) cluster to be sorted before an (a) cluster. However, for all the languages we know of, this kind of situation does not obtain.
Because contractions are required for initial sequences, the data tables for Trailing Weights are larger than for Terminators.
The speed may depend on the circumstances. As we pointed out above, any contraction is slower, because you have to do a look-ahead, so that works against Trailing Weights for those cases where it is required. On the other hand, almost every character in the alphabet requires a look-ahead with the Terminator solution.

Our overall conclusion is that we should use that Trailing Weights for now, and only consider adding Terminators if we run across a language that really cannot be correctly tailored without it.

Hangul

Hangul has multiple clusters, since the syllables are of the form L+ V+ T*. To do Hangul correctly with Trailing Weights, you need primary weights in the following order (where the left end is zero weight, and the right end is the top weight):

X_b

X_a

The X_b are all the scripts sorted before Hangul, and the X_a are all those sorted after. This ordering gives the right results among the following:

Chars	Comments
L₁V₁X_a
L₁V₁L...
L₁V₁X_b
L₁V₁T₁	Works because T₁ > all X and L
L₁V₁V₂	Works because V₂ > all T
L₁L₂V₁	Works if L₁L₂ is a contraction.

This method requires that all the known LL combinations are treated as contractions. It would not handle any that were not. However, there is relatively good information about when LL combinations actually occurred — in ancient Hangul — and any others could be added over time.

The Terminator method would assign the following weights:

Ⓣ

X_b

X_a

This ordering gives the right results among the following:

Chars	Term Insert	Comments
L₁V₁X_a	L₁V₁ⓉX_a
L₁V₁L...	L₁V₁ⓉL...
L₁V₁X_b	L₁V₁ⓉX_b
L₁V₁T₁	L₁V₁T₁Ⓣ	Works because T₁ > all X and Ⓣ
L₁V₁V₂	L₁V₁V₂Ⓣ	Works because V₂ > all T
L₁L₂V₁	L₁L₂V₁Ⓣ	Works because L₁ > all V

Hangul is in a rather unique position, because of all the precomposed characters, and because those precomposed characters are the normal (NFC) form of interchanged text. Because of that, we have concluded that in implementation it is actually better to take an alternative approach, one that still preserves canonical equivalence, has the right performance/speed characteristics, and handles all characters correctly without requiring contractions.

We mention this only for background information; it is not necessary for the implementation of Hangul sorting. The method is given by the following (it may be easier to follow with the example given immediately after):

Generate a modified weight table, in the following way:
1. Each precomposed syllable has a 3-byte weight, where:
  1. the first byte is identical across the entire range. (An identical first byte allows for primary compression.)
  2. there are 2 gaps between each syllable's weight. That can be done because there are roughly 60K weights in 2 bytes, and 11,170 syllables to distribute across that 60K
    - The first gap is used for tailoring other characters (including Han Ideographs).
    - The second gap is reserved for additional syllables, as described below.
2. Each Jamo has a 1-byte weight, ensuring that all Ts are below all Vs are below all Ls. The lowest weight is a terminator Ⓣ. These weights are separate from the default weights, and are just used internally. We'll refer to the byte weight of L_n as WL_n
When any string of Jamo and/or Hangul Syllables is encountered, break it into syllables according to the definition in the Unicode Standard, Chapter 3. Process each syllable separately:
1. If a syllable is canonically equivalent to one of the precomposed Hangul Syllables, then just assign the weight as above
2. If not, then find the least syllable that it is greater than; call that the base syllable. Find the reserved gap after the base syllable. Generate a weight sequence corresponding to the weights of the gap, followed by all the Jamo, followed by the terminator.

Example

Suppose we find a syllable of the form L₁L₂V₁T₁. We find the precomposed syllable just less than it: L₁V₁T₁. The weights for that syllable and the following will look something like:

L₁V₁T₁	W1, W2, W3
tailoring	W1, W2, W4
gap	W1, W2, W5
L₁V₁T₂	W1, W2, W6

The generated weight will have the form: <W1, W2, W5, WL₁, WL₂, WV₁, WT₁, Ⓣ>. This weight will produce the right ordering, because:

In comparison to any precomposed syllable (or canonical equivalent):
- It is in the right order: greater than L₁V₁T₁, and less than L₁V₁T₂. That is also the case for any non-Hangul character.
In comparison to any other generated form:
- If they have different base syllables they will have the correct ordering (because the bases do).
- If they have the same base syllable, then they will share the first 3 bytes; the correct ordering among the syllables will be given by the sequence of weights, using the termination method.

This mechanism is unique to Hangul because of the algorithmic syllable formation in the Unicode Standard. The advantages over either Trailing Weights or Terminators are that all precomposed syllables — which are by a huge margin the most common cases — get 3 bytes of primary weight. For a sequence of n bytes, this compresses to 2 + 2n bytes. The other syllables sort correctly. They are rather long in bytes, but that doesn't matter because of their extreme infrequency.

This method does require that the Jamo and Hangul Syllables themselves not be tailored. Whenever that is done, we would turn off this mechanism, and fall back to Trailing Weights.

Annex B. Default Table Criteria

We cannot produce a hard and fast set of rule determining how we should place characters in the default table, but the following is a suggested list of factors that should be taken into account. Warning: these factors may in some cases conflict; it still takes judgment to weigh the relative importance of these factors for the particular case.

Common Usage.
1. If x < y in all languages that use x, that is strong reason for having that ordering in the default table. This is especially clear if there is only a single major language that uses a given script.
  - For example, Polish is the only real 'user' of the ł, so the placement of that character should follow Polish conventions.
2. If some languages have x < y and others don't, then the choice should be based on frequency of occurrence in world data
  - Thus the ordering of ä (e.g. ä < b versus z < ä) should be based on its frequency of usage in the languages where it occurs (German, Swedish,...) and the relative amount of data in those languages, and the frequency of occurrence of the character in those languages. That argues for having ä sort as in German.
  - As another example, æ sorts as (a) an expansion of 'ae' in a wide number of languages (e.g. cæsium or Cæsar), but in infrequent usage; or (b) above z very commonly in a few languages (with relatively small populations); or (c) sorts after a in very few languages, with little data in actual use. Thus the best choice is either (a) or (b).
  - Note that behavior in clusters must be taken into account. Thus if "ch" sorts one way in Slovak, one has to look at all other languages where the sequence "ch" could occur — not just those languages where it is a cluster.
Default Table Size / Complexity. However, if the amount of data required for a given language would unduly burden all users of the default table, that may be reason to not include the tailorings.
- For example, the default table does not contain all the contractions that would be needed to get Japanese behavior correct, including length marks, etc. Instead, that is left to a tailoring.
Other Factors.
1. If x is generally considered an accent variant of another character (e.g. å, ę ħ, ł, ñ, ʑ, ...) then it should be a secondary (level 2) variant of the other in the default table.
  - This may be overridden by a common usage factor, as above.
2. If x is generally considered a presentation or case variant of another character (e.g. ❶, ➀, ➊, １, ⒵, ⓩ, ｚ, ...), then it should be a tertiary (level 3) variant of the other in the default table.
  - In particular, the only current failure in the default table is that German ß uppercases to SS, so it should be a tertiary difference, not a secondary difference. This is patterned after DIN 5007; however, we believe it to be a mistake in that standard and in the default table. (Note also that DIN 5007 as it stands cannot be fully implemented with UCA / ISO 14651, since it requires human judgment: it was written more for librarians than computers.) See [DIN]
3. Even if punctuation, digits, and other symbols may sort after some letters in a language, for uniformity they are all grouped together with similar sorts of characters in the default table.
Stability.
1. Stability of the ordering is important; where we are discussing a change to an existing order, the evidence should be very clear based on common usage.
2. That being said, we have told national bodies that whereas we cannot change character positions in UCS, the appropriate action is to change the sort order by changing the default table; so we need to be responsive where there are clear cases.

References

[UCA]	UTS #10: Unicode Collation Algorithm, http://www.unicode.org/reports/tr10/
[Hun]	http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/main/hu.xml, http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/collation/hu.xml.
[7.1.4]	UTS #10: Unicode Collation Algorithm, Section 7.1.4 "Trailing Weights", http://www.unicode.org/reports/tr10/#Trailing_Weights
[S1.2]	UTS #10: Unicode Collation Algorithm, Section 4 "Main Algorithm", Section 4.1 "Normalize each input string", S1.2 http://www.unicode.org/reports/tr10/#Step_1
[DIN]	http://www.unicode.org/L2/L2004/04140-din-letter.html