L2/04-277
Subject: |
Assessment of Collation Proposals |
Status: |
Personal Contribution for SC2 ad hoc on ISO 14651 |
Author: |
Mark Davis |
Date: |
2004-06-23 |
We've considered at some length a number of issues raised by Kent in recent documents regarding collation. This document contains an assessment of those proposals, and provides
background information.
- Stability of 4.0. There should be no changes should be made in the version of 14651 that is to be in sync with Unicode 4.0. We should preserve that as
is, and any new changes should be aimed at Unicode/UCA Version 4.1 and the following version of ISO 14651. (It is unclear whether Kent is asking for changes in this balloted
version, but we need to be clear.)
- Logical-Order Exceptions. We have looked over the proposal for the logical-order exception characters, changing Thai / Lao to use contractions instead of logical
order, as proposed by Kent. While this does make Thai / Lao data larger, it of acceptable size and is much simpler to implement than what is currently in UCA. A corresponding
change in UCA would be required, retracting [S1.2].
- This would mean adding a set of contractions for Thai and Lao: for each there are 5 logical-order exception characters, which would form with roughly 30 consonants,
adding about 150 contractions to the default table for UCA 4.1 and the corresponding version of ISO 14651.
- The current S1.2 in UCA may seem very simple, but is quite complicated for implementations to get it right, when one considers all three types of normal operations:
sort key generation, strcoll (direct string comparison) and searching (requiring both forwards and backwards collation element access). (Note: while ICU already handles it,
S1.2 does make it more difficult for other implementations, so it would be better to replace it by just the list of contractions.)
- Jamo Decomposition. We do not see the necessity of a radical decomposition of Hangul (decomposing the Jamo). This has never been a customer complaint, and would
produce bigger sort keys and be slower in comparison. We would need to verify any proposed change across a broad spectrum of Korean users — users who are made aware of the
performance implications for sorting speed and data size.
- Trailing Hangul Weights. We agree with using Trailing Weights to fix Hangul cluster collation, but in tailorings — not in the default table. (For a discussion
of Trailing Weights vs. Terminators, see Cluster Collation, below.) This is for a few reasons:
- The vast majority of people will not need to have the extra contractions needed, which will significantly burden the whole default table for non-Hangul users.
- A different type of implementation can solve the Hangul issue more generally, without the extra contractions.
- Any real Hangul tailoring will have the CJK Ideographs interleaved, which cannot be done in the default table in any event. Since a tailoring is required
for Korean anyway, there is no need to add a burden to the default table that does not fully solve the problem.
- Other Languages. For Thai, Lao, and especially Indic, we have not seen sufficient evidence that these languages must be sorted syllabically. And there is a very
substantial cost in terms of performance and data size to using Trailing Weights when they are not needed (see Cluster Collation, below). Thus
this should not be applied unless there is compelling evidence that it is required for the language, and there is substantial opportunity for review by government and
industry.
- One possibility is to request the changes first in CLDR, allow period for review, then fold into the default table.
- Default Table Criteria. It has become clear that the UTC and SC2 need to spend some time outlining the factors that we should use when assessing how to add characters
to the default table (or whether to change them in a future release). Some initial suggestions are in Annex B.
- Small Thai Changes. For Thai, we had received information from our Thai contacts that the following changes should be made to make the UCA in accordance with the Royal
Dictionary, and are sufficient to do so*. That should be reflected in Annex C.2 and in the default table. The other changes suggested by Kent have not been confirmed, and
should not be made unless more evidence is forthcoming.
- After U+0E24 ฤ THAI CHARACTER RU
Insert the sequence: U+0E24 ฤ {THAI CHARACTER RU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
- After U+0E26 ฦ THAI CHARACTER LU
Insert the sequence: U+0E26 ฦ THAI CHARACTER LU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
- After U+0E44 ไ THAI CHARACTER SARA AI MAIMALAI
Insert the character U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
- Before U+0E47 ็ THAI CHARACTER MAITAIKHU
Insert the character U+0E4E ๎ THAI CHARACTER YAMAKKAN
- After U+0E4B ๋ THAI CHARACTER MAI CHATTAWA
Insert the character U+0E4C ์ THAI CHARACTER THANTHAKHAT
Then the character U+0E4D ํ THAI CHARACTER NIKHAHIT
- After the last secondary ignorable
Insert the character U+0E2F ฯ THAI CHARACTER PAIYANNOI
Then the character U+0E46 ๆ THAI CHARACTER MAIYAMOK
Then the character U+0E4F ๏ THAI CHARACTER FONGMAN
Then the character U+0E5A ๚ THAI CHARACTER ANGKHANKHU
Then the character U+0E5B ๛ THAI CHARACTER KHOMUT
- * Actually, full conformance requires changing space and a couple of other characters to be secondary ignorables, but that cannot be done without a tailoring since it
would disturb too many other languages.
With cluster collation, certain sequences (clusters) of characters sort as primary units. A special case of cluster collation is called syllabic collation: that is where
the clusters correspond to syllables in the language. For background information, see [UCA]. What sorting as primary units means is that if a cluster C1
sorts as primary-less than a cluster C2, then that ordering cannot change based on following characters. In other words, we can never have an X and Y such that
both of the following are true:
If all clusters in the language have the same number of characters, collation is not a problem. The normal character-by-character comparison works. If they are not all the same
length, then something has to be done, otherwise we get the following problem. Suppose C2 above is really composed of C1 + C3. Then any X that
sorts after C3 will cause the above conditions to be violated violated:
Where there are a small number of clusters that are of different lengths, the longer clusters can just be treated as contractions. For example, Hungarian has the following
clusters: [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} {ly} {ny} {sz} {ty} {zs} {ccs} {ddz} {ddzs} {ggy} {lly} {nny} {ssz} {tty} {zzs}]. Most are single
letters, but some are multiple letters (those between {...}), and are handled as contractions in collation. See, for example, the Hungarian CLDR data [Hun].
Where cluster collation has a significant impact is where the clusters have different lengths, and are generative (meaning that there are predictable patterns, but generate a
very large number of combinations of clusters). If all possible combinations had to be expressed as contractions, that could lead to tens of thousands of contractions — resulting
in poor performance and excessive data size.
- Speed. Contractions are slow since they require look-ahead. So defining a contraction slows down any use of the first character of that contraction. That is, if
you define a cluster {ch}, then every occurrance of "c" — with or without a following "h" — will cause slower collation. With a small number of short
contractions, the impact on performance is small, but if a large number of contractions are defined, that can seriously impact performance.
- Size. Contractions require special structure for speed, so their storage in collation tables is many times, in practice, that of single characters. This extra storage
has an impact on the use of collation on small memory machines (e.g. PDAs or phones). Even on larger machines, the larger data size means that data keeps getting pushed out of
the cache, reducing performance.
Both of these factors are, of course, perceived as strong negatives by users.
With the UCA / ISO 14651, much of this generative cluster collation can be accomplished by reordering weights (possibly supplemented by forming contractions) so that a
character-by-character comparison produces the right answer, and reducing the need for huge numbers of contractions.
Let's take an example. Suppose that collation clusters in a particular language are of the form of a Base character plus zero to n Trailing characters: symbolically
expressed as Bn Tn* (where x* means zero or more x's). If B1 is primary-less than B1T1, then B1X < B1T1,
for all X that can start a cluster. The correct ordering can be achieved in UCA by ensuring that X < T1 for all X that can start a cluster. That means that each Tn
have to be higher than each Bn, but also higher than each character outside of the alphabet, such as Latin, Greek, CJK-Ideographs, etc. There is a special area in UCA,
the Trailing Weights area, for this purpose, so that the trailing characters can be tailored to that range, and be higher than all single characters. (See [7.1.4].)
This method does have some limitations. Where multiple initial letters need to act internally as a cluster, one has to fall back on contractions. Thus if B1B1
acts as a cluster, then a character-by-character comparison would end up with B1 > T1. The way to do this is to define the sequence <B1,B2>
as a contraction that is primary-greater than B1.
Note that if you have two different languages that both need trailing weights, the above mechanism can still work. As long as the trailing weights cannot start a cluster, there
is no conflict among them for different languages, in well formed text. (As always, we don't much care about the ordering of ill-formed clusters, as long as the fundamental
consistency requirements of collation are still valid, e.g. transitivity).
The other major alternative for getting the right answer with a case-by-case comparison would be to use Terminators. This would involve adding some syntax to UCA / ISO 14651
that would insert a Terminator after every cluster. The Terminator would be a special weight which is less than all other primary weights in the UCA. For example, we could have
syntax like:
@terminate between [B1...Bn T1...Tn] and [^T1...Tn]
This syntax would have the effect of adding a Terminator between any two characters that matched the above, where [^x..y] means all characters that are not in [x...y]. The
following example shows how Terminators (shown here as Ⓣ) would be solve the problem cited above. We would never be
comparing any X against a trailing part of a cluster; we would always compare a Ⓣ first, which is always less than any character. So the
correct order always results.
Here is a table of comparisons of these methods:
|
Trailing Weights |
Terminator |
1 |
+ |
Simple; only takes data changes |
- |
Requires new syntax and code |
2 |
+ |
Uncompressed Sort Keys shorter |
+ |
Compressed Sort Keys shorter |
3 |
- |
Requires use of contractions to work around restrictions |
+ |
More general |
4 |
- |
Contraction list might be quite large |
+ |
Doesn't need contractions |
5 |
+ |
Faster (in general) |
- |
Slower (in general) |
Notes
- Trailing Weights can be implemented now, either in the default table for UCA / ISO 14651, or in tailorings. Terminators would require new syntax, and all implementations
would need to be upgraded to support that syntax.
- Compression of primary weights can produce significantly smaller sort keys, which affects both database size and speed. Compression depends, however, on contiguity; moving
trailing characters to a different range breaks that contiguity. For example, by our figures, for n Hangul syllables, given an average syllable length of 2.5 jamo, we
get approximately the following:
- uncompressed with Terminators: 7n bytes; with Trailing Weights: 5n bytes
- compressed with Terminators: 2 + 3.5n bytes; with Trailing Weights: 6.5n bytes
- Trailing Weights do require the use of contractions to handle initial sequences. And the approach is not as general as Termination. For example, suppose that we had a
language with clusters of the form either (a) or (b) below:
- Bn Tn*
- Tn*
That is, a Trailing character on its own constitutes a cluster. The Trailing Weights solution does not allow a (b) cluster to be sorted before an (a) cluster. However, for all
the languages we know of, this kind of situation does not obtain.
- Because contractions are required for initial sequences, the data tables for Trailing Weights are larger than for Terminators.
- The speed may depend on the circumstances. As we pointed out above, any contraction is slower, because you have to do a look-ahead, so that works against Trailing Weights for
those cases where it is required. On the other hand, almost every character in the alphabet requires a look-ahead with the Terminator solution.
Our overall conclusion is that we should use that Trailing Weights for now, and only consider adding Terminators if we run across a language that really cannot be correctly
tailored without it.
Hangul has multiple clusters, since the syllables are of the form L+ V+ T*. To do Hangul correctly with Trailing Weights, you need primary weights in the following order (where
the left end is zero weight, and the right end is the top weight):
The Xb are all the scripts sorted before Hangul, and the Xa are all those sorted after. This ordering gives the right results among the following:
Chars |
Comments |
L1V1Xa |
|
L1V1L... |
|
L1V1Xb |
|
L1V1T1 |
Works because T1 > all X and L |
L1V1V2 |
Works because V2 > all T |
L1L2V1 |
Works if L1L2 is a contraction. |
This method requires that all the known LL combinations are treated as contractions. It would not handle any that were not. However, there is relatively good information about
when LL combinations actually occurred — in ancient Hangul — and any others could be added over time.
The Terminator method would assign the following weights:
This ordering gives the right results among the following:
Chars |
Term Insert |
Comments |
L1V1Xa |
L1V1ⓉXa |
|
L1V1L... |
L1V1ⓉL... |
|
L1V1Xb |
L1V1ⓉXb |
|
L1V1T1 |
L1V1T1Ⓣ |
Works because T1 > all X and Ⓣ |
L1V1V2 |
L1V1V2Ⓣ |
Works because V2 > all T |
L1L2V1 |
L1L2V1Ⓣ |
Works because L1 > all V |
Hangul is in a rather unique position, because of all the precomposed characters, and because those precomposed characters are the normal (NFC) form of interchanged text.
Because of that, we have concluded that in implementation it is actually better to take an alternative approach, one that still preserves canonical equivalence, has the right
performance/speed characteristics, and handles all characters correctly without requiring contractions.
We mention this only for background information; it is not necessary for the implementation of Hangul sorting. The method is given by the following (it may be easier to
follow with the example given immediately after):
- Generate a modified weight table, in the following way:
- Each precomposed syllable has a 3-byte weight, where:
- the first byte is identical across the entire range. (An identical first byte allows for primary compression.)
- there are 2 gaps between each syllable's weight. That can be done because there are roughly 60K weights in 2 bytes, and 11,170 syllables to distribute across that 60K
- The first gap is used for tailoring other characters (including Han Ideographs).
- The second gap is reserved for additional syllables, as described below.
- Each Jamo has a 1-byte weight, ensuring that all Ts are below all Vs are below all Ls. The lowest weight is a terminator Ⓣ. These
weights are separate from the default weights, and are just used internally. We'll refer to the byte weight of Ln as WLn
- When any string of Jamo and/or Hangul Syllables is encountered, break it into syllables according to the definition in the Unicode Standard, Chapter 3. Process each syllable
separately:
- If a syllable is canonically equivalent to one of the precomposed Hangul Syllables, then just assign the weight as above
- If not, then find the least syllable that it is greater than; call that the base syllable. Find the reserved gap after the base syllable. Generate a weight sequence
corresponding to the weights of the gap, followed by all the Jamo, followed by the terminator.
Example
Suppose we find a syllable of the form L1L2V1T1. We find the precomposed syllable just less than it: L1V1T1.
The weights for that syllable and the following will look something like:
L1V1T1 |
W1, W2, W3 |
tailoring |
W1, W2, W4 |
gap |
W1, W2, W5 |
L1V1T2 |
W1, W2, W6 |
The generated weight will have the form: <W1, W2, W5, WL1, WL2, WV1, WT1, Ⓣ>. This weight will produce the right ordering, because:
- In comparison to any precomposed syllable (or canonical equivalent):
- It is in the right order: greater than L1V1T1, and less than L1V1T2. That is also the case for any
non-Hangul character.
- In comparison to any other generated form:
- If they have different base syllables they will have the correct ordering (because the bases do).
- If they have the same base syllable, then they will share the first 3 bytes; the correct ordering among the syllables will be given by the sequence of weights, using the
termination method.
This mechanism is unique to Hangul because of the algorithmic syllable formation in the Unicode Standard. The advantages over either Trailing Weights or Terminators are that all
precomposed syllables — which are by a huge margin the most common cases — get 3 bytes of primary weight. For a sequence of n bytes, this compresses to 2 + 2n bytes. The other
syllables sort correctly. They are rather long in bytes, but that doesn't matter because of their extreme infrequency.
This method does require that the Jamo and Hangul Syllables themselves not be tailored. Whenever that is done, we would turn off this mechanism, and fall back to Trailing
Weights.
Annex B. Default Table Criteria
We cannot produce a hard and fast set of rule determining how we should place characters in the default table, but the following is a suggested list of factors that should be
taken into account. Warning: these factors may in some cases conflict; it still takes judgment to weigh the relative importance of these factors for the particular case.
- Common Usage.
- If x < y in all languages that use x, that is strong reason for having that ordering in the default table. This is especially clear if there is only a single major
language that uses a given script.
- For example, Polish is the only real 'user' of the ł, so the placement of that character should follow Polish conventions.
- If some languages have x < y and others don't, then the choice should be based on frequency of occurrence in world data
- Thus the ordering of ä (e.g. ä < b versus z < ä) should be based on its frequency of usage in the languages where it occurs (German, Swedish,...) and the
relative amount of data in those languages, and the frequency of occurrence of the character in those languages. That argues for having ä sort as in German.
- As another example, æ sorts as (a) an expansion of 'ae' in a wide number of languages (e.g. cæsium or Cæsar), but in infrequent usage; or (b) above
z very commonly in a few languages (with relatively small populations); or (c) sorts after a in very few languages, with little data in actual use. Thus the best choice
is either (a) or (b).
- Note that behavior in clusters must be taken into account. Thus if "ch" sorts one way in Slovak, one has to look at all other languages where the sequence
"ch" could occur — not just those languages where it is a cluster.
- Default Table Size / Complexity. However, if the amount of data required for a given language would unduly burden all users of the default table, that may be reason to
not include the tailorings.
- For example, the default table does not contain all the contractions that would be needed to get Japanese behavior correct, including length marks, etc. Instead, that is
left to a tailoring.
- Other Factors.
- If x is generally considered an accent variant of another character (e.g. å, ę ħ, ł, ñ, ʑ,
...) then it should be a secondary (level 2) variant of the other in the default table.
- This may be overridden by a common usage factor, as above.
- If x is generally considered a presentation or case variant of another character (e.g. ❶, ➀, ➊, 1, ⒵, ⓩ, z,
...), then it should be a tertiary (level 3) variant of the other in the default table.
- In particular, the only current failure in the default table is that German ß uppercases to SS, so it should be a tertiary difference, not a secondary
difference. This is patterned after DIN 5007; however, we believe it to be a mistake in that standard and in the default table. (Note also that DIN 5007 as it stands
cannot be fully implemented with UCA / ISO 14651, since it requires human judgment: it was written more for librarians than computers.) See [DIN]
- Even if punctuation, digits, and other symbols may sort after some letters in a language, for uniformity they are all grouped together with similar sorts
of characters in the default table.
- Stability.
- Stability of the ordering is important; where we are discussing a change to an existing order, the evidence should be very clear based on common usage.
- That being said, we have told national bodies that whereas we cannot change character positions in UCS, the appropriate action is to change the sort order
by changing the default table; so we need to be responsive where there are clear cases.