Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 15 2003 - 05:36:05 EDT

  • Next message: Marco Cimarosti: "RE: how to sort by stroke (not radical/stroke)"

    After studying how the default UCA collation keys are assigned in the existing (3.1.1) allkeys.txt file, I can figure out that this huge table (used as a base for all further tailorings) can really be simplified a lot (note that the 4th additional key contains the code for the base character and can always be dropped and generated automatically):

    Note above, my comments or suggestions for improvements are shown within *** pairs.

    1) First there's a set of fully ignorable characters, with no specific order (the 4th key is always 0): this designates all characters that can be safely deleted from the input string for comparison purpose. They could be represented by just listing the characters or some general character classes like Cc for control characters in ISO636 (0-1F, 7F) or ISO6429 (80-9F), and format characters like ZeroWidthSpace and Bidi Controls like LeftToRightOverride (200B-200F, FEFF), Shaping controls (206A-206F), Interlinear Annotations (FFF9-FFFC), LanguageTags (E0001, E0020-E007F), and very few script-specific formatters in Syriac and Mongolian (70F, 180B-180E). These can be handled at run-time in UCA applications by exceptions to general table lookups for each script (using the small list of Unicode allocation blocks to determine the behavior).
    All of them use [.0.0.0.0], and could be represented simply by a single [.00] byte in collation keys for strings (other bytes implied).

    2) Then there are a set of optionally ignorable characters (which are ignored at level 3 or below), for scripts in this order: Cyrillic, Hebrew, Arabic, Thai, Tibetan, Musical combining symbols. Note that this order corresponds also to the default order, and correspond to extended versions of these scripts, with characters that are not needed (most often) to recognize words (such as Hebrew points).
    All of them use key [.0.0.0.UUUUU] where UUUUU is the original character. Here also the key is implicit, and can be represented by a single [.00] byte in collation keys for strings (other bytes implied).

    Note that on all the other collation keys discussed after, the L4 key is always the codepoint for the original (not canonically decomposed) codepoint, even if it's a compatibility character with a singleton canonical decomposition. So we do not need to store the L4 key which can be computed algorithmically by a simple identity function.

    3) Then there are all combining characters. They are ordered by script (but I think ***they should be first ordered by combining class to also match directly the NFD order or grouped matching when sorting NFC or NFKC strings***). All their L1 key are 0, and these characters arrange in each category mainly on L2 key, where L3 key is almost always 02, and L4 is infered implicitly as indicated above.

    3.1) general combining diacritics for Latin/Cyrillic/Greek in block U+03xx. Here also, note that a few are canonically decomposable or equivalents, and in this case, the NFD equivalent indicates the real characters for which the collation keys are infered. These are marked with a QQC or QQK field in the comment line, and could be removed from the table as they are computable. Matching exactly the existing UCA order requires writing a few tailoring rules for the begining of this set, for example the Combining LowLine, CommaAbove, ReversedCommaAbove are given lowest L2 priority, despite they have a higher combining class than the following diacritics (probably because their language usage carries very little importance, or they are considered pedantic and their use is almost optional in modern written language and do not even change the intonation, or are defined for completeness but actually not used in existing standardized languages). Another inversion is the solidus overlay (I don't know why), the Cedilla
    and Ogonek which come before the Macron, and a mix of diacritics starting at CombiningXAbove

    3.2) Then there are "ligature halves" for diacritics that should be displayed on the grapheme cluster of the last two characters

    3.3) Then there are script specific diacritics, ordered by their leading compatibility decomposition: Cyrillic, Hebrew points, Arabic (many compatibility equivalent deduced from NFKD, are inserted for initial, medial, final, or ligated forms, and when this occurs the L3 is modified from the default 02 value to the compatibility value described in the UCA technical report), Syriac, Brahmic diacritics (Devanagari to Lao: nukta, candrabindu, anusvara, visarga, anusvara), Tibetan, Myanmar (similar to other Indic scripts), Khmer, Ideographic/Hangul/Kana tone marks, The Unicode codepoint of these characters determine their order in the L1 key. No tailoring needed, L3 is always 02,

    3.4) Then there are the combining technical characters (in U+20D0 to U+20E3). The L3 key is implied (02).

    Then comes non ignorale characters. the NFKD decomposition drives the ordering of collation keys, which are assigned with a non zero L1 key starting at [.0201.], and grouped in distinct ranges. If strings are already decomposed to NFKD, then combining diacritics are separated and handled by the previous rules, and there remains only base characters, which are distinct by their L1 key, L2 is always [..20.] and L3 is always [..02.] unless this is a compatibility character which is assigned a L3 key higher than [..02.] according to the UCA technical report for each class (this means that only the L1 key needs to be stored for canonical characters without compatibility decomposition for the UCA table, as other collation values are implied.

    4) The first range covers all standard Unicode 3 scripts with the exception of Han.

    4.1) First the standard spacing controls (e.g. TAB or LF), ordered by codepoint.

    4.2) Then starts the subsection for variable collation keys. This section is quite complex andshould be given by tailoring them in a lookup table:

    4.2.1) First are space characters (and their compatibility equivalent), all collated to the same L1 key, and then non-blank characters that play the role of a spacing mark (Ogham and Arabic).

    4.2.2) Then there are the spacing version of diacritics, simply ordered by codepoint (***I think that it's a shame that their order do not match the order of non-spacing combining diacritics, and this creates confusion***)

    4.2.3) Then comes the soft hyphens or syllable boundary "markers" (***it's a shame that some "soft" hyphens are placed after the real hyphen-minus which always has a visible glyph in any context***), and then other hypens or dashes, and KanaMiddleDots (all these are placed in the codepoint order, with a single tailoring for the generic Soft-Hyphen placed at the beginning of this set)

    4.2.4) Then comes generic in-sentence punctuation (and their compatibility equivalents): comma, semicolon, colon, followed by very script specific equivalent punctuations for groups in the same sentence (ordered by script, using the standard script order, then ordered by codepoint)

    4.2.5) Then are sentence termination punctuation: exclamation, question, full stop, leaders/ellipsis. I think that the small fullstop is not at the right place and should come before leaders just after fullwidth full stop. I also think that leaders should all have the same L3 key in the decomposition (not true for the ellipsis which is three leaders, but is decomposed with a distinct L3 key for the third one). Other script-specific full stop follow, ordered by script.

    4.2.6. Then comes subsets of apostrophe, quotation marks. Some tailoring is present here with some inversions to make similar signs closer (angle quotation marks are placed after strait and comma-shaped quotation marks).

    4.2.7. Then comes parentheses (note that all compatibility decomposable parenthesized characters are ordered at the opening parenthesis, according to their decomposition, so all these could be dropped from the UCA table and computed), and their compatibility equivalent; same thing for other pairs: square bracket, curly bracket, tibetan and ogham similar marks, square bracket with quill...

    4.2.8. Then comes general technical/typographical symbols: section marks, commercial (copyright, registered, at), mathematical (asterisk, solidus, ampersand, number sign, percent, permille, per ten thousand), daggers and bullets, and prime (***I think that prime characters should be placed after apostrophes in 4.2.6***), ditto, insertion marks (caret, reference mark, character tie...).

    4.2.9. Then comes script-specific punctuation, in the common script order then sorted by codepoint: Armenian, Hebrew, Syriac, Mongolian, Brahmic (Devanagari to Thai), Tibetan, Myanmar, Khmer. (*** I think that some of them, such as Armenian apostrophe, should be reordered after some general punctuation groups above in 4.2.4 to 4.2.8 ***)

    4.2.10) Then comes some problematic sections for modifier letters and some signs: this is where prime is defined as a modifier (*** but should'nt it be ordered with the technical primes in 4.2.8, which themseles should sort after apostrophes in 4.2.6?***), other modifier letters (***which are very near from spacing diacritics in 4.2.2 above and should better e placed there for collation purpose***), and the degree sign (***should better be sorted just the spacing ring in 4.2.2***). Tailoring may correct this, but the UCA table should work in this.

    4.2.11) Then comes some script-specific technical signs or marks (not punctuation), simply ordered by codepoints, starting at the cyrillic thousands sign (most of them are Tibetan, but there are one for Arabic, Bengali, Oriya, Thai, and Canadian Syllabics).

    4.2.12) Then are all the arrow symbols (a few are canonically or compatibly decomposable and marked with a QQC and QQK field in the comment, canonical decompositions using a combining overlay solidus)

    4.2.13) Then are all the mathemetical symbols. (***Some of them should be reordered with Greek characters or with characters with non mathematical semantics but with identical script name, to match them with usage notaly for searches where many confusions are possible***). The current ordering is strange: look for example classical 4 mathematical operators (plus sign, division sign,multiplication sign) not grouped with minus sign,division slash, divides. Some of them will definitely match poorly in searches such as tilde, angle, union, and the remaining ones should be grouped by function (identical to, equivalent,...). Some operators also exist with circled variants that should be grouped with their non circled variants, like for circled digits and letters. Specific tailoring is complex, and UCA should evolve to map them more logically. For now, this block just uses the codepoint ordering.

    4.2.14) Then comes technical symbols, ordered by codepoints. ***Some of them may be "unifiable" by tailoring with other generic characters for search purpose (for example the diameter symbol may be decomposable non compatibly as a circle and solidus overlay, and some APL functional symbols could be ordered with Greek letters or mathematical symbols, or decomposed non compatibly such as symbols with dieresis)***.

    4.2.15) Then comes symbols for control characters (there are not ignorable, like their control counterparts, so it's best to keep them ordered as they are by codepoint order.), OCR characters, box drawing characters, block and geometric characters, dingbats and wingdings (***the placement of asterisk-like and bullet-like symbols, or ornamented punctuations and arrows could be discussed***), Braille patterns (assuming that they cannot be confused with punctuation in any context, they howerver represent characters according to locale-specific transliteration rules, and they would never be searched as is, so sorting them does not seem relevant, and it's best to keep them sorted as is by their pattern codepoint), byzantine musical symbols, and venitian/western musical symbols (***some dingbats above could be rearranged to this subrange***).

    4.2.16) Then comes Asian symbols: ideographic description characters couldbe used to create decomposed ideographs by radical, within a Han rendering engine to limit the size of required fonts. It is followed by some CJK symbols (***the ordering of the ideographic variation indicator is questionable***)

    4.2.17) Here are the character and object replacement characters (***I don't think this is the right place to put them, as they probably should be at end of the "variable" collation elements before modifier letters***)

    4.2.18) The last part in variable collation elements is for international non-base10 numbers or digits (***shouldn't they sort after other base10 digits in the non variable section?***)

    This terminates the variable block, which may still need some reworks, to have more "similar" characters tailored to match usage, notably for punctuations. There's no gap here before the non variable section

    4.3) Then comes modifier letters for length or repeat marks (***may be they should sort in the variable block like other spacing diacritics?***)

    4.4) Then comes all currency signs.

    4.5) Then letter-like symbols (***shouldn't they sort with their corresponding letters?***)

    4.6) Then non-base10 Roman numeral (***why don't they sort like characters in section 4.2.18?***)

    4.7) Then music signs (*** why don't they sort with musical characters insection 4.2.15?***)

    4.8) Then base10 digits (composite compatibility characters are sorted with their decomposed digit value, but with a distinct L3 value). They are not ordered by script, as they all use the same L1 key, but they are decomposed as if there was a diacritic after the ASCII decimal digit, in a supplementary NFKD-like decomposition. This pseudo-diacritic is simply acting as a script modifier, one for each script, allocated in the standard script order after those declared in section 3 above with a zero L1 key, and a new L2 key value per script, and a default (implied) L3 key value 02. (This decomposition is quite similar to similar annotations used in some scripts like Greek that combine a letter and a numeric modifier diacritic to create traditional numbers, except that Unicode UCD does not define such pseudo-diacritic for script transliteration in scripts that have distinct characters for digits, and so the UCD cannot use it in NFD or NFKD decompositions). There's such pseudo-diacritic defined in UCA decompositi
    ons for Arabo-Indic, Extended Arabo-Indic, Brahmi scripts (from Devanagari to Thai and Lao), Tibetan, Myanmar, Ethiopic, Khmer, Mongolian, Hangzhou (traditional Chinese written dialect).
    In this section, ideographic symbols for months, days, and hours are decomposed as per their decomposed NFKD value (which includes one or 2 digits and a ideographic character), and sorted as per their first digit, only changing the L3 key of digits to 04, and appending a HAN collator for the ideographic character for month/day/hour (splitted in two parts, one collating in [L1=FB40+high bits, L2=default 20, L3=max 1F], the second one collating in [8000+low bits, L2=0, L3=0]); other digit variants are also added using the L3 key described in the tale of the UCA reference, as well as fractions decomposed the standard way like other NFKD decompositions. So this large set of UCA entries can be reduced to just 10, for the base digits, with help of NFKD and script-specific pseudo-decomposition.

    4.9) Then comes all Latin letters (accented versions are also decomposed with their NFD, then NFKD only when creating UCA collation keys), from A to Z. All font variants (such as mathematical) collate to the same distinct L3 value, but share the same L1/L2 keys than the base letters. Capital versions are just changing the L2 key from the base small letter.
    Some letters are inserted in the list, to collate specially at some place (for example the ligated "ae" letter is placed between "a" and "b", "dz" is placed between "d" and "e", "ij" is placed between i and j, german sharp s is treated like a pair of s, as if all those were ligatures, Eth is somewhere after "dz" but before "e"), but ***there are some places to tailor these extended Latin letters (for example the Latin alpha, or turned A or turned alpha)***.
    Some square symbols used in ideographic scripts to represent international units are decomposed to NFKD before generating UCA collation keys, as well as symbols like "c/o, and roman numerals independantly of their numeric value (but why only roman numerals 1 to 9, and not higher ones?
    ***In my opinion, this order should also interleave transliterated Greek and Cyrillic characters, notably those that are highly similar like A and Alpha, look for example such usage in Serbian that can be written either with Latin or Cyrillic scripts. The UCA would mostly match the Latin sort order, matching the traditional Cyrillic or Greek sort order is a question of language, so of application level tailoring that should be done too in sync for Latin***

    4.10) Then comes Greek letters, in traditional Greek sort order as per the basic Greek codepoints (diacritics decomposed, and case marked in L2 key), plus Coptic letters ***See my comment on this in section 4.9 (for example look how Latin alpha is sorted: why shouldn't it sort like Greek alpha, and not only as a Latin A ?). APL functional symbols should also be sorted there (see my comment in 4.2.14)***. Here also their mathematical font variants are sorted along the corresponding base Greek letter from the NFKD decomposition.

    4.11) Then comes Cyrillic letters, in traditional Cyrillic sort order as per th basic Cyrillic codepoints (diacritics decomposed, and case marked in L2 key). ***See my comment on this in section 4.9 (for example look how Serbian sorts in both Latin or Cyrillic scripts ?). Some extended letters at end of the basic cyrillic alphabet don't have a clear sort order (so tailoring may be needed in applications).

    4.12) Then comes Georgian and then Armenian (but case difference is only a L3 variant, as if it was a diacritic).

    4.13) Then comes Hebrew (no special problem here, as hebrew points are decomposed canonically, ***however some lines are commented out, disabling the canonical decomposition of dagesh diacritics on compatibility final forms***), and then Arabic (and its many compatibility forms or ligatures). Here also, most lines could be dropped from the table using simply the NFD and NFKD decompositions.

    4.14) Then comes Syriac (note that some letters are decomposed with the Garshuni diacritic, which is still not defined in Unicode as a separate diacritic with a codepoint, so the decomposition appears only in UCA), Thaana,

    4.15) Ethiopic syllables (could use a pseudo-decomposition as pseudo-consonnant + pseudo-vowel sign, similar to Brahmic scripts with an implied A)

    4.16) Brahmic scripts are sorted individually, ***despite they have a common structure, and could be combined (using official transliteration rules), at least Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam***

    4.17) Then comes the more complex Sinhala alphabet. ***However some lines in the UCA table are decomposing sequences that represent the same non-decomposed letter with identical collation; some are commented out, some aren't I just wonder if it's an error or temporary edit in the file that was forgotten***

    4.18) Then comes Thai and Lao characters ***(with ambiguities coming from multiple encodings of the same letter, either decomposed or precomposed, such as SARA AM, and this may be an error in the canonical decomposition table, and could cause problems in applications that expect a unique canonical encoding for the same character)***

    4.19) The Tibetan script is more simple (subjoined letters are however sorted after the corresponding normal letter and not in their codepoint value). ***However it also uses multiple source encodings for the same vowel signs, and this may be ommissions in the NFD decomposition tables, notably for long vowels ***

    4.20) The Myanmar script is sorted like Thai and Lao ***(with the same problem on some long vowels)***

    4.21) Khmer comes after using simple codepoint ordering.

    4.22) Mongolian use simple codepoint ordering on L1 ***(except that TODO, SIBE, MANCHU, and ALI GALI versions of some vowels or consonnants are tailored after the standard corresponding letter, using L1 keys instead of being considered as L2 or L3 variants).***

    4.23) Cherokee syllables are then sorted simply by codepoints (that logically use a consistant ordering of their final vowel, and grouping characters by similar soft and hard similar leading consonnants with same final vowel)

    4.24) Canadian Syllabics is quite similar to Cherokee with an already consistant ordering of codepoints ***In my opinion the R-Cree, West-Cree, Y-Cree, N-Cree, Naskapi, Athapascan, Carrier or Sayisi or medial variants that sort just after the corresponding standard syllable should use a L2 distinction, similar to case distinction, and not a L1 distinction. Some other codepoints, separated in the allocation space should also be grouped such as RE or THE syllables variants***.

    4.25) Ogham letters are simply ordered by codepoint.

    4.26) Runic uses precomputed tailoiring and does not follow strict codepoint ordering

    4.27) Hangul is computable algorithmically for syllables by decomposition, and compatibility letters are sorted from their NFKD equivalent using L3 differences.

    4.28) Hiragana and Katakana are mixed in the collation table by L2 equivalence (using L3 differences). Simple derivation from NFKD removes halfwidth or circled variants by creating L3 differences or by creating equivalent sequences for squared words. However small variants sorting before the standard form with L3 difference does not require tailoring, unlike grouping Katakana with Hiragana with only a L3 difference. Apart from these L3 differences, there's no L2 difference. Softened/Voiced consonnants are sorted by appending the collation element for the voice mark diacritic to the "similar" hard consonnant (Hira/Kata differences are more influent)

    4.29) Bopomofo is mostly sorted with L1 differences from codepoints order (only final versions use L3 difference and don't follow the codepoint order) with some tailored exceptions like LETTER NGG.

    4.30) New standardized scripts for Old Italic and Gothic are sorted by codepoints

    4.31) Deseret is just rearranged to sort capital letters with their base small letter with L3 difference, and fixed L2 value.

    This terminates the first contiguous range of non null L1 keys. Then follows:

    5) CJK ideographs, CJK compatibility ideographs and CJK radicals are allocated with implicit weights, using pairs of collation keys, the first one allocated at [L1=FB40+high bits, with implicit L2=20, L3=2], and the second at [L1=8000+low bits, L2=L3=0]. Compatibility characters are simply remapped to their canonical equivalent before collating. All elements in this area can be simply computed though NFD/NFKD decomposition with implicit weights (some of these include codepoints allocated in the extension ideographs, but are decomposable to basic CJK ideographs). ***So all this large table can be removed. So codepoints do not seem to have any tailored sorting and use a default weight from codepoint values.***

    6) A second range for extension CJK ideographs is similar except that it uses L1=FB80+high bits for the first collation key in the generated pair. ***Here also the table could be avoided as only canonically decomposable characters are present.***

    ====

    What I want to demonstrate here is that the large default UCA table contains a lot of dedundance and does not help creating a good implementation. A simplified version should be designed and documented in the UCA algorithm reference, so that the existing table could be automatically derived.

    As it is, there are only three cases:
    a) fully ignorable, and optionally fully ignorale codepoints that could be deduced from character properties L1=L2=L3=0, found in UCD.
    b) combining characters, and pseudo-diacritics for script overrides: L1=0, L2 in [00..1F], L3=02 (other values of L3 can be generated automatically from compatibility equivalences found in UCD)
    c) all other scripts, ordered individually with L1>=0201, L2=20 (other values can be generated from case properties), L3=02 (other L3 values generated automatically from compatibility euivalences found in UCD).

    I do think that a new much shorter table format should be designed which lists only ordering of non decomposable characters per script, and removing all characters which have same case folding as their base character. The interest of this approach would be less works by Unicode to update this table, and easier generation of binary weights in applications (notably if there are user or language preferences when rearranging the relative scripts, instead of using the default UCA ordering of scripts, which roughly imitates the ordering of Unicode blocks).

    The implementation of UCA collation should reuse the work already performed to implement NFD and NFKD decomposition, (possibly by adding a few pseudo-decompositions to the standard NFKD decompositions to handle the case UCA decompositions of digits), and a few character properties such as existing and standardized case folding tables.

    So the standard UCA table should not override the above standardized tables, but should only list some "tailored" L1 weights in each script (needed only because codepoints of base characters do not match the common UCA ordering). I can see that the allkeys.txt was manually edited, and sometimes includes probable errors. The only interesting part of the default UCA collation table (DUCET) is the primary (L1) ordering of base characters within each script.

    The current complication of the existing "allkeys.txt" table may explain why the updated UCA table could not be prepared when Unicode 4 was released, and it probably demonstrates that the UCA collation algorithm is still not complete, and it should not be considered as a full standard as it is.

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Thu May 15 2003 - 06:21:48 EDT