From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon May 19 2003 - 20:14:15 EDT
This is a very long document; I only have a few brief comments.
1. Much of the UCA is derivable from a simpler set of orderings and
rules. The format of the table, however, is intended to make it usable
without having a complex set of rules for derivation.
2. However, you or anyone else could make a modified version which was
simplified in one way or another. For example, ICU preprocesses the
table to reduce the size of sort keys (see the ICU design docs if you
are curious: oss.software.ibm.com/icu/). There are other ways that
someone could preprocess the table: for example, you could drop all
those characters whose weights are computable from their NFKD form,
and then compute them at runtime.
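That last idea can be sketched in a few lines of Python (a hypothetical helper, not ICU code), using the standard unicodedata module: a character whose NFKD form differs from itself could, in principle, have its weights recomputed at runtime from the weights of its decomposition.

```python
import unicodedata

def nfkd_derivable(ch):
    """Return the NFKD decomposition of ch if it differs from ch itself,
    i.e. the character's collation weights could in principle be computed
    at runtime from the weights of its decomposition."""
    decomp = unicodedata.normalize("NFKD", ch)
    return decomp if decomp != ch else None

# U+2460 CIRCLED DIGIT ONE decomposes to "1": its table entry could be
# dropped and recomputed from the weights of "1" plus an L3 variant.
print(nfkd_derivable("\u2460"))  # -> 1
# U+0041 LATIN CAPITAL LETTER A has no decomposition: it must stay.
print(nfkd_derivable("A"))       # -> None
```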
3. Scattered in and among your analysis are points where you believe
there is an error. I'd like to emphasize again that the UTC does not
consider arbitrary email on the mailing lists on its agenda. If there
are items that you would like to see considered, you can extract them
(and their justification) from this document, and use the feedback
mechanism on the Unicode site to submit them.
Märk Davis
________
mark.davis@jtcsv.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
----- Original Message -----
From: "Philippe Verdy" <verdy_p@wanadoo.fr>
To: <unicode@unicode.org>
Sent: Thursday, May 15, 2003 02:36
Subject: Computing default UCA collation tables
> After studying how the default UCA collation keys are assigned in
the existing (3.1.1) allkeys.txt file, I have concluded that this huge
table (used as a base for all further tailorings) can really be
simplified a lot (note that the 4th additional key contains the code
of the base character and can always be dropped and generated
automatically):
>
> Note: below, my comments or suggestions for improvement are shown
within *** pairs.
>
> 1) First there's a set of fully ignorable characters, with no
specific order (the 4th key is always 0): this designates all
characters that can be safely deleted from the input string for
comparison purposes. They could be represented by just listing the
characters or some general character classes like Cc for control
characters in ISO 646 (0-1F, 7F) or ISO 6429 (80-9F), and format
characters like ZeroWidthSpace and Bidi Controls like
LeftToRightOverride (200B-200F, FEFF), Shaping controls (206A-206F),
Interlinear Annotations (FFF9-FFFC), LanguageTags (E0001,
E0020-E007F), and very few script-specific formatters in Syriac and
Mongolian (70F, 180B-180E). These can be handled at run-time in UCA
applications by exceptions to general table lookups for each script
(using the small list of Unicode allocation blocks to determine the
behavior).
> All of them use [.0.0.0.0], and could be represented simply by a
single [.00] byte in collation keys for strings (other bytes implied).
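A sketch of how such fully ignorable characters contribute nothing to a sort key: they are simply deleted before weights are collected. The ranges below are only the sample listed above, not the full ignorable set from allkeys.txt.

```python
# Illustrative ignorable ranges taken from the list above; not exhaustive.
IGNORABLE_RANGES = [
    (0x0000, 0x001F), (0x007F, 0x009F),   # C0/C1 control characters
    (0x200B, 0x200F), (0xFEFF, 0xFEFF),   # ZWSP, bidi controls, BOM
    (0x206A, 0x206F),                     # shaping controls
    (0xFFF9, 0xFFFC),                     # interlinear annotation
]

def strip_ignorables(s):
    """Delete fully ignorable characters before comparison, which is
    equivalent to giving each of them an all-zero collation element."""
    return "".join(
        ch for ch in s
        if not any(lo <= ord(ch) <= hi for lo, hi in IGNORABLE_RANGES)
    )

print(strip_ignorables("a\u200bb"))  # -> ab
```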
>
> 2) Then there is a set of optionally ignorable characters (which
are ignored at level 3 or below), for scripts in this order: Cyrillic,
Hebrew, Arabic, Thai, Tibetan, Musical combining symbols. Note that
this order also matches the default order, and corresponds to extended
versions of these scripts, with characters that are not needed (most
often) to recognize words (such as Hebrew points).
> All of them use key [.0.0.0.UUUUU] where UUUUU is the original
character. Here also the key is implicit, and can be represented by a
single [.00] byte in collation keys for strings (other bytes implied).
>
> Note that for all the other collation keys discussed below, the L4
key is always the original (not canonically decomposed) codepoint,
even if it's a compatibility character with a singleton canonical
decomposition. So we do not need to store the L4 key, which can be
computed algorithmically by a simple identity function.
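A minimal illustration of that identity rule (the tuple layout here is my own, not the allkeys.txt syntax): only (L1, L2, L3) triples need be stored, and L4 is regenerated from the original codepoint.

```python
def with_l4(elements, original_cp):
    """Given the stored (L1, L2, L3) weights for a character, regenerate
    the fourth-level weight as the original code point itself, so L4
    never needs to be stored in the table."""
    return [(l1, l2, l3, original_cp) for (l1, l2, l3) in elements]

# A stored entry for U+0041 "A" (illustrative weights):
print(with_l4([(0x06D9, 0x20, 0x08)], 0x0041))
# -> [(1753, 32, 8, 65)]
```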
>
> 3) Then there are all the combining characters. They are ordered by
script (but I think ***they should first be ordered by combining class
to also match directly the NFD order, or grouped matching when sorting
NFC or NFKC strings***). All their L1 keys are 0, and these characters
are arranged in each category mainly by the L2 key, where the L3 key
is almost always 02, and L4 is inferred implicitly as indicated above.
>
> 3.1) General combining diacritics for Latin/Cyrillic/Greek in block
U+03xx. Here also, note that a few are canonically decomposable or
equivalent, and in this case the NFD equivalent indicates the real
characters for which the collation keys are inferred. These are marked
with a QQC or QQK field in the comment line, and could be removed from
the table as they are computable. Matching exactly the existing UCA
order requires writing a few tailoring rules for the beginning of this
set: for example, the Combining LowLine, CommaAbove, and
ReversedCommaAbove are given the lowest L2 priority, even though they
have a higher combining class than the following diacritics (probably
because their language usage carries very little importance, or they
are considered pedantic, their use being almost optional in modern
written language without even changing the intonation, or they are
defined for completeness but actually not used in existing
standardized languages). Other inversions are the solidus overlay (I
don't know why), the Cedilla and Ogonek, which come before the Macron,
and a mix of diacritics starting at CombiningXAbove.
>
> 3.2) Then there are "ligature halves" for diacritics that should be
displayed on the grapheme cluster of the last two characters
>
> 3.3) Then there are script-specific diacritics, ordered by their
leading compatibility decomposition: Cyrillic, Hebrew points, Arabic
(many compatibility equivalents deduced from NFKD are inserted for
initial, medial, final, or ligated forms, and when this occurs the L3
is modified from the default 02 value to the compatibility value
described in the UCA technical report), Syriac, Brahmic diacritics
(Devanagari to Lao: nukta, candrabindu, anusvara, visarga), Tibetan,
Myanmar (similar to other Indic scripts), Khmer, and
Ideographic/Hangul/Kana tone marks. The Unicode codepoint of these
characters determines their order in the L1 key. No tailoring needed;
L3 is always 02.
>
> 3.4) Then there are the combining technical characters (in U+20D0 to
U+20E3). The L3 key is implied (02).
>
> Then come the non-ignorable characters. The NFKD decomposition
drives the ordering of collation keys, which are assigned with a
non-zero L1 key starting at [.0201.], and grouped in distinct ranges.
If strings are already decomposed to NFKD, then combining diacritics
are separated and handled by the previous rules, and there remain only
base characters, which are distinct by their L1 key; L2 is always
[..20.] and L3 is always [..02.] unless this is a compatibility
character, which is assigned an L3 key higher than [..02.] according
to the UCA technical report for each class (this means that only the
L1 key needs to be stored for canonical characters without
compatibility decomposition for the UCA table, as the other collation
values are implied).
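A sketch of what "only the L1 key needs to be stored" could look like in a compacted table: an entry is either a bare L1 weight (defaults implied) or an (L1, L3) pair for a compatibility character. The entry format and the default values 20/02 follow the description above; the example weights are made up.

```python
DEFAULT_L2 = 0x20
DEFAULT_L3 = 0x02

def expand(entry):
    """Expand a compact table entry into a full (L1, L2, L3) element.
    An entry is either a bare L1 weight (defaults implied) or an
    (L1, L3) tuple that overrides L3 for a compatibility character."""
    if isinstance(entry, int):
        return (entry, DEFAULT_L2, DEFAULT_L3)
    l1, l3 = entry
    return (l1, DEFAULT_L2, l3)

print(expand(0x0201))          # -> (513, 32, 2)
print(expand((0x0201, 0x0C)))  # compatibility variant, higher L3
```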
>
> 4) The first range covers all standard Unicode 3 scripts with the
exception of Han.
>
> 4.1) First the standard spacing controls (e.g. TAB or LF), ordered
by codepoint.
>
> 4.2) Then starts the subsection for variable collation keys. This
section is quite complex and should be handled by tailoring in a
lookup table:
>
> 4.2.1) First are space characters (and their compatibility
equivalent), all collated to the same L1 key, and then non-blank
characters that play the role of a spacing mark (Ogham and Arabic).
>
> 4.2.2) Then there are the spacing versions of diacritics, simply
ordered by codepoint (***I think that it's a shame that their order
does not match the order of the non-spacing combining diacritics, and
this creates confusion***)
>
> 4.2.3) Then come the soft hyphens or syllable boundary "markers"
(***it's a shame that some "soft" hyphens are placed after the real
hyphen-minus, which always has a visible glyph in any context***), and
then other hyphens or dashes, and KanaMiddleDots (all these are placed
in codepoint order, with a single tailoring for the generic
Soft-Hyphen placed at the beginning of this set)
>
> 4.2.4) Then comes generic in-sentence punctuation (and its
compatibility equivalents): comma, semicolon, colon, followed by very
script-specific equivalent punctuation for groups in the same
sentence (ordered by script, using the standard script order, then
ordered by codepoint)
>
> 4.2.5) Then come the sentence-terminating punctuation marks:
exclamation, question, full stop, leaders/ellipsis. I think that the
small full stop is not at the right place and should come before the
leaders, just after the fullwidth full stop. I also think that the
leaders should all have the same L3 key in the decomposition (not true
for the ellipsis, which is three leaders but is decomposed with a
distinct L3 key for the third one). Other script-specific full stops
follow, ordered by script.
>
> 4.2.6. Then come subsets of apostrophes and quotation marks. Some
tailoring is present here, with some inversions to make similar signs
closer (angle quotation marks are placed after straight and
comma-shaped quotation marks).
>
> 4.2.7. Then come parentheses (note that all compatibility-decomposable
parenthesized characters are ordered at the opening parenthesis,
according to their decomposition, so all these could be dropped from
the UCA table and computed), and their compatibility equivalents; the
same goes for other pairs: square brackets, curly brackets, similar
Tibetan and Ogham marks, square bracket with quill...
>
> 4.2.8. Then come general technical/typographical symbols: section
marks, commercial (copyright, registered, at), mathematical (asterisk,
solidus, ampersand, number sign, percent, permille, per ten thousand),
daggers and bullets, and primes (***I think that the prime characters
should be placed after the apostrophes in 4.2.6***), ditto, and
insertion marks (caret, reference mark, character tie...).
>
> 4.2.9. Then comes script-specific punctuation, in the common script
order then sorted by codepoint: Armenian, Hebrew, Syriac, Mongolian,
Brahmic (Devanagari to Thai), Tibetan, Myanmar, Khmer. (*** I think
that some of them, such as Armenian apostrophe, should be reordered
after some general punctuation groups above in 4.2.4 to 4.2.8 ***)
>
> 4.2.10) Then come some problematic sections for modifier letters
and some signs: this is where prime is defined as a modifier (***but
shouldn't it be ordered with the technical primes in 4.2.8, which
themselves should sort after the apostrophes in 4.2.6?***), other
modifier letters (***which are very close to the spacing diacritics in
4.2.2 above and would be better placed there for collation
purposes***), and the degree sign (***which would be better sorted
just after the spacing ring in 4.2.2***). Tailoring may correct this,
but the UCA table should handle this by default.
>
> 4.2.11) Then come some script-specific technical signs or marks
(not punctuation), simply ordered by codepoints, starting at the
Cyrillic thousands sign (most of them are Tibetan, but there is one
each for Arabic, Bengali, Oriya, Thai, and Canadian Syllabics).
>
> 4.2.12) Then come all the arrow symbols (a few are canonically or
compatibly decomposable and are marked with a QQC or QQK field in the
comment, the canonical decompositions using a combining solidus
overlay)
>
> 4.2.13) Then come all the mathematical symbols. (***Some of them
should be reordered with Greek characters, or with characters with
non-mathematical semantics but an identical script name, to match them
with usage, notably for searches where many confusions are
possible***). The current ordering is strange: look for example at the
classical mathematical operators (plus sign, division sign,
multiplication sign), not grouped with minus sign, division slash,
divides. Some of them will definitely match poorly in searches, such
as tilde, angle, union, and the remaining ones should be grouped by
function (identical to, equivalent, ...). Some operators also exist in
circled variants that should be grouped with their non-circled
variants, as for circled digits and letters. Specific tailoring is
complex, and UCA should evolve to map them more logically. For now,
this block just uses the codepoint ordering.
>
> 4.2.14) Then come technical symbols, ordered by codepoints. ***Some
of them may be "unifiable" by tailoring with other generic characters
for search purposes (for example, the diameter symbol may be
decomposable, non-compatibly, as a circle and a solidus overlay, and
some APL functional symbols could be ordered with Greek letters or
mathematical symbols, or decomposed non-compatibly, such as the
symbols with dieresis).***
>
> 4.2.15) Then come the symbols for control characters (they are not
ignorable, unlike their control counterparts, so it's best to keep
them ordered as they are, by codepoint order), OCR characters, box
drawing characters, block and geometric characters, dingbats and
wingdings (***the placement of asterisk-like and bullet-like symbols,
or ornamented punctuation and arrows, could be discussed***), Braille
patterns (assuming that they cannot be confused with punctuation in
any context; they nevertheless represent characters according to
locale-specific transliteration rules, and they would never be
searched as is, so sorting them does not seem relevant, and it's best
to keep them sorted as is by their pattern codepoint), Byzantine
musical symbols, and Venetian/Western musical symbols (***some
dingbats above could be rearranged into this subrange***).
>
> 4.2.16) Then come Asian symbols: the ideographic description
characters could be used to create decomposed ideographs by radical,
within a Han rendering engine, to limit the size of the required
fonts. They are followed by some CJK symbols (***the ordering of the
ideographic variation indicator is questionable***)
>
> 4.2.17) Here are the character and object replacement characters
(***I don't think this is the right place to put them, as they
probably should be at end of the "variable" collation elements before
modifier letters***)
>
> 4.2.18) The last part in variable collation elements is for
international non-base10 numbers or digits (***shouldn't they sort
after other base10 digits in the non variable section?***)
>
> This terminates the variable block, which may still need some
rework to have more "similar" characters tailored to match usage,
notably for punctuation. There's no gap here before the non-variable
section.
>
> 4.3) Then come modifier letters for length or repeat marks (***maybe
they should sort in the variable block like the other spacing
diacritics?***)
>
> 4.4) Then comes all currency signs.
>
> 4.5) Then letter-like symbols (***shouldn't they sort with their
corresponding letters?***)
>
> 4.6) Then non-base10 Roman numerals (***why don't they sort like
the characters in section 4.2.18?***)
>
> 4.7) Then music signs (***why don't they sort with the musical
characters in section 4.2.15?***)
>
> 4.8) Then come the base10 digits (composite compatibility characters
are sorted with their decomposed digit value, but with a distinct L3
value). They are not ordered by script, as they all use the same L1
key, but they are decomposed as if there were a diacritic after the
ASCII decimal digit, in a supplementary NFKD-like decomposition. This
pseudo-diacritic simply acts as a script modifier, one for each
script, allocated in the standard script order after those declared in
section 3 above with a zero L1 key, a new L2 key value per script,
and a default (implied) L3 key value 02. (This decomposition is quite
similar to the annotations used in some scripts like Greek, which
combine a letter and a numeric modifier diacritic to create
traditional numbers, except that the Unicode UCD does not define such
a pseudo-diacritic for script transliteration in scripts that have
distinct characters for digits, and so the UCD cannot use it in NFD or
NFKD decompositions). Such a pseudo-diacritic is defined in UCA
decompositions for Arabo-Indic, Extended Arabo-Indic, the Brahmic
scripts (from Devanagari to Thai and Lao), Tibetan, Myanmar, Ethiopic,
Khmer, Mongolian, and Hangzhou (traditional Chinese written dialect).
> In this section, ideographic symbols for months, days, and hours are
decomposed as per their NFKD value (which includes one or two digits
and an ideographic character), and sorted by their first digit, only
changing the L3 key of the digits to 04, and appending a Han collator
for the ideographic character for month/day/hour (split in two parts,
one collating at [L1=FB40+high bits, L2=default 20, L3=max 1F], the
second collating at [8000+low bits, L2=0, L3=0]); other digit variants
are also added using the L3 key described in the table of the UCA
reference, as well as fractions decomposed the standard way like other
NFKD decompositions. So this large set of UCA entries can be reduced
to just 10, for the base digits, with the help of NFKD and
script-specific pseudo-decomposition.
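The digit idea above could be sketched like this. The L2 script weights and the 0x0F00 digit base below are made-up values for illustration only; the real weights would come from the DUCET script order.

```python
import unicodedata

# Hypothetical L2 "pseudo-diacritic" weights, one per digit script,
# following the idea above; real values would come from the DUCET.
SCRIPT_L2 = {"ARABIC-INDIC": 0x21, "DEVANAGARI": 0x22, "THAI": 0x23}

def digit_elements(ch):
    """Sketch: collate a script-specific digit as the ASCII digit's L1
    weight plus a script 'pseudo-diacritic' carried on the L2 level."""
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    for script, l2 in SCRIPT_L2.items():
        if script in unicodedata.name(ch, ""):
            return [(0x0F00 + value, l2, 0x02)]   # 0x0F00: made-up base
    return [(0x0F00 + value, 0x20, 0x02)]         # plain ASCII digit

# THAI DIGIT FIVE sorts with "5" at L1 but carries a Thai marker at L2.
print(digit_elements("\u0e55"))
print(digit_elements("5"))
```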
>
> 4.9) Then come all the Latin letters (accented versions are also
decomposed with their NFD, then NFKD, only when creating UCA collation
keys), from A to Z. All font variants (such as mathematical) collate
to the same distinct L3 value, but share the same L1/L2 keys as the
base letters. Capital versions just change the L2 key from the base
small letter.
> Some letters are inserted in the list to collate specially at some
place (for example, the ligated "ae" letter is placed between "a" and
"b", "dz" is placed between "d" and "e", "ij" is placed between "i"
and "j", German sharp s is treated like a pair of s, as if all those
were ligatures, and Eth is somewhere after "dz" but before "e"), but
***there are some places to tailor these extended Latin letters (for
example the Latin alpha, or turned A or turned alpha)***.
> Some square symbols used in ideographic scripts to represent
international units are decomposed to NFKD before generating UCA
collation keys, as are symbols like "c/o", and Roman numerals
independently of their numeric value (but why only Roman numerals 1 to
9, and not higher ones?).
> ***In my opinion, this order should also interleave transliterated
Greek and Cyrillic characters, notably those that are highly similar,
like A and Alpha; see for example such usage in Serbian, which can be
written either with the Latin or the Cyrillic script. The UCA would
mostly match the Latin sort order; matching the traditional Cyrillic
or Greek sort order is a question of language, so of application-level
tailoring, which should then be done in sync for Latin.***
>
> 4.10) Then come the Greek letters, in traditional Greek sort order
as per the basic Greek codepoints (diacritics decomposed, and case
marked in the L2 key), plus the Coptic letters. ***See my comment on
this in section 4.9 (for example, look how Latin alpha is sorted: why
shouldn't it sort like Greek alpha, and not only as a Latin A?). APL
functional symbols should also be sorted there (see my comment in
4.2.14).*** Here also, their mathematical font variants are sorted
along with the corresponding base Greek letter from the NFKD
decomposition.
>
> 4.11) Then come the Cyrillic letters, in traditional Cyrillic sort
order as per the basic Cyrillic codepoints (diacritics decomposed, and
case marked in the L2 key). ***See my comment on this in section 4.9
(for example, look how Serbian sorts in both the Latin and Cyrillic
scripts).*** Some extended letters at the end of the basic Cyrillic
alphabet don't have a clear sort order (so tailoring may be needed in
applications).
>
> 4.12) Then comes Georgian, and then Armenian (but the case
difference is only an L3 variant, as if it were a diacritic).
>
> 4.13) Then comes Hebrew (no special problem here, as Hebrew points
are decomposed canonically, ***however some lines are commented out,
disabling the canonical decomposition of dagesh diacritics on
compatibility final forms***), and then Arabic (and its many
compatibility forms or ligatures). Here also, most lines could be
dropped from the table by simply using the NFD and NFKD
decompositions.
>
> 4.14) Then comes Syriac (note that some letters are decomposed with
the Garshuni diacritic, which is still not defined in Unicode as a
separate diacritic with a codepoint, so the decomposition appears only
in the UCA), and then Thaana.
>
> 4.15) Ethiopic syllables (these could use a pseudo-decomposition as
pseudo-consonant + pseudo-vowel sign, similar to the Brahmic scripts
with an implied A)
>
> 4.16) The Brahmic scripts are sorted individually, ***even though
they have a common structure and could be combined (using official
transliteration rules), at least Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam***
>
> 4.17) Then comes the more complex Sinhala alphabet. ***However, some
lines in the UCA table decompose sequences that represent the same
non-decomposed letter with identical collation; some are commented
out, some aren't. I just wonder whether this is an error or a
temporary edit in the file that was forgotten.***
>
> 4.18) Then comes Thai and Lao characters ***(with ambiguities coming
from multiple encodings of the same letter, either decomposed or
precomposed, such as SARA AM, and this may be an error in the
canonical decomposition table, and could cause problems in
applications that expect a unique canonical encoding for the same
character)***
>
> 4.19) The Tibetan script is simpler (subjoined letters are however
sorted after the corresponding normal letter and not by their
codepoint value). ***However, it also uses multiple source encodings
for the same vowel signs, and these may be omissions in the NFD
decomposition tables, notably for long vowels.***
>
> 4.20) The Myanmar script is sorted like Thai and Lao ***(with the
same problem on some long vowels)***
>
> 4.21) Khmer comes after using simple codepoint ordering.
>
> 4.22) Mongolian uses simple codepoint ordering on L1 ***(except that
the TODO, SIBE, MANCHU, and ALI GALI versions of some vowels or
consonants are tailored after the standard corresponding letter, using
L1 keys instead of being considered as L2 or L3 variants).***
>
> 4.23) Cherokee syllables are then sorted simply by codepoints (which
logically use a consistent ordering of their final vowel, grouping
characters with similar soft and hard leading consonants and the same
final vowel)
>
> 4.24) Canadian Syllabics is quite similar to Cherokee, with an
already consistent ordering of codepoints. ***In my opinion, the
R-Cree, West-Cree, Y-Cree, N-Cree, Naskapi, Athapascan, Carrier,
Sayisi, or medial variants that sort just after the corresponding
standard syllable should use an L2 distinction, similar to the case
distinction, and not an L1 distinction. Some other codepoints,
separated in the allocation space, should also be grouped, such as the
RE or THE syllable variants.***
>
> 4.25) Ogham letters are simply ordered by codepoint.
>
> 4.26) Runic uses precomputed tailoring and does not follow strict
codepoint ordering.
>
> 4.27) Hangul is computable algorithmically for syllables by
decomposition, and compatibility letters are sorted from their NFKD
equivalent using L3 differences.
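The algorithmic treatment of Hangul syllables mentioned here rests on the standard arithmetic decomposition of the conjoining-jamo model; a sketch:

```python
# Constants from the Unicode Hangul syllable composition model.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

def decompose_hangul(cp):
    """Decompose a precomposed Hangul syllable into its jamo
    arithmetically, so per-syllable collation entries need not be
    stored: the jamo weights suffice."""
    s = cp - S_BASE
    if not (0 <= s < L_COUNT * N_COUNT):
        return None  # not a precomposed Hangul syllable
    l = L_BASE + s // N_COUNT                 # leading consonant
    v = V_BASE + (s % N_COUNT) // T_COUNT     # vowel
    t = T_BASE + s % T_COUNT                  # optional trailing consonant
    return [l, v] + ([t] if t != T_BASE else [])

# U+AC01 (GAG) -> leading G, vowel A, trailing G
print([hex(c) for c in decompose_hangul(0xAC01)])
# -> ['0x1100', '0x1161', '0x11a8']
```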
>
> 4.28) Hiragana and Katakana are mixed in the collation table by L2
equivalence (using L3 differences). Simple derivation from NFKD
removes halfwidth or circled variants by creating L3 differences, or
by creating equivalent sequences for squared words. However, small
variants sorting before the standard form with an L3 difference do not
require tailoring, unlike grouping Katakana with Hiragana with only an
L3 difference. Apart from these L3 differences, there's no L2
difference. Softened/voiced consonants are sorted by appending the
collation element for the voicing mark diacritic to the "similar" hard
consonant (Hira/Kata differences are more influential)
>
> 4.29) Bopomofo is mostly sorted with L1 differences in codepoint
order (only final versions use an L3 difference and don't follow the
codepoint order), with some tailored exceptions like LETTER NGG.
>
> 4.30) The newly standardized scripts Old Italic and Gothic are
sorted by codepoints.
>
> 4.31) Deseret is just rearranged to sort capital letters with their
base small letter, with an L3 difference and a fixed L2 value.
>
> This terminates the first contiguous range of non-null L1 keys.
Then follows:
>
> 5) CJK ideographs, CJK compatibility ideographs, and CJK radicals
are allocated with implicit weights, using pairs of collation keys,
the first one allocated at [L1=FB40+high bits, with implicit L2=20,
L3=2], and the second at [L1=8000+low bits, L2=L3=0]. Compatibility
characters are simply remapped to their canonical equivalent before
collating. All elements in this area can be simply computed through
NFD/NFKD decomposition with implicit weights (some of these include
codepoints allocated among the extension ideographs, but are
decomposable to basic CJK ideographs). ***So all of this large table
can be removed. These codepoints do not seem to have any tailored
sorting and use a default weight from the codepoint values.***
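The implicit-weight pair described here can be computed directly from the codepoint, with no table entry at all; a sketch following the high/low 15-bit split given above:

```python
def implicit_weights(cp, base=0xFB40):
    """Compute a UCA implicit-weight pair for a Han ideograph from its
    codepoint alone. base is 0xFB40 for the range described here; the
    extension-ideograph range below uses 0xFB80 instead."""
    aaaa = base + (cp >> 15)          # high bits folded into the base
    bbbb = (cp & 0x7FFF) | 0x8000     # low 15 bits, high bit set
    return [(aaaa, 0x20, 0x02), (bbbb, 0, 0)]

# U+4E00 (the ideograph for "one"):
print([hex(w[0]) for w in implicit_weights(0x4E00)])
# -> ['0xfb40', '0xce00']
```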
>
> 6) A second range for extension CJK ideographs is similar except
that it uses L1=FB80+high bits for the first collation key in the
generated pair. ***Here also the table could be avoided as only
canonically decomposable characters are present.***
>
> ====
>
> What I want to demonstrate here is that the large default UCA table
contains a lot of redundancy and does not help in creating a good
implementation. A simplified version should be designed and documented
in the UCA algorithm reference, so that the existing table could be
derived automatically.
>
> As it is, there are only three cases:
> a) fully ignorable, and optionally fully ignorable codepoints, which
could be deduced from character properties found in the UCD:
L1=L2=L3=0.
> b) combining characters, and pseudo-diacritics for script overrides:
L1=0, L2 in [00..1F], L3=02 (other values of L3 can be generated
automatically from compatibility equivalences found in the UCD)
> c) all other scripts, ordered individually with L1>=0201, L2=20
(other values can be generated from case properties), L3=02 (other L3
values generated automatically from compatibility equivalences found
in the UCD).
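The three cases could be told apart mechanically from the weights themselves; a trivial sketch:

```python
def classify(l1, l2, l3):
    """Classify a collation element into the three cases above."""
    if l1 == 0 and l2 == 0 and l3 == 0:
        return "a: fully ignorable"
    if l1 == 0:
        return "b: combining character / pseudo-diacritic"
    return "c: ordinary script character"

print(classify(0, 0, 0))             # case a
print(classify(0, 0x21, 0x02))       # case b
print(classify(0x0201, 0x20, 0x02))  # case c
```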
>
> I do think that a new, much shorter table format should be designed,
which lists only the ordering of non-decomposable characters per
script, and removes all characters which have the same case folding as
their base character. The interest of this approach would be less work
for Unicode to update this table, and easier generation of binary
weights in applications (notably if there are user or language
preferences for rearranging the relative order of scripts, instead of
using the default UCA ordering of scripts, which roughly imitates the
ordering of Unicode blocks).
>
> The implementation of UCA collation should reuse the work already
performed to implement NFD and NFKD decomposition (possibly by adding
a few pseudo-decompositions to the standard NFKD decompositions to
handle the special UCA decompositions of digits), and a few character
properties such as the existing and standardized case folding tables.
>
> So the standard UCA table should not override the above standardized
tables, but should only list some "tailored" L1 weights in each script
(needed only because codepoints of base characters do not match the
common UCA ordering). I can see that the allkeys.txt was manually
edited, and sometimes includes probable errors. The only interesting
part of the default UCA collation table (DUCET) is the primary (L1)
ordering of base characters within each script.
>
> The current complication of the existing "allkeys.txt" table may
explain why the updated UCA table could not be prepared when Unicode 4
was released, and it probably demonstrates that the UCA collation
algorithm is still not complete, and should not be considered a full
standard as it stands.
>
> -- Philippe.
>
>
This archive was generated by hypermail 2.1.5 : Wed May 21 2003 - 21:13:40 EDT