L2/00-141
Re: | Proposed Changes to UTR#10: Unicode Collation Algorithm |
Date: | 2000.04.24 |
By: | M. Davis, K. Whistler |
The following are changes that should be made to the text of UTR#10.
1. Fixes for version numbering for consistency with all other "commands"
Old |
New |
---|---|
<version> := <major>.<minor>.<variant> <eol> |
@<version> := <major>.<minor>.<variant> <eol> |
2. Fixes for contracting characters, add pointer to charts
Old |
New |
---|---|
The Default Unicode Collation Element Table is available in two parts: http://www.unicode.org/unicode/reports/tr10/basekeys.txt http://www.unicode.org/unicode/reports/tr10/compkeys.txt The file basekeys.txt lists all the Unicode characters that correspond to a single collation element. The file compkeys.txt lists all the Unicode characters that correspond to a sequence of collation elements. |
The Default Unicode Collation Element Table is available in three parts: http://www.unicode.org/unicode/reports/tr10/basekeys.txt http://www.unicode.org/unicode/reports/tr10/compkeys.txt http://www.unicode.org/unicode/reports/tr10/ctrckeys.txt
A visual representation of the complete table can be found at http://www.unicode.org/unicode/reports/tr10/charts |
1. There was a missing case in the formulation of Shifted, necessary to process the completely ignorables correctly.
Old |
New |
||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Shifted: Variable collation elements are set to ignorable at levels one
through three. In addition, a new final-level weight is appended, one that is
equal to the old primary weight. All non-variable collation elements get a
fourth-level weight equal to FFFF.
|
Shifted: Variable collation elements are
set to ignorable at levels one through three. In addition, a new
final-level weight is appended, whose value depends on the type:
|
2. To allow for POSIX-style positions:
- ShiftHigh: The same as ShiftLow, except that all non-variable collation elements get a fourth-level weight equal to 0001.
Blanked |
Non- |
ShiftLow |
ShiftHigh |
---|---|---|---|
death |
de luge |
death |
death |
1. Fix a typo: the text says "three" but there are actually only two bullets after it!
Old |
New |
---|---|
Step 1. Produce a normalized form of each input string, applying the following three steps: | Step 1. Produce a normalized form of each input string, applying the following two steps: |
2. In step 1.1, there is a degenerate case that needs to be caught to make the algorithm more complete. We could have multiple rearranging characters in a row, e.g. suppose that R and are each rearranging.
Old |
New |
---|---|
1.2. If any character is marked as rearranging (see §3.1.3 Rearrangement), swap it and the succeeding character. | 1.2. If any character is marked as rearranging
(see §3.1.3 Rearrangement), swap it and the
succeeding character (if there is one). In practice, rearranging
characters should never appear adjacent to one another. If for some reason
they do, then successive pairs in the sequence will be swapped.
|
1. Section 3.2 describes that weights are derived if they are not in the table, but this point should be added explicitly in the text of the algorithm in Step 2.4 to avoid confusion.
Old |
New |
---|---|
2.4. Fetch the corresponding collation element(s) from the table |
2.4. Fetch the corresponding collation element(s) from the table if there is a match. If there is no match, synthesize a weight as described in §7.1 Derived Collation. |
1. Add out-of-range values, make clear implementations can throw exceptions, and do minor editing on that text for clarity.
Old |
New |
---|---|
Non-characters (values of the form xxFFFE or yyFFFF) and unpaired surrogates are ignored: that is, they are mapped to [.0000.0000.0000.]. |
lllegal codepoints Certain codepoints are illegal in a data stream. These include non-characters (codepoints with hex values ending in FFFF or FFFE), unpaired surrogates (codepoints between D800 and DFFF), and out-of-range values (< 0 or > 10FFFF). Implementations may also choose to treat these as error conditions and respond appropriately, such as by throwing an exception. If they are not treated as an error condition, they must be mapped to [.0000.0000.0000.], and thus ignored. |
2. Fix the formulation of the paired code values. The problem is that you can't let the second primary weight have a primary weight of 0000! So you have to use the bottom 15 bits instead of 16 to make sure you can turn a bit on to avoid 0000. Also added some more explanation, and changed from the minimal values to simply 0001 (also avoids the zero problem).
Old |
New |
---|---|
Unassigned code points (including surrogate pairs) are mapped to a sequence of two collation elements based on their Unicode code point value. (These will include private use characters that are not assigned in the implementation environment.) These code points can range from 0 to 10FFFF16, and are given first collation elements with a range of 17 primary values, and a second collation element with a primary consisting of the remaining bits in the code value: 00XXXX => [.FFE0.0020.0002.][.XXXX.0000.0000.] ... 10XXXX => [.FFF0.0020.0002.][.XXXX.0000.0000.]
|
Legal codepoints A legal codepoint that is not explicitly mentioned in the table is mapped a sequence of two collation elements in the following way. The result of this process are collation elements that are sorted in codepoint order, that do not collide with any explicit values in the table, and that can be placed anywhere (e.g. at UNDEFINED) with respect to the explicit collation element mappings (by default, they go after all explicit collation elements). To derive the collation elements, the codepoint is separated into two parts, chosen for the right numerical properties. First, separate off the top 6 bits of the codepoint. Since codepoints can go from 0 to 10FFFF, this will have values from 0 to 2116 (= 3310). Add this to the UNDEFINED. AAAA = UNDEFINED + (CP >> 15); Now take the bottom 15 bits of the code point. Turn the top bit on, so that the value is non-zero. BBBB = (CP & 0x7FFF) | 0x8000; The mapping given to CP is then given by: CP => [.AAAA.0001.0001.][.BBBB.0000.0000.] If a fourth or higher weights are used, then the same pattern is used: they are set to 0001, etc. in the first collation element and 0000 in the second. (Since all distinct codepoints have different AAAA/BBBB combination, the exact values don't matter as long as they are non-zero in the first collation element.) The value for UNDEFINED for the Default Default Unicode Collation Element Table is FF80. This is larger than any explicit primary weight, and thus will not collide with explicit weights. It is not generally necessary to tailor this value to be within the range of explicit weights. However if this is done, the explicit primary weights must be shifted so that none are between UNDEFINED and UNDEFINED + 34. |
3. Right now there is a mismatch between Hangul Syllables and Jamo. This text fixes that:
Old |
New |
---|---|
CJK Ideographs and Hangul Syllables are not explicitly mentioned in the default table, and are mapped to collation elements that are derived from their Unicode code point value. The primary weight is equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively. |
CJK Ideographs and Hangul Syllables are not explicitly mentioned in the default table. CJK ideographs are mapped to collation elements that are derived from their Unicode code point value. The primary weight is equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively. All other levels, if any, are 0001. (Since all distinct codepoints have different primary weights, the exact values don't matter as long as they are non-zero.) The collation algorithm requires that Hangul Syllables be decomposed. However, if the table is tailored so that the primary weights for Hangul Jamo (and all related characters) are adjusted, then the Hangul Syllables can be left precomposed and treated in the same way as CJK ideographs. That will provide a collation which is approximately the same as UCA, and may be sufficient in environments where individual jamo are not expected. The adjustment is to move each initial jamo (and related characters) to have a primary weights corresponding to the first syllables starting with that jamo, and make all non-initial jamo (and related characters) be ignorable at a primary level. |
There are a number of issues with the data that need to be fixed. Some of these are listed in tables in http://www.macchiato.com/unicode/CollationMismatch.html. The format of that table will be discussed in the meeting.
We plan to do #1 in the immediate future, but do the others all together later on.
The text may need additional changes for clarity beyond what is noted above. These would be approved by the editorial committee, as usual.