Revision | 2 |
Authors | Mark Davis and Ken Whistler |
Date | 1998-12-17 |
This Version | http://www.unicode.org/unicode/reports/tr10/tr10-2.html |
Previous Version | http://www.unicode.org/unicode/reports/tr10/tr10-1.html |
Latest Version | http://www.unicode.org/unicode/reports/tr10/ |
This report presents the specifications of the Unicode Collation Algorithm. It is divided into the following sections:
Status of this document
This document has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to Unicore@Unicode.org.
The Unicode Collation Algorithm describes how to compare two Unicode strings, using the data in a Collation Element Table, and consistent with the linguistic requirements of The Unicode Standard, Version 2.0, Section 5.15 Sorting and Searching. (Readers should be familiar with that section before proceeding.) It also supplies a Default Unicode Collation Element Table as the basis for collation.
The algorithm is designed to satisify the following goals.
Given the standard ordering and the tailoring for any particular country, any two companies or individuals--with their own proprietary implementations--could take any arbitrary Unicode input and produce exactly the same sorted output. In addition, when given a tailoring specifying French accents this algorithm passes the Canadian (and ISO DIS 14651) benchmark, as mentioned in the section on Processing French Accents.
The Default Unicode Collation Element Table explicitly does not provide for the following features:
Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be compared binary to give the correct comparison between the strings for which they were generated.
The Unicode Collation Algorithm assumes multiple-level key weighting, along the lines widely implemented in IBM technology, and as described in the Canadian sorting standard (CSA Z243.4) and the proposed International String Ordering standard (ISO/IEC 14651).
By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:
A fourth level is used for tie-breaking between strings not distinguished at the first three levels; by default this fourth level uses weights defined to be the normalized Unicode code value of a character.
This design allows implementations to produce culturally acceptable collation, while putting the least burden on implementations in terms of memory requirements and performance. In particular, Collation Element Tables only require storage of 32 bits per significant character.
However, implementations of the Unicode Collation Algorithm are not limited to supporting only 3 levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right Collation Element Tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first 3 levels and must have those specials collate in non-Unicode order (as, for example to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the Collation Element Table and in constructed sort keys.
A tailoring mechanism is described for adjusting the Collation Element Table to produce particular culturally required orderings for different languages or locales. This tailoring mechanism allows for full tailoring of the first three levels. As in the algorithm itself, the tailoring mechanism can be extended to provide full customization for four (or more) levels.
There are many different ways to compare strings, and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document.
The algorithm is a logical specification, designed to be straightforward to describe. Actual implementations of the algorithm are free to change any part of the algorithm so long as any two strings compared by the implementation are ordered the same as they would be by the algorithm. See Implementation Notes. The sort key is also a logical intermediate object: so long as an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified here.
The requirements for conformance on implementations of the Unicode Collation Algorithm are as follows:
C1 | A conformant implementation shall be able to replicate the same comparisons of Unicode strings as those implied by the algorithm as described in this document, using the Default Unicode Collation Element Table. In particular, a conformant implementation must be able to compare any two canonical equivalent strings as being equal. |
C2 | A conformant implementation shall support at least three levels of collation. A conformant implementation is only required to implement three levels. However, it may implement four (or more) levels if desired. |
C3 | A conformant implementation shall support backwards secondaries. Supporting backwards weights at other levels is optional. |
C4 | When supplied with a tailoring specification as described in this document (or its functional equivalent in another format), a conformant implementation shall be able to replicate the same comparisons of Unicode strings as those implied by the algorithm as described in this document, using the Default Unicode Collation Element Table and the specified tailoring. |
C5 | A conformant implementation shall be able to support the main two tailoring features: positioning entries and turning backward secondaries on or off. Other features of tailoring are optional, including:
A conformant implementation is encouraged to implement these optional features, but is not required to for conformance. |
A Collation Element Table contains a mapping from one (or more) characters to one (or more) collation elements. A collation element is an ordered list of three 16-bit weights. (Implementations can produce the same result without using 16-bit weights--see Implementation Notes.)
The first weight is called the Level 1 weight (or primary weight), the second is called the Level 2 weight (secondary weight), and the third is called the Level 3 weight (tertiary weight). For a collation element X, these can be abbreviated as X1, X2, and X3.Given two collation elements X and Y, we will use the following notation:
Notation |
Reading |
Meaning |
---|---|---|
X <1 Y | X is primary less than Y | X1 < Y1 |
X =1 Y | X is primary equal to Y | X1 = Y1 |
X <2 Y | X is secondary less than Y | X =1 Y and X2 < Y2 |
X =2 Y | X is secondary equal to Y | X =1 Y and X2 = Y2 |
X <3 Y | X is tertiary less than Y | X =2 Y and X3 < Y3 |
X =3 Y | X is tertiary equal to Y | X =2 Y and X3 = Y3 |
X <4 Y | X is quarternary less than Y | X =3 Y and X4 < Y4 |
X =4 Y | X is quarternary equal to Y | X =3 Y and X4 = Y4 |
The collation algorithm results in a similar ordering among characters and strings, so that for two strings A and B we can write A <2 B, meaning that A is less than B and there is only a secondary difference between them. If two strings have no primary, secondary or tertiary difference according to a given Collation Table, then we write A =3 B. If two strings are equivalent (equal at all levels) according to a given Collation Table, we write A = B. If they are bit-for-bit identical, we write A == B.
Where subscripts are not available, forms such as "X <1 Y" or "X1" can be written as "X <[1] Y" and "X[1]" respectively. |
If a weight is 0000, than that collation element is ignorable at that level: the weight at that level is not taken into account in sorting. A collation element that is ignorable at Level 1, but not at Level 2 is called a Level 1 ignorable; similarly, a Level 2 ignorable is ignorable at Levels 1 and 2, but not Level 3, and a Level 3 ignorable is ignorable at Levels 1, 2, and 3 but not Level 4. A collation element that is not ignorable at Level 1 is called a non-ignorable.
No collation element can have a zero weight at Level N and a non-zero weight at Level N-1. Any collation elements that violate this rule are called ill-formed. The reason for this will be explained below, under Step 4 of the main algorithm.
For a given Collation Element Table, MINn is the least weight in any collation element at level n, and MAXn is the maximum weight in any collation element at level n.
The following are sample collation elements that are used in the examples illustrating the algorithm. Unless otherwise noted, all weights are in hexadecimal format.
Character |
Collation Element |
Name |
---|---|---|
0300 "`" | [0000.0021.0002] | COMBINING GRAVE ACCENT |
0061 "a" | [06D9.0020.0002] | LATIN SMALL LETTER A |
0062 "b" | [06EE.0020.0002] | LATIN SMALL LETTER B |
0063 "c" | [0706.0020.0002] | LATIN SMALL LETTER C |
0043 "C" | [0706.0020.0008] | LATIN CAPITAL LETTER C |
0064 "d" | [0712.0020.0002] | LATIN SMALL LETTER D |
Linguistic requirements collation are covered in more detail in The Unicode Standard, Version 2.0, Section 5.15 Sorting and Searching.
The mapping from characters to collation elements may not be a simple mapping from one character to one collation element: in general, it may map from one to many, from many to one, or from many to many. For example:
The Latin letter æ is treated as an independent letter by default. Collations such as English, which may require treating it as equivalent to an <a e> sequence, can tailor the letter to map to the following collation elements, such as in the following example:
Character |
Collation Element |
Name |
---|---|---|
00E6 "æ" | [06D9.0020.0002], [073A.0020.0002] | LATIN SMALL LETTER AE |
In this example, the collation element [06D9.0020.0002] gives the weight values for a, and the collation element [073A.0020.0002] gives the weight values for e.
Similarly, where ch is treated as a single letter as in traditional Spanish, it is represented as a mapping from two characters to a single Collation Element, such as in the following example
Character |
Collation Element |
Name |
---|---|---|
0063 0068 "ch" |
[0707.0020.0002] | LATIN SMALL LETTER C, LATIN SMALL LETTER H |
In this example, the collation element [0707.0020.0002] has a primary value one greater than the primary value for the letter c by itself, so that the sequence ch will collate after c and before d. The above example shows the result of a tailoring of collation elements to weight sequences of letters as a single unit. The Default Unicode Collation Element Table does not contain any contractions.
A soft hyphen (SHY) between two characters of a contraction will cause them to sort as separate characters.
The Unicode Collation Algorithm handles surrogate pairs as contractions. It thus provides the same level of capability for those characters as for non-surrogates. |
Certain characters may both expand and contract: see page 5-31 of The Unicode Standard, Version 2.0.
In some languages (notably French), accents are sorted from the back of the string to the front of the string. This behavior is not marked in the Default Unicode Collation Element Table, but may occur in tailored tables. In such a case, the collation elements for the accents and their base characters are marked as being backwards at Level 2.
There are two other directions: forwards and neutral. Both backwards and forwards are known as strong directions, while neutral is a weak direction.
Certain characters are not coded in logical order. In Unicode 2.1, these are the Thai vowels 0E40 through 0E44 and the Lao vowels 0EC0 through 0EC4 (this list may be extended in the future, and so is included in the Default Unicode Collation Element Table). For collation, they are rearranged by swapped with the following character before further processing, since logically they belong afterwards. For example, here is a string processed by rearrangement:
input string: | 0E01 0E40 0E02 0E03 |
normalized string: | 0E01 0E02 0E40 0E03 |
Both in the Default Unicode Collation Element Table and in typical tailorings, most unaccented letters differ in the primary weights, but have secondary weights (such as a1) equal to MIN2. The primary ignorables will have secondary weights greater than MIN2. Characters that are compatibility or case variants will have equal primary and secondary weights (e.g. a2 = A2), but have different tertiary weights (e.g. a3 < A2). The unmarked characters will have tertiary weights (such as a3) equal to MIN3.
However, it is very important to note that there is no guarantee that weights are uniform across types of variation. It is perfectly possible after tailoring for a2 to be not equal to b2, and A3 to be not equal to B3.
The Default Unicode Collation Element Table is available in two parts:
http://www.unicode.org/unicode/reports/tr10/basekeys.txt
http://www.unicode.org/unicode/reports/tr10/compkeys.txt
The file basekeys.txt lists all the Unicode characters that correspond to a single collation element. The file compkeys.txt lists all the Unicode characters that correspond to a sequence of collation elements.
This table is constructed to be consistent with the Unicode Canonical Equivalence algorithm, and to respect the Unicode Character Properties. It is not, however, merely algorithmically derivable from those data, since the assignment of levels does take into account characteristics of particular scripts. For example, in general the combining marks are Level 1 ignorables; however, the Indic vowels are given non-zero Level 1 weights, since they are as significant in sorting as the consonants.
Any character may have variant forms or applied accents which affect collation. Thus, for FULL STOP there are three compatibility variants, a fullwidth form, a compatibility form, and a small form. These get different tertiary weights, accordingly. For more information on how the table was constructed, see Weight Derivation.
The default table contains no French accents or expansions. For many languages, some degree of tailoring is required to match user expectations.
Each of the files consists of a version line followed by a rearrangement line and a series of entries, all separated by newlines. A '%' and any following characters on a line are comments.
The version line is of the form:
<version> := <major>.<minor>.<variant>
The rearrangement line is of the following form. It specifies the characters that are to be rearranged in collation, as discussed above in §3.1.3 Rearrangement.
<rearrangement> := @rearrangeAdd <charList> <charList> := <item> ("," <S> <item> )* <item> := <char> | <char> "-" <char>
In general, this line is determined by the version of the Unicode Standard, since these are fixed character properties. For Unicode 2.1, it is the following:
@rearrangeAdd 0E40-0E44, 0EC0-0EC4
Each entry is a mapping from character(s) to collation element(s), and is of the following form:
<entry> := <chars> <S> <collElement>+ <S> <comment> <collElement> := "[" <var> <char> "." <char> "." <char> "." <char> "]" <var> := "*" | "." <decomp> := "CANONSEQ" | "COMPATSEQ" <comment> := "%" <S> <name> ( ";" <S> <decomp> )?
where "x+" indicates one or more x's, "x*" indicates zero or more x's, "x?" indicates zero or one x, <char> is a four-digit hexadecimal Unicode code value, and <S> is whitespace. (The <decomp> tags are informative: CANONSEQ indicates a canonical equivalence, and COMPATSEQ a compatibility equivalence.) Here are some selected entries:
0020 [*0209.0020.0002.0020] % SPACE 02DA [*0209.002B.0002.02DA] % RING ABOVE; COMPATSEQ 0041 [.06D9.0020.0008.0041] % LATIN CAPITAL LETTER A 3373 [.06D9.0020.0017.0041] [.08C0.0020.0017.0055] % SQUARE AU; COMPATSEQ 00C5 [.06D9.002B.0008.00C5] % LATIN CAPITAL LETTER A WITH RING ABOVE; CANONSEQ 212B [.06D9.002B.0008.212B] % ANGSTROM SIGN; CANONSEQ 0042 [.06EE.0020.0008.0042] % LATIN CAPITAL LETTER B 0043 [.0706.0020.0008.0043] % LATIN CAPITAL LETTER C 0106 [.0706.0022.0008.0106] % LATIN CAPITAL LETTER C WITH ACUTE; CANONSEQ 0044 [.0712.0020.0008.0044] % LATIN CAPITAL LETTER D
The entries in each file are ordered by collation element, not by character. This makes it easy to see the order in which characters would be collated.
Although this document describes collation elements as three levels, the file contains a fourth level (as in [.0712.0020.0008.0044]) which is identical with the character (or its decomposition, if there is one). For more information on that, see Avoiding Normalization under Implementation Notes.
Implementations can also add more customizable levels, as discussed above under conformance. For example, an implementation might want to be capable not only of handling the standard Unicode Collation, but also capable of emulating an EBCDIC multi-level ordering (having a fourth-level EBCDIC binary order).
The precise values of the collation elements for the characters may change over time as new characters are added to the Unicode Standard. If a reference to the precise values is required, the version number must be referenced. |
Some of the elements are marked with an asterisk, such as:
Character |
Collation Element |
Name |
---|---|---|
0020 " " | [*0209.0020.0002] | SPACE |
The asterisk indicates that the collation element can have two different values. By default, the stated weights are ignored, and it is interpreted as a Level 3 ignorable (e.g. [0000.0000.0000]). However, an API or tailoring may allow users to toggle a switch making these characters non-ignorable: resulting in simply [0209.0020.0002].
The main algorithm has four steps. First is to normalize each input string, second is to produce an array of collation elements for each string, and third is to produce a sort key for each string from the collation elements. Two sort keys can then be compared with a binary comparison; the result is the ordering for the original strings.
Produce a normalized form of each input string, applying the following three steps:
Example:
input string: | càb |
normalized string: | ca`b |
Step 1 is transforming the string into its canonical decomposition, as specified on page 3-7 of The Unicode Standard, Version 2.0. Note that as clarified in Unicode 2.1 (Unicode Technical Report #8), this includes Hangul Syllable decomposition. The canonical decomposition is also called Normalization Form D (Unicode Technical Report #15). |
If you are only handling a restricted subset of 10646/Unicode which does not include combining marks, you can omit Step 1.1. See the Implementation Notes. To allow for such omission, the Default Unicode
Collation Element Table does define values for characters (such as accented
characters or Hangul Syllables) which would otherwise be normalized to a
sequence of characters. For example,
|
First we define a combining mark in a string to be blocked if there is another combining mark of the same canonical combining class or zero between it and its base character.
The collation element array is built by sequencing through the normalized form as follows:
normalized string: | ca`b |
collation element array: | [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002] |
On some browsers the grave (`) will display as a single left quote. It should be read as a grave accent for the purposes of this document. |
The reason for considering the extra combining marks C is that otherwise irrelevant characters could interfere with matches in the table. For example, suppose that the contraction <a, combining_ring> (= å) is ordered after z. If a string consists of the three characters <a, combining_ring, combining_cedilla>, then the normalized form is <a, combining_cedilla, combining_ring>, which separates the a from the combining_ring. If we didn't have the step of considering the extra combining marks, this string would compare incorrectly as after a and not after z. If the desired ordering treats <a, combining_cedilla> as a contraction which should take precedence over <a, combining_ring>, then an additional mapping for the combination <a, combining_ring, combining_cedilla> can be introduced to produce this effect. |
For conformance to Unicode canonical equivalence, only unblocked combining marks are matched. For example, <a, combining_macron, combining_ring> would compare as after a-macron, and not after z. As in the previous note, additional mappings can be added to customize behavior. |
The sort key is formed by successively appending weights from the collation element array. The weights are appended from each level in turn, from 1 to 3. (Backwards secondary weights are inserted at a push_position instead of appended.) Next, the normalized Unicode string is appended, forming a 4th level.
An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation level. For example, if the maximum level is set to 2, then level 3 and higher weights (including the normalized Unicode string) are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison.
Here is a more detailed statement of the algorithm:
The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter (in the absence of special features like contractions). For example: |
Even if two different Unicode strings do not differ in level 1, 2 or 3 weights, the addition of the normalized string at the end will distinguish between them, unless they normalize to precisely the same string. This is how the normalized string defines a fourth level. |
The reason for having a push_position is to support French secondaries. Suppose that you have 6 secondary weights abcdefg, where c, d, and e are backwards. The weights are progressively added to the sort key as follows (where the dot represents the push_position).
Conformant implementations must support backwards collation elements at Level 2. Support at any other level is optional. |
collation element array: | [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002] |
sort key: | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062 |
Compare the sort keys for each of the imput strings, using a binary comparison. This means that:
String |
Sort Key |
---|---|
cab | 0706 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0063 0061 0062 |
Cab | 0706 06D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002 0000 0043 0061 0062 |
càb | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062 |
dab | 0712 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0064 0061 0062 |
In this example, "cab" <3 "Cab" <2 "càb" <1 "dab". The differences that produce the ordering are shown by the bold underlined items:
At this point we can explain the reason for disallowing ill-formed weights. If ill-formed weights were allowed, the ordering of elements can be incorrectly reflected in the sort key. For example, suppose the secondary weights of the Latin characters were zero (ignorable) and that (as normal) the primary weights of case-variants are equal: that is, a1 = A1. Then the following incorrect keys would be generated:
Since the secondary weights for a, A, and e are lost in forming the sort key, the relative order of the acute is also lost, resulting in an incorrect ordering based solely on the case of A vs a. With well-formed weights, this does not happen, and you get the following correct ordering:
However, there are circumstances--typically in expansions--where higher-level weights in collation elements can be zeroed (resulting in ill-formed collation elements) without consequence (see Implementation Notes). Implementations are free to do this as long as they produce conformant results. |
This section describes the mechanism for tailoring the default ordering for a particular language. A tailoring provides a specification for customizing the default ordering by:
Every conformant implementation must support (1) reordering and (2) setting secondaries backwards. The other tailoring options are optional. Implementations are strongly encouraged to offer them, but are not required to for conformance.
The following specifies the format for tailoring files. A tailoring file consists of version line followed by a list of entries, all separated by newlines. The ordering among the list of entries may be significant. Extra whitespace is not significant. A "%" character indicates a comment.
The version has the following format:
<version> := <major>.<minor>.<variant>
Each entry can have several forms, based on the following:
<entry> := <rearrangement> | <direction> | <variable> | <positioning>
There are three forms of entries, described below. Each entry has a cumulative effect on the collation table. Thus where there are conflicts between any entries, the last one modifies the cumulative state so far, overriding earlier entries.
A character range is used to specify a set of characters, either directly or through use of a Unicode category. The category is a two-letter abbreviation for types of Unicode characters (such as Lu = Uppercase Letter) as described in the Unicode Character Database readme.txt file. Note that the element <chars> usually represent a contraction.
<characterRange> := '[' <item> ("," <item>)* ']' <item> := <chars> | <range> | <category> <range> := <char> '-' <char> <category> := '{' <unicode category> '}'
For example, here are two equivalent character ranges that include the ASCII characters, plus á, and Dash Punctuation:
[0020-007E, 00E1, {Pd}]
[0020-007E, 0041 030A, {Pd}]
Commas are significant: [0063, 0068] is a set with two values, c and h, while [0063 0068] is a set with one value, the contraction ch.
In all of the tailoring specification, any canonical decomposible character specification is handled by replacing it by its canonical decomposition. This decomposition is then treated as a contraction in processing. For example, "00E1" (á) is expanded to "0041 030A" (a,´). |
Each rearrangement entry adds to or removes characters from the list of rearranging characters.
<rearrangement> := <rearrangeAdd> | <rearrangeRemove> <rearrangeAdd> := '@rearrangeAdd ' <characterRange> <rearrangeRemove> := '@rearrangeRemove ' <characterRange>
The direction setting determines whether characters are to be appended to the sort key or inserted at the push_position.The default setting is OFF.
The direction lists restrict the list of items which are subject to the direction setting. If no list options appear, or if the implementation does not support direction lists, then secondary weights for all combining marks are set to be backwards.
Each backward list adds to or removes characters from the list of backward characters; each forward list adds to or removes characters from the list of forward characters. Any characters that are not in either the backwards or forwards lists are neutral. (Implementations may chose to offer direction entries for more than just secondaries--in that case a level number is appended to each of the following keywords.)
<direction> := <directionSetting> | <directionList> <directionSetting> := '@backwardOn' | '@backwardOff <directionList> := <backwardAdd> | <backwardRemove> | <forwardAdd> | <forwardRemove> <backwardAdd> := '@backwardAdd ' <characterRange> <backwardRemove> := '@backwardRemove ' <characterRange> <forwardAdd> := '@forwardAdd ' <characterRange> <forwardRemove> := '@forwardRemove ' <characterRange>
For example, here are lines that turn on backwards direction, and set the affected characters to be the French accents:
@backwardOn @backwardAdd [0300-0302, 0308] %French accents
It is important that all significant characters used by a backwards-secondary language be marked as either backwards or neutral, even those that do not have any accents. In practice, this means that any character with a secondary weight not equal to MIN2 should have a strong secondary direction, either forwards or backwards. If this is not followed, the reversal of the accents in Step 3 of the algorithm will be broken up. For example, if combining_acute is marked as backwards and e, a and x are either backwards or neutral, then "éxa" correctly orders as less than "exá". Here are the corresponding sort keys:
However, if the x is marked as forwards, then it will sort incorrectly (for French) as "éxe" > "exé". Here are the corresponding sort keys showing the incorrect order.
Since the x is not marked as backwards, it interupts the reversal. |
Variable elements can be set to ignorable with a single entry. If the variable is ON, then variable elements are not ignorable. If it is OFF, then they are Level 3 Ignorables. The default setting is OFF. Setting the value to ON is the equivalent of having a large tailoring table that resets the values to be all non-ignorable. The set of variable elements can also be tailored with the syntax below.
<variable> := <variableSetting> | <variableList> <variableSetting> := @variableOn | @variableOff <variableList> := <variableAdd> | <variableRemove> <variableAdd> := '@variableAdd ' <characterRange> <variableRemove> := '@variableRemove ' <characterRange>
For example, here are lines that turn on variables, but have miscellaneous symbols and dingbats sort as non-ignorables.
@variableOn @variableRemove [2600-27FF] %Misc. Symbols & Dingbats
Positioning entries specify a reordering of the characters in the Collation Element Table, and are the most significant part of any tailoring.
<positioning> := <operand> <relation> <position> <operand> := <chars> <relation> := '>' | '>>' | '>>>' | '=' <position> := <contraction> | 'null'
The null positioning values are explained under Special Positioning Values below. The relations determine the strength of the ordering between the operand and the position, as shown in the table below.
Relation |
Action |
---|---|
'>' | make primary greater than (corresponding to >1) |
'>>' | make secondary greater than (corresponding to >2) |
'>>>' | make tertiary greater than (corresponding to >3) |
'=' | make identical in L1-L3 (corresponding to =3) |
A conformant implementation of this algorithm need not supply tailoring at a fourth or higher levels. However, should one do so, it is recommended that the additional relations be the natural extensions of the ones in the above table. |
The mapping of characters to collation elements creates an ordering of the characters, where each one is greater than or equal to the one before it. The exact relationship will be based on the relationship between the corresponding collation elements: identical, tertiary greater than, secondary greater than, or primary greater than.
For example, here is a except from the Default Unicode Collation Element Table. For brevity, a large number of intervening lines are omitted.
0030 [.06CF.0020.0002] % DIGIT ZERO 0031 [.06D0.0020.0002] % DIGIT ONE 00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO 0033 [.06D3.0020.0002] % DIGIT THREE 0034 [.06D5.0020.0002] % DIGIT FOUR ...
This establishes the following pairs of ordering (in these examples, "@" stands for the DINGBAT NEGATIVE CIRCLED DIGIT ONE since it is not displayable in HTML):
Sample Ordering |
---|
'0' <1 '1', '1' <3 '1', '1' <2 '@', '@' <1 '2', '2' <1 '3', '3' <1 '4' |
A positioning entry reorders the operand, removing it from where it was, and adding it at a new position. This new position is after the indicated position, and any orderings that are of lesser strength. Any item that previously had an ordering immediately after the new position are now set to be after the new operand. Any item that previously had an ordering immediately after the operand is now set to be after the item that had preceded the operand. The strength of the latter ordering is the stronger of the two previous orderings (recall that levels that are numerically smaller are stronger).
For example, suppose that A <3 B, B <2 C, S <j T, and T <k U. Then the positioning entry represented by "T >> A" will insert T between B and C, giving the following ordering: A <3 B, B <2 T, T <n C, and S <m U (where m is the smaller of j and k). Notice that T did not go immediately after the A, since that would have invalidated the strength of the ordering between A and B.
For example, here is the results of a tailoring entry, with the differences in boldface.
Original | After 0033 >> 0031 |
---|---|
'0' <1 '1', '1' <3 '1', '1' <2 '@', '@' <1 '2', '2' <1 '3', '3' <1 '4' |
'0' <1 '1', '1' <3 '1', '1' <2 '3', '3' <2 '@', '@' <1 '2', '2' <1 '4' |
The order of the entries in the tailoring table is very significant. For example, the following entries result in the ordering a <1 d <1 c, since the d is inserted immediately after the a. 0062 > 0061 % c > a 0064 > 0061 % d > a This is not the same as the same entries in reversed order, which results in the ordering a <1 c <1 d. 0064 > 0061 % d > a 0062 > 0061 % c > a |
All contractions must be explicitly specified. In particular, case variants of contractions must be considered. For example, if you are adding
ch > c % separate letter between c and d
you must also add the combinations of case variants for them to be properly ordered:
cH >>> ch Ch >>> cH CH >>> CH
There is a special value defined for use as a position:
Positioning Value |
Collation Element |
Meaning |
---|---|---|
null | [.0000.0000.0000.] | Collation Element with all ignorable weights |
The collation element resulting from a positioning relative to null will depend on the ordering relationship that is used. For example:
Entry |
Resulting Collation Element |
---|---|
0033 >>> null % '3' >>> null |
[.0000.0000.0001.] |
0033 >> null % '3' >> null |
[.0000.0001.0001.] |
0033 > null % '3' > null |
[.0001.0001.0001.] |
This value can be used to produce ignorables at any level. For example, the following entries will produce the corresponding collation elements:
x >>> null % produces [.0000.0000.0001.] y >>> x % produces [.0000.0000.0002.] z >> y % produces [.0000.0001.0001.]
Once the tailoring file has be read and the orderings produced, it must be processed further to produce the collation element mappings.
In certain cases of tailoring, a sequence of N characters are treated as contraction, and yet the initial N-1 characters are not. If these sequences end with more than one combining mark, the Collation Element Table must be "completed". This completion requires the addition of an extra entry for the initial N-1 character substring (and is repeated recursively if that new entry requires completion). This step is only required in unusual circumstances, where characters with multiple accents are given special orderings.
For example, suppose that the table is set up so that the contraction <a diaeresis macron> sorts as a single character after e.
e <1 <a diaeresis macron>
Then even if <a diaeresis> has no special sorting behavior, an entry must still be added for it in the Collation Element Table. This entry, however, will simply be an identity mapping.
a diaeresis = a diaeresis
It will result in the following mapping to the normal collation elements:
a diaeresis => [a1, a2, a3] [diaeresis1, diaeresis2, diaeresis3]
This completion phase is required to allow for intervening combining mark handling in Step 2 of the main algorithm. It allows for the sequence <a diaeresis macron> to be successively mapped in that step, without stopping at the diaeresis.
(As in the rest of this document, this is a logical description of how the algorithm must behave--the Collation Element Table and algorithm can be restructured arbitrarily as long as the results of comparing strings according to the the new algorithm and table are the precisely the same.)
At the end of the process of applying all of the entries to the table and doing any necessary completion, a new set of collation elements will need to be generated from the new ordering according to the following process. For each character sequence in the ordering, an associated list of collation elements will be produced. In the normal case, this list will consist of exactly one collation element, but in the case of expansions it will consist of more (indicated by the sequence R1,...,Rn below).
The first collation element F must be defined for the ordering to be valid. The subsequent ones R1,...,Rn may not have been assigned at this point in the process. If not, their values are substituted in a second pass. |
For example, suppose we have the ordering:
ae <3 a combining_diaeresis
The collation element for a combining_diaeresis (= ä) will be generated as:
a combining_diaeresis => [a1, a2, a3+1], [e1, e2, e3]
(The collation element for e may not yet have been determined at this point in the resolution--however, its value will be substituted in a second pass.)
Once the ordering has been used to generate a new mapping to collation elements, the direction and variable sets are applied. Every character sequence in the mapping is tested against these two sets. If it is in one of the sets then the collation element is appropriately marked. If a direction or variable set contains a character sequence that is not in the ordering, then that character sequence is ignored. For example, if there is no <a, ring> contraction in the ordering, then setting the direction of <a, ring> to be backwards will have no effect.
As a result of this process, a new set of collation elements are generated that maintain the new tailored order, and have the correct direction and variable status.
If precomposed characters are to be given weights for faster processing of Normalization Form C text (see Unicode Technical Report #15, "Unicode Normalization Forms"), then their weights are now generated from the decomposed collation elements, as described in Secondary Weight Table Derivation below.
If any sequence of combining characters is marked as backwards, then that sequence must be sorted in the reverse order of characters in the sequence. Because of this, computed backwards secondary weights for precomposed characters must be generated differently than if they are forwards. Here is an example that shows the difference between the computed weights.
Weight | All Forwards | All Backwards |
---|---|---|
1 2 3 4 5 6 7 8 |
grave acute circumflex circumflex grave circumflex acute breve breve grave breve acute |
grave circumflex grave breve grave acute circumflex acute breve acute circumflex breve |
In both cases, the individual combining marks (shown in boldface) are in the same order. In the second case (all backwards), however, a sequence of combining marks is placed after the last character of the sequence instead of the first.
Implementations may choose to support grouping, whereby groups of characters are inserted in the tailoring ordering. Grouping preserves the ordering among the elements of the group, and inserts them as a whole at the appropriate position. Where this is offered, the syntax above is extended as follows:
<operand> := <chars> | <characterGroup> <characterGroup> := <chars> '-' <chars>
For example:
Entry |
Resulting Order |
---|---|
0033-0034 > 0031 % '3'-'4' > '1' |
'0' <1 '1' <1 '3' <3 '3' ... <1 '4' <1 '1' <2 '@' <1 '2' |
In addition to tailoring, some implementation may choose to preprocess the text for special purposes. Once such preprocessing is done, the standard algorithm can be applied.
Examples include:
Such preprocessing is outside of the scope of this document.
As noted above for efficiency, implementations may vary from this logical algorithm so long as they produce the same result. The following items discuss various techniques that can be used for reducing sort key length, and customizing for addtional environments. The topics include:
String |
Sort Key |
---|---|
càb (0) | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062 |
càb (1) | 0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002 0000 0063 0061 0300 0062 |
String |
Sort Key |
---|---|
càb (0) | 07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 02 00 02 00 00 00 63 00 61 03 00 00 62 |
càb (1,2) | 07 06 06 D9 06 EE 20 20 21 20 02 02 02 02 00 00 00 63 00 61 03 00 00 62 |
String |
Sort Key |
---|---|
càb (0) | 07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 01 00 02 00 00 00 63 00 61 03 00 00 62 |
càb (1,2) | 070606D9 06EE2020 21200202 01020000 00630061 03000062 |
Java 1.1 implements a subset of the features described in this document. For those who are interested, the main differences are:
Unicode Collation Algorithm Tailoring | Java |
---|---|
x > y | & y < x |
x >> y | & y ; x |
x >>> y | & y , x |
x = y | & y = x |
Unicode Collation Algorithm Tailoring |
Java |
---|---|
0041 >>> 0061 % A >>> a 00E0 >> 0041 % à >> A 00C0 >>> 00E0 % À >>> à 0042 > 00C0 % b > À 0062 >>> 0042 % B >>> b |
a, A ; à, À < b, B |
This section describes the generation of the Unicode Default Unicode Collation Element Table. Certain weights of characters are algorithmically generated from the Unicode Character Database on ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt (the format is explained on ftp://ftp.unicode.org/Public/UNIDATA/README.TXT). The following describes the process of producing these derived weights for a particular Collation Element Table. It is used for all characters except those explicitly tailored.
CJK Ideographs are given a primary weight equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively.
Characters without decompositions are given tertiary weights according to their General Category in the Unicode Character Database:
General Category |
Weight |
---|---|
Lu (uppercase Letter) | 0008 |
anything else | 0002 |
Characters whose decompositions are of the form base character + multiple non-spacing mark(s) have primary and tertiary weights which are the same as those of the base character. They are given a secondary weight which is derived from the secondary weights of the sequence of non-spacing marks according to a Secondary Weight Table.
The Secondary Weight Table is derived in the following way.
For example, here is the derivation for 1F83 GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI in the Default Unicode Collation Element Table. Notice that the base character's original secondary weight (0020 = MIN2) is completely disregarded in this process.
Character |
Collation Element |
Resulting Collation Element |
Name |
---|---|---|---|
03B1 | [0961.0020.0002] | [0961.0020.0002] | GREEK SMALL LETTER ALPHA |
+ 0314 | + [0000.0034.0002] | COMBINING REVERSED COMMA ABOVE = Dasia | |
map(0020,0034) => 0034 | [0961.0034.0002] | ||
+ 0300 | + [0000.0021.0002] | COMBINING GRAVE ACCENT = Varia | |
map(0034,0021) => 011B | [0961.011B.0002] | ||
+ 0345 | + [0000.0061.0002] | COMBINING GREEK YPOGEGRAMMENI = Iota subscript | |
map(011B,0061) => 0131 | [0961.0131.0002] |
If a Collation Element Table has been tailored so that non-spacing marks are reordered, then it will result in a different set of weights in the Secondary Weight Table.
Characters with other decompositions are given tertiary weights based upon their compatibility tag. The collation element is the one or more collation elements of the decomposition, with the tertiary weight reset according to the following table:
For example, from the Unicode Character Property database, SQUARE PA AMPS has a decomposition of 0070 0041, with a compatibility tag of <square>. The collation elements for this sequence is [.083C.0020.0002] [.06D9.0020.0008]. The tertiary weights are then reset according to the Tertiary Weight Table, resulting in the collation elements [.083C.0020.0016] [.06D9.0020.0017].
Digits are given special-case handling. The primary value is assigned based on the numerical value of the digit, and secondary values are derived from the script differences. Numerics that have compatibility equivalents are given weights based upon those equivalents; for example, ROMAN NUMERAL IX is given weights near those of the sequence "I", "X". Numerics that are neither digits nor have compatibility equivalents are treated as miscellaneous symbols.
There are two important points to remember when doing weight derivations for decompositions.
Q: Does JIS require tailorings?
A: The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring.
Q: How are Hiragana readings handled for Kanji?
A: There is no algorithmic mapping from Kanji characters to the phonetic readings for those characters, because there is too much linguistic variation. The common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings.
Q: Is transitive consistency maintained?
A: Yes, for any strings A, B, and C, if A < B and B < C, then A < C. However, implementors must be careful to produce implementations that accurately reproduce the results of the Unicode Collation Algorithm as they optimize their own algorithms. It is easy to perform careless optimizations--especially with Incremental Comparison algorithms--that fail this test. Other items to check are the proper distinction between the bases of accents, such as in the following example:
Q: How are mixed Japanese and Chinese handled?
A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformally, no matter what the source. Japanese customers would want them sorted in the Japanese fashion, etc. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation).
Q: Are the half-width katakana properly interleaved with the full-width?
A: Yes, the Default Unicode Collation Element Table properly interleaves half-width katakana, full-width katakana, and full-width hiragana. They also interleave the voicing and semi-voicing marks correctly, whether they are precomposed or not.
Q: How are names in a database sorted properly?
A: An important feature of international sorting is that it makes a big difference whether strings are considered in isolation or not. Suppose that your database is sorted first by last name, then by first name. Since they are sorted first, a secondary or tertiary difference in the last name will completely swamp a primary difference in the first name. So "Zelda Casares" will sort before "Albert Cásares". If this behavior is not desired, then the database should be sorted by a constructed field which contains last name + ',' + first name. This will end up sorting the record with "Cásares, Albert" before the one with "Casares, Zelda".
Q: Can the katakana length mark be handled properly, as described in Section 5.15 of the Unicode Standard?
A: Yes, by using a combination of contraction and expansion, the length mark can be sorted differently, according to the vowel of the previous katakana character.
Q: Is this done in the Default Unicode Collation Element Table?
A: No, the Default Unicode Collation Element Table does not have any contractions. These can be added in tailoring, with lines of the following form. Notice that these are both contractions and expansions.
30AB 30FC >>> 30AB 30A1 % KA <length_mark> is after KA A 30AD 30FC >>> 30AB 30A4 % KI <length_mark> is after KI I ...
Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.
The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions. Java is a trademark of Sun Microsystems, Inc.
Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/reports/