Draft Unicode Technical Report #10
Unicode® Collation Algorithm

Revision 1.0

Authors Mark Davis and Ken Whistler

Date 1997-03-30

This Version http://www.unicode.org/unicode/reports/tr10/index.html

Previous Version none

Latest Version same as this version

This report presents the specifications of the Unicode Collation Algorithm. It is divided into the following sections:

Scope
Conformance
Data Table Format
Main Algorithm
Weight Derivation
Tailoring
Implementation Notes

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to Unicore@Unicode.org.

Scope

The Unicode Collation Algorithm describes how to compare two Unicode strings, using the data in a Collation Element Table, and consistent with the linguistic requirements of The Unicode Standard, Version 2.0, Section 5.15 Sorting and Searching. (Readers should be familiar with that section before proceeding.)

Goals

The algorithm is designed to satisify the following goals.

A complete, unambiguous, specified ordering for all characters in Unicode.
A complete resolution of the handling of canonical and compatibility equivalences as relates to the default ordering.
A complete specification of the meaning and assignment of collation levels, including whether a character is ignorable by default in collation.
A complete specification of the rules for using the level weights to determine the default collation order of strings of arbitrary length.
A well-defined mechanism (tailorability) to override the default order to create language-specific orderings in detail.
An algorithm that can be efficiently implemented, both in terms of performance and in terms of memory requirements.

Given the standard ordering and the tailoring for any particular country, any two companies or individuals--with their own proprietary implementations--could take any arbitrary Unicode input and produce exactly the same sorted output. In addition, when given a tailoring specifying French accents this algorithm passes the Canadian (and ISO DIS 14651) benchmark, as mentioned in the section on Processing French Accents.

Non-Goals

The Default Table explicitly does not provide for the following features:

reversibility: from a Collation Element you are not guaranteed that you can recover the original character.
numeric formatting: That numbers composed of a string of digits or other numerics will not necessarily sort in numerical order.
API: No particular API is specified or required for the algorithm.
title sorting: removing articles such as a and the.

Summary

Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a data table. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be compared bit-by-bit to give the correct comparison between the strings for which they were generated.

The Unicode Collation Algorithm assumes multiple-level key weighting, along the lines widely implemented in IBM technology, and as described in the Canadian sorting standard (CSA Z243.4) and the proposed International String Ordering standard (ISO/IEC 14651).

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:

alphabetic ordering
diacritic ordering
case ordering.

A fourth level is used for tie-breaking between strings not distinguished at the first three levels; by default the fourth level uses weights defined to be the Unicode code value of a character. The fourth level weights for a character allow partial customization in that they can be reset to be the Unicode code value of a different character.

This design, called 3-plus levels, allows implementations to produce culturally acceptable collation, while putting the least burden on implementations in terms of memory requirements and performance. In particular, data tables for the Unicode Collation Algorithm only require storage of 32 bits per significant character.

However, implementations of the Unicode Collation Algorithm are not limited to supporting only 3-plus levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right data tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first 3 levels and must have those specials collate in non-Unicode order (as, for example to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the data table and in constructed sort keys.

A tailoring mechanism is described for adjusting the data table to produce particular culturally required orderings for different languages or locales. This tailoring mechanism allows for full tailoring of the first three levels and a limited tailoring of the fourth level. As in the algorithm itself, the tailoring mechanism can be extended to provide full customization for four (or more) levels.

Conformance

There are many different ways to compare strings, and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document.

The algorithm is a logical specification, designed to be straightforward to describe. Actual implementations of the algorithm are free to change any part of the algorithm so long as any two strings compared by the implementation are ordered the same as they would be by the algorithm. See Implementation Notes. The sort key is also a logical intermediate object: so long as an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified here.

The requirements for conformance are as follows:

A conformant implementation of the Unicode Collation Algorithm shall be able to replicate the same comparisons of Unicode strings as those implied by the algorithm as described in this document, using the Default Unicode Collation Element Table.
When supplied with tailoring information as described in this document, a conformant implementation of the Unicode Collation Algorithm shall be able to replicate the same comparisons of Unicode strings as those implied by the algorithm as described in this document, using the Default Unicode Collation Element Table and the specified tailoring information.

A conformant implementation of this algorithm is only required to implement 3-plus levels. However, a conformant implementation may choose to implement 4 (or more) levels.

Data Table Format

A Collation Element Table contains a mapping from one (or more) characters to one (or more) collation elements. A collation element is an ordered list of three 16-bit weights. (Implementations can produce the same result without using 16-bit weights--see Implementation Notes.)

The first weight is called the Level 1 weight (or primary weight), the second is called the Level 2 weight (secondary weight), and the third is called the Level 3 weight (tertiary weight). For a collation element X, these can be abbreviated as L1(X), L2(X), and L3(X).

Given two collation elements X and Y:

X < Y (X is primary less than Y)
when L1(X) < L1(Y)
X << Y (X is secondary less than Y)
when L1(X) = L1(Y) and L2(X) < L2(Y)
X <<< Y (X is tertiary less than Y)
when L1(X) = L1(Y), L2(X) = L2(Y) and L3(X) < L3(Y)
X = Y (X is equal to Y)
when L1(X) = L1(Y), L2(X) = L2(Y) and L3(X) = L3(Y)

The collation algorithm results in a similar ordering among characters and strings, so that for two strings A and B we can write A << B, meaning that A is greater than B and there is only a secondary difference between them. If two strings have no primary, secondary or tertiary difference according to a given Collation Table, then we write A = B. If, in addition, they are equivalent under the Unicode Canonical Equivalence Algorithm and any tailorings equivalents, we write A == B. If they are bit-for-bit identical, we write A === B.

If a weight is 0000, than that collation element is ignorable at that level: the weight at that level is not taken into account in sorting. A collation element that is ignorable at Level 1, but not at Level 2 is called a Level 1 ignorable; similarly, a Level 2 ignorable is ignorable at Levels 1 and 2, but not 3, and a Level 3 ignorable is ignorable at Levels 1, 2, and 3. A collation element that is not ignorable at Level 1 is called a non-ignorable.

To simplify the statement of the algorithm, only initial weights in a collation element can be 0000; no interior or trailing weights can be 0000. In other words, if any collation element has a non-zero weight, no subsequent weights in the collation element can be zero. Any collation element that violates this rule is called ill-formed.

The following are sample collation elements that are used in the examples illustrating the algorithm. Unless otherwise noted, all weights are in hexadecimal format.

Sample Table:

Character

Collation Element

Name

0300 "`" [0000.0021.0002] COMBINING GRAVE ACCENT

0061 "a" [06D9.0020.0002] LATIN SMALL LETTER A

0062 "b" [06EE.0020.0002] LATIN SMALL LETTER B

0063 "c" [0706.0020.0002] LATIN SMALL LETTER C

0043 "C" [0706.0020.0008] LATIN CAPITAL LETTER C

0064 "d" [0712.0020.0002] LATIN SMALL LETTER D

*Sample Table:*
Character	Collation Element	Name
0300 "`"	`[0000.0021.0002]`	COMBINING GRAVE ACCENT
`0061 "a"`	`[06D9.0020.0002]`	LATIN SMALL LETTER A
`0062 "b"`	`[06EE.0020.0002]`	LATIN SMALL LETTER B
`0063 "c"`	`[0706.0020.0002]`	LATIN SMALL LETTER C
`0043 "C"`	`[0706.0020.0008]`	LATIN CAPITAL LETTER C
`0064 "d"`	`[0712.0020.0002]`	LATIN SMALL LETTER D

Default Unicode Collation Element Table

The Default Unicode Collation Element Table is available in two parts:

http://www.unicode.org/unicode/reports/tr10/basekeys.txt

http://www.unicode.org/unicode/reports/tr10/compkeys.txt

The file basekeys.txt lists all the Unicode characters that correspond to a single collation element. The file compkeys.txt lists all the Unicode characters that correspond to a sequence of collation elements.

This table is constructed to be consistent with the Unicode Canonical Equivalence algorithm, and to respect the Unicode Character Properties. It is not, however, merely algorithmically derivable from those data, since the assignment of levels does take into account characteristics of particular scripts. For example, in general the combining marks are Level 1 ignorables; however, the Indic vowels are given non-zero Level 1 weights, since they are as significant in sorting as the consonants.

Any character may have variant forms or applied accents which affect collation. Thus, for FULL STOP there are three compatibility variants, a fullwidth form, a compatibility form, and a small form. These get different tertiary weights, accordingly.

File Format

Each of the files consists of a version line followed by a series of entries, all separated by newlines. The version line is of the form:

<version> := <major>.<minor>.<variant>

The entries are each of the form:

<entry>  := <char>+<SP><collElement>+<SP>"%"<SP><name>{";"<SP><decomp>}



<decomp> := <decompType> {"SEQ"}



<decompType> := "CANON" | "COMPAT"

where "x+" indicates one or more x's, {x} indicates zero or one x, and SP is whitespace. Each entry is a mapping from character(s) to collation element(s). Here are some selected entries:

0020 [*0209.0020.0002.0020] % SPACE



02DA [*0209.002B.0002.02DA] % RING ABOVE; COMPATSEQ



0041 [.06D9.0020.0008.0041] % LATIN CAPITAL LETTER A



3373 [.06D9.0020.0017.0041] [.08C0.0020.0017.0055] % SQUARE AU; COMPATSEQ



00C5 [.06D9.002B.0008.00C5] % LATIN CAPITAL LETTER A WITH RING ABOVE; CANONSEQ



212B [.06D9.002B.0008.212B] % ANGSTROM SIGN; CANONSEQ



0042 [.06EE.0020.0008.0042] % LATIN CAPITAL LETTER B



0043 [.0706.0020.0008.0043] % LATIN CAPITAL LETTER C



0106 [.0706.0022.0008.0106] % LATIN CAPITAL LETTER C WITH ACUTE; CANONSEQ



0044 [.0712.0020.0008.0044] % LATIN CAPITAL LETTER D

The entries in each file are ordered by collation element, not by character. This makes it easy to see the order in which characters would be collated. The <decomp> information is informative.

Although this document describes collation elements as three levels, the file contains a fourth level (as in [.0712.0020.0008.0044]) which is identical with the character (or its decomposition, if there is one). For more information on that, see Avoiding Normalization under Implementation Notes.

Implementations can also add more customizable levels, as discussed above under conformance. For example, an implementation might want to be capable not only of handling the standard Unicode Collation, but also capable of emulating an EBCDIC multi-level ordering (having a fourth-level EBCDIC binary order).

Variable Collation Elements

Some of the elements are marked with an asterisk, such as:

Character

Collation Element

Name

0020 " " [*0209.0020.0002] SPACE

Character	Collation Element	Name
`0020 " "`	`[*0209.0020.0002]`	SPACE

The asterisk indicates that the collation element can have two different values. By default, the stated weights are ignored, and it is interpreted as a Level 3 ignorable (e.g. [0000.0000.0000]). However, an API may allow users to toggle a switch making these characters non-ignorable: resulting in simply [0209.0020.0002].

Multiple Mappings

The mapping from characters to collation elements may not be a simple mapping from one character to one collation element: in general, it may map from one to many, from many to one, or from many to many. For example:

Expansions

The Latin letter æ is treated as an independent letter by default. Collations such as English, which may require treating it as equivalent to an a+e sequence, can tailor the letter to map to the following collation elements, such as in the following example:

Character

Collation Element

Name

00E6 "æ" [06D9.0020.0002], [073A.0020.0002] LATIN SMALL LETTER AE

Character	Collation Element	Name
`00E6 "æ"`	`[06D9.0020.0002], [073A.0020.0002]`	`LATIN SMALL LETTER AE`

In this example, the collation element [06D9.0020.0002] gives the weight values for a, and the collation element [073A.0020.0002] gives the weight values for e.

Contractions

Similarly, where ch is treated as a single letter as in traditional Spanish, it is represented as a mapping from two characters to a single Collation Element, such as in the following example

Character

Collation Element

Name

0063, 0068 "ch" [0707.0020.0002] LATIN SMALL LETTER C, LATIN SMALL LETTER H

Character	Collation Element	Name
`0063, 0068 "ch"`	`[0707.0020.0002]`	`LATIN SMALL LETTER C, LATIN SMALL LETTER H`

In this example, the collation element [0707.0020.0002] has a primary value one greater than the primary value for the letter c by itself, so that the sequence ch will collate after c and before d. The Default Unicode Collation Element Table does not contain any contractions. The above example shows how a tailoring of collation elements could be used to weight sequences of letters as a single unit.

Other Multiple Mappings

Certain characters may both expand and contract: see page 5-31 of The Unicode Standard, Version 2.0.

Numerics

Digits are given special-case handling. The primary value is assigned based on the numerical value of the digit, and secondary values are derived from the script differences. Numerics that have compatibility equivalents are given weights based upon those equivalents; for example, ROMAN NUMERAL IX is given weights near those of the sequence "I", "X". Numerics that are neither digits nor have compatibility equivalents are treated as miscellaneous symbols.

French Accents

In some languages (notably French), accents are sorted from the back of the string to the front of the string. These accents do not occur in the Default Table, but may occur in tailored tables. For more information, see Processing French Accents below.

Main Algorithm

The main algorithm has four steps. First is to normalize each input string, second is to produce an array of collation elements for each string, and third is to produce a sort key for each string from the collation elements.

Two sort keys can then be compared with a binary comparison; the result is the ordering for the original strings.

1. Normalize each input string

Produce a normalized form of each input string, applying the following three steps:

If any characters have tailoring equivalents, replace them by those equivalents.
Use the Unicode canonical algorithm to decompose characters according to the canonical mappings.
If any character is marked as rearranging (see below), swap it and the succeeding character.

Example:

input string: càb

normalized string: ca`b

Note:

If you are only handling a restricted subset of 10646/Unicode which does not include combining marks, you can omit step 1. See the Implementation Notes.

To allow for such cases, the Default Collation Element Table does define values for characters (such as accented characters or Hangul Syllables) which would otherwise be normalized to a sequence of characters. For example,

Character

Collation Element

Name

00E0 "à" [06D9.0021.0002] LATIN SMALL LETTER A WITH GRAVE

Tailoring Equivalents

Characters can be tailored to map directly to other characters. For example, this can be used to precisely identify V and W or Y and Ü in Swedish; or to add additional decompositions such as mapping ø to o + /. The effect of this step is equivalent to a partially-customizable 4th level, as will be discussed below. For more information, see Tailoring.

Rearrangement

The characters causing rearrangement are the Thai vowels 0E40 through 0E44 and the Lao vowels 0EC0 through 0EC4. For example, here is a normalized string including rearrangement:

input string: 0E01 0E40 0E02 0E03

normalized string: 0E01 0E02 0E40 0E03

2. Produce an array of collation elements for each string

Do this by sequencing through the normalized form as follows:

Find the longest initial substring at each point that has a match in the table.
Append the corresponding collation element(s) from the table to an array of collation elements.
Proceed to the next point in the string (past the initial substring match).
Loop until done

Example:

normalized string: ca`b

collation element array: `[0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]`

*Example:*
normalized string:	ca`b
collation element array:	`[0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]`

Note:
On some systems the grave (`) will show up as a single left quote.

Note:	On some systems the grave (`) will show up as a single left quote.

French accents require special processing at this point. For a description of this processing, see Processing French Accents below.

3. Form a sort key for each string

Form the sort key by successively appending weights from the collation element array as follows:

For each weight level L,
- Append a level separator, except at the start
- Then, for each collation element K in the array
  - Append each non-zero weight at level L
At the end, append a level separator and then append the normalized Unicode string. This normalized string defines a fourth level.

Note:

The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter. For example:

"cab" < "cabX" where "X" can be any character(s)

Note:	The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter. For example: "cab" < "cabX" where "X" can be any character(s)

Note:

Even if two different Unicode strings do not differ in level 1, 2 or 3 weights, the addition of the normalized string at the end will distinguish between them, unless they normalize to precisely the same string. This is how the normalized string defines a fourth level.

This fourth level in the base algorithm can be partially customized by specifying tailored equivalents in Step 1. These equivalents essentially modify the fourth level to eliminate differences between the equivalents. For example, if W is a tailoring equivalent to V for Swedish, then the normalized string will be the same, which means that there is no fourth-level difference.

Note:	Even if two different Unicode strings do not differ in level 1, 2 or 3 weights, the addition of the normalized string at the end will distinguish between them, unless they normalize to precisely the same string. This is how the normalized string defines a fourth level. This fourth level in the base algorithm can be partially customized by specifying tailored equivalents in Step 1. These equivalents essentially modify the fourth level to eliminate differences between the equivalents. For example, if W is a tailoring equivalent to V for Swedish, then the normalized string will be the same, which means that there is no fourth-level difference.

Example:

collation element array: [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]

sort key: 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062

*Example:*
collation element array:	`[0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]`
sort key:	`0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062`

4. Compare the sort keys

Compare the sort keys for each of the imput strings, using a binary comparison. This means that:

Differences in the original strings (normalized) are ignored if there are any Level 1, 2, or 3 differences
Level 3 differences are ignored if there are any Level 1 or 2 differences
Level 2 differences are ignored if there are any Level 1 differences
Level 1 differences are never ignored.

Example:

String

Sort Key

cab 070606D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0063 0061 0062

Cab 070606D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002 0000 0043 0061 0062

càb 070606D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062

dab 071206D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0064 0061 0062

*Example:*
String	Sort Key
cab	`0706``06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0063 0061 0062`
Cab	`0706``06D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002 0000 0043 0061 0062`
càb	`0706``06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062`
dab	`0712``06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 0000 0064 0061 0062`

In this example, "cab" < "Cab" < "càb" < "dab". The differences that produce the ordering are shown by the boldface items:

For the first two strings, the first difference is in 0002 vs 0008 (Level 3)
For the middle two strings the first difference is in 0020 vs 0021 (Level 2)
For the last two strings, the first difference is in 0706 vs 0712 (Level 1).

Weight Derivation

Certain weights of characters are algorithmically generated from the Unicode Character Database on ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt (the format is explained on ftp://ftp.unicode.org/Public/UNIDATA/README.TXT). The following describes the process of producing these derived weights for a particular Collation Element Table. It is used for all characters except those explicitly tailored.

1. CJK Ideographs are given a primary weight equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively.

2. Characters without decompositions are given tertiary weights according to their General Category in the Unicode Character Database:

General Category

Weight

Lu (uppercase Letter) 0008

anything else 0002

General Category	Weight
Lu (uppercase Letter)	0008
anything else	0002

3. Characters whose decompositions are of the form base character + multiple non-spacing mark(s) have primary and tertiary weights which are the same as those of the base character. They are given a secondary weight which is derived from the secondary weights of the sequence of non-spacing marks according to a Secondary Weight Table.

This table is derived in the following way.

Find all the sequences of one or more non-spacing marks that occur in the canonical decomposition of some Unicode character.
Order these sequences of non-spacing marks by their collation elements in the Collation Element Table
Assign each sequence a derived weight based on its ordinal value in this ordering, starting at the minimum value for L2.

For example, here is the derivation for 1F83 GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI in the Default Unicode Collation Element Table. Notice that the base character's original secondary weight (0020) is completely disregarded in this process.

Character

Collation Element

Resulting Collation Element

Name

03B1 [0961.0020.0002] [0961.0020.0002] GREEK SMALL LETTER ALPHA

+ 0314 + [0000.0034.0002] COMBINING REVERSED COMMA ABOVE
= Dasia

.0020. + .0034. => .0034. [0961.0034.0002]

+ 0300 + [0000.0021.0002] COMBINING GRAVE ACCENT = Varia

.0034. + .0021. => .011B. [0961.011B.0002]

+ 0345 + [0000.0061.0002] COMBINING GREEK YPOGEGRAMMENI
= Iota subscript

.011B. + .0061. => .0131. [0961.0131.0002]

Character	Collation Element	Resulting Collation Element	Name
`03B1`	`[0961.0020.0002]`	`[0961.0020.0002]`	GREEK SMALL LETTER ALPHA
`+ 0314`	`+ [0000.0034.0002]`		COMBINING REVERSED COMMA ABOVE = Dasia
`.0020. + .0034. => .0034.`	`[0961.0034.0002]`
`+ 0300`	+ `[0000.0021.0002]`		COMBINING GRAVE ACCENT = Varia
`.0034. + .0021. => .011B.`	`[0961.011B.0002]`
`+ 0345`	`+ [0000.0061.0002]`		COMBINING GREEK YPOGEGRAMMENI = Iota subscript
`.011B. + .0061. => .0131.`	`[0961.0131.0002]`

If a Collation Element Table has been tailored so that non-spacing marks are reordered, then it will result in a different set of weights in the Secondary Weight Table.

4. Characters with other decompositions are given tertiary weights based upon their compatibility tag. The collation element is the one or more collation elements of the decomposition, with the tertiary weight reset according to the following table:

Tertiary Weight Table

Original Weight

Compatility Tag

Resulting Weight

0002 NONE 0002

0002 <wide> 0003

0002 <narrow> 0004

0002 <compat> 0005

0002 <font> 0006

0002 <circle> 0007

0008 NONE 0008

0008 <wide> 0009

0008 <compat> 000A

0008 <font> 000B

0008 <circle> 000C

0002 <super> 000D

0002 <sub> 000E

0002 <small> 000F

0002 <vertical> 0010

0002 <initial> 0011

0002 <medial> 0012

0002 <final> 0013

0002 <isolated> 0014

0002 <noBreak> 0015

0002 <square> 0016

0008 <square> 0017

0002 <fraction> 0018

Tertiary Weight Table
Original Weight	Compatility Tag	Resulting Weight
`0002`	NONE	0002
`0002`	<wide>	0003
`0002`	<narrow>	0004
`0002`	<compat>	0005
`0002`	<font>	0006
`0002`	<circle>	0007
`0008`	NONE	0008
`0008`	<wide>	0009
`0008`	<compat>	000A
`0008`	<font>	000B
`0008`	<circle>	000C
`0002`	<super>	000D
`0002`	<sub>	000E
`0002`	<small>	000F
`0002`	<vertical>	0010
`0002`	<initial>	0011
`0002`	<medial>	0012
`0002`	<final>	0013
`0002`	<isolated>	0014
`0002`	<noBreak>	0015
`0002`	<square>	0016
`0008`	<square>	0017
`0002`	<fraction>	0018

For example, from the Unicode Character Property database, SQUARE PA AMPS has a decomposition of 0070 0041, with a compatibility tag of <square>. The collation elements for this sequence is [.083C.0020.0002] [.06D9.0020.0008]. The tertiary weights are then reset according to the Tertiary Weight Table, resulting in the collation elements [.083C.0020.0016] [.06D9.0020.0017].

There are two important points to remember when doing weight derivations for decompositions.

The decompositions in the Unicode Character Database are not maximal: a compatibility decomposition can contain a character that has a canonical decomposition, and vice versa. That means that the above techniques must be applied recursively.
The older Unicode Character Database contained several canonical decompositions that were not in canonical order. If your implemenation is based on this table, then these decompositions must be put in canonical order first.

Tailoring

This section describes the mechanism for tailoring the default ordering for a particular language. A tailoring provides a specification for customizing the default ordering by:

reordering any character (or sequence of characters) with respect to others in the standard ordering. Such a reordering can represent a Level 1 difference, Level 2 difference, Level 3 difference, or identity (in levels 1 to 3). Since such reordering include sequences, arbitrary multiple mappings can be specified.
turning Variable Collation Elements on or off (see above).
specifying French Accents (see below).

Format

The following specifies the format for tailoring files. A tailoring file consists of version line followed by a list of entries, all separated by newlines. The ordering among the list of entries may be significant. Extra whitespace is not significant. A "%" character indicates a comment.

The version has the following format:

<version> := <major>.<minor>.<variant>

Each entry can have several forms, based on the following:

<entry> := <french> | <variable> | <positioning>

There are four forms of entries, described below. Each entry has a cumulative effect on the collation table. Thus where there are conflicts between any rules, the last one wins.

<french> := <frenchOn> | <frenchOff>



<frenchOn> := @FrenchSecondary <SP> <characterList>



<frenchOff> := @NotFrenchSecondary <SP> <characterList>

This simply marks a character as having a French secondary, or not. The state in the Default table is OFF.

<variable> := <variableOn> | <variableOff>



<variableOn> := @Variable <SP> <characterList>



<variableOff> := @NotVariable <SP> <characterList>

If the variable is ON, then variable elements are not ignorable. If it is OFF, then they are Level 3 Ignorablest. The state in the Default table is OFF. Setting the value to ON is the equivalent of having a large tailoring table that resets the values to be all non-ignorable.

<positioning> := <operand> <SP> <relation> <SP> <position> 



<position> := <characterList>



           := null



           := maxsi



           := maxpi



           := maxni



<operand> := <characterList>

The special values null, maxsi, maxpi, maxni are explained under Positioning Actions.

The values of <relation> are shown in the table below. The meaning of the bracketing column will be explained in the section on Bracketing.

Action

Bracketing

relation := > make primary greater than none

relation := >| position

relation := |>| operand and position

relation := >> make secondary greater than none

relation := >>| position

relation := |>>| operand and position

relation := >>> make tertiary greater than none

relation := = make identical in L1-L3 none

relation := |= operand

relation := == adds tailoring equivalent; makes identical in L1-L4 none

relation := |== operand

		Action	Bracketing
relation :=	>	make primary greater than	none
relation :=	>\|	position
relation :=	\|>\|	operand and position
relation :=	>>	make secondary greater than	none
relation :=	>>\|	position
relation :=	\|>>\|	operand and position
relation :=	>>>	make tertiary greater than	none
relation :=	=	make identical in L1-L3	none
relation :=	\|=	operand
relation :=	==	adds tailoring equivalent; makes identical in L1-L4	none
relation :=	\|==	operand

Note:
A conformant implementation of this algorithm need not supply tailoring at a fourth or higher levels (beyond what is supplied by tailoring equivalents). However, should one do so, it is recommended that the additional relations be the natural extensions of the ones in the above table.

Note:	A conformant implementation of this algorithm need not supply tailoring at a fourth or higher levels (beyond what is supplied by tailoring equivalents). However, should one do so, it is recommended that the additional relations be the natural extensions of the ones in the above table.

Positioning Entries

The mapping of characters to collation elements creates an ordering of the characters, where each one is greater than or equal to the one before it. The exact relationship will be based on the relationship between the corresponding collation elements: identical, tertiary greater than, secondary greater than, primary greater than.

For example, here is a except from the Default Unicode Collation Element Table

0030 [.06CF.0020.0002] % DIGIT ZERO



0031 [.06D0.0020.0002] % DIGIT ONE



00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE



2081 [.06D0.0020.000E] % SUBSCRIPT ONE



2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE



0032 [.06D1.0020.0002] % DIGIT TWO



...

This establishes the following ordering:

... 0 < 1 <<< ¹ <<< ₁<< @ < 2 < 3 <<< ³ <<< ₃ << $ < 4

(In these examples, "@" stands for the DINGBAT NEGATIVE CIRCLED DIGIT ONE, and "$" stands for DINGBAT NEGATIVE CIRCLED DIGIT THREE).

Basically, a positioning entry reorders the operand, removing it from where it was, and adding it at the indicated position with a specified relationship. (For now, we will explain the operation of the positioning entry simply in terms of single characters.)

Simple Operations

The simple operations (without bracketing) insert a single operand immediately after the position. For example, here are the results of various simple operations), with the change in boldface.

3 > 1

... 0 < 1 < 3 <<< ¹ <<< ₁<< @ < 2 <<< ³ <<< ₃ << $ < 4

3 >> 1

... 0 < 1 << 3 <<< ¹ <<< ₁<< @ < 2 <<< ³ <<< ₃ << $ < 4

3 >>> 1

... 0 < 1 <<< 3 <<< ¹ <<< ₁<< @ < 2 <<< ³ <<< ₃ << $ < 4

3 = 1

... 0 < 1 = 3 <<< ¹ <<< ₁<< @ < 2 <<< ³ <<< ₃ << $ < 4

Notice that in these cases, everything that was primary (secondary or tertiary) greater than the position will now be primary (respectively, secondary or tertiary) greater than the operand.

Bracketing

Bracketing the position affects the placement of the operand; in effect, inserting it at a higher position so that it does not disturb the ordering of characters that are greater than the position (but not up to the level of the specified relation). That is, a primary placement will put the operand after any Z that is only secondary greater than the position; a secondary placement will put the operand after any Z that is only tertiary greater than the position.

Here are examples of position bracketing. Compare these with the simple placement operations to see how these operations don't disturb the relationship between the operand and items just after it.

3 >| 1

... 0 < 1 <<< ¹ <<< ₁<< @ < 3 < 2 <<< ³ <<< ₃ << $ < 4

3 >>| 1

... 0 < 1 <<< ¹ <<< ₁ << 3 << @ < 2 <<< ³ <<< ₃ << $ < 4

Bracketing the operand causes a number of characters to be moved at the same time. A primary placement moves X, and every character that is only secondary or tertiary greater than X to the indicated (bracketed) position. A secondary placement moves X, and every character that is only tertiary greater than X to the indicated (bracketed) position.

Here are examples of position and operand bracketing. Compare these with the simple placement operations to see how whole groups of characters are moved.

3 |>| 1

... 0 < 1 <<< ¹ <<< ₁<< @ < 3 <<< ³ <<< ₃ << $ < 2 < 4

3 |>>| 1

... 0 < 1 <<< ¹ <<< ₁ << 3 <<< ³ <<< ₃ << @ < 2 << $ < 4

Positioning Actions

In more detail, here is the logical effect of the entry on the collation table of the simple operations. (The bracketed operations are defined as above in terms of the simple operations.) As with the main algorithm, this algorithm can be recoded for efficiency as long as the results are the same.

Let minL2 and maxL2 be the minimum and maximum values of Level 2 in the collation table; similarly, let minL3 and maxL3 be the minimum and maximum values of Level 3 in the collation table.
Assume for now that the <position> is a single character, with a single collation element. Let p1 = L1(<position>), p2 = L2(<position>, p3 = L3(<position>). If <position> is a keyword, then
- for NULL use p1 = p2 = p3 = 0;
- for MAXSI use the greatest collation element with p1 = p2 = 0;
- for MAXPI use the greatest collation element with p1 = 0
- for MAXNI use the greatest collation element.
Generate new collation elements according to the following table. In each case (except "identical"), a particular level is incremented. That level is the principal level.

Placement

Action

> CE(<operand>) = [p1+1,minL2,minL3]

>> CE(<operand>) = [p1,p2+1,minL3]

>>> CE(<operand>) = [p1,p2,p3+1]

= CE(<operand>) = [p1,p2,p3]

This has the effect of inserting CE(<operand>) before, at, or after CE(<position>) in the ordering by collation elements. A double-equal (==) has the additional effect of adding a tailorings equivalent to the normalization phase of the algorithm (step 1).

Placement	Action
>	CE(<operand>) = [p1+1,minL2,minL3]
>>	CE(<operand>) = [p1,p2+1,minL3]
>>>	CE(<operand>) = [p1,p2,p3+1]
=	CE(<operand>) = [p1,p2,p3]

Adjust the weights of any other collation elements as necessary to perform the insertion. For example, suppose CE(<operand>) is inserted after CE(<position>).
- Start with the lowest CE(X) that was greater than CE(<position>).
- If CE(X) was greater than CE(<position>), then after the insertion CE(X) will be adjusted to have that same relationship to CE(<operand>): primary, secondary or tertiary greater than.
- If by this adjustment of CE(X), the relationship to a greater CE(Y) is changed, then CE(Y) will be adjusted similarly.
- This process is repeated until no adjustments are necessary.

For example, suppose a Collation Element Table has the following weights (these are from the Default Unicode Collation Element Table):



Source  0031 [.06D0.0020.0002] % DIGIT ONE

00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE

2081 [.06D0.0020.000E] % SUBSCRIPT ONE

2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE

0032 [.06D1.0020.0002] % DIGIT TWO

Source	0031 [.06D0.0020.0002] % DIGIT ONE 00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 2081 [.06D0.0020.000E] % SUBSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO

Here is the result of different entries that only differ in the strength of their relations, with the new weights shown in boldface. In each case, the changes in weights propagate until no collisions occur.

Entry

0031 >>> 00B9 % DIGIT ONE >>> SUPERSCRIPT ONE

Result

00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D0.0020.000E] % DIGIT ONE 2081 [.06D0.0020.000F] % SUBSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO

Entry	0031 >>> 00B9 % DIGIT ONE >>> SUPERSCRIPT ONE
Result	00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D0.0020.000E] % DIGIT ONE 2081 [.06D0.0020.000F] % SUBSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO

Entry

0031 >> 00B9 % DIGIT ONE >> SUPERSCRIPT ONE

Result

00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D0.0021.0002] % DIGIT ONE 2081 [.06D0.0021.000E] % SUBSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO

Entry	0031 >> 00B9 % DIGIT ONE >> SUPERSCRIPT ONE
Result	00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D0.0021.0002] % DIGIT ONE 2081 [.06D0.0021.000E] % SUBSCRIPT ONE 2776 [.06D0.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D1.0020.0002] % DIGIT TWO

Entry

0031 > 00B9 % DIGIT ONE > SUPERSCRIPT ONE

Result

00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D1.0020.0002] % DIGIT ONE 2081 [.06D1.0020.000E] % SUBSCRIPT ONE 2776 [.06D1.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D2.0020.0002] % DIGIT TWO ... 01FA [.06DA.0118.0008] % LATIN CAPITAL LETTER A WITH RING... 1E9A [.06DD.0020.0002] % LATIN SMALL LETTER A WITH RIGHT...

Entry	0031 > 00B9 % DIGIT ONE > SUPERSCRIPT ONE
Result	00B9 [.06D0.0020.000D] % SUPERSCRIPT ONE 0031 [.06D1.0020.0002] % DIGIT ONE 2081 [.06D1.0020.000E] % SUBSCRIPT ONE 2776 [.06D1.00BD.0007] % DINGBAT NEGATIVE CIRCLED DIGIT ONE 0032 [.06D2.0020.0002] % DIGIT TWO ... 01FA [.06DA.0118.0008] % LATIN CAPITAL LETTER A WITH RING... 1E9A [.06DD.0020.0002] % LATIN SMALL LETTER A WITH RIGHT...

In the last case, the primary weights have no gap up to and including 06D9, so that weight and all intervening weights are incremented.

Note:
To reduce the chance of collisions, the Default Unicode Collation Element Table has been organized so that the primary weights for letters have gaps between them.

Note:	To reduce the chance of collisions, the Default Unicode Collation Element Table has been organized so that the primary weights for letters have gaps between them.

The <operand> can be a sequence of characters. If so, then this produces a contracting mapping. For example,

0063 0068 > 0063 % ch > c
The result is that "ch" will map to a single collation element [L1(c)+1, minL2, minL3].

Inserting a character with a canonical decomposition is equivalent to inserting the decomposition itself as a sequence of characters.

If <position> has multiple collation elements, then apply the above process to the first collation element of the list, and append the remainder of the list. For example, with the relation ">>", and
CE(<position>) = [p1,p2,p3] [q1,q2,q3], the result is
CE(<operand>) = [p1,p2+1,p3] [q1,q2,q3].
If <position> has multiple characters, then it will produce an expanding mapping. Determine the resulting collation element array by concatenating the collation elements of the components. For example, take:

0078 >> 0061 0061 % x >> aa
The result is that "x" will map to two collation elements
CE(x) = [L1(a), L2(a)+1, minL3] [L1(a), L2(a), L3(a)].

"x" must map to a series of two elements so that the correct pairing of corresponding characters in two different strings is produced. For example, the above collation element mapping results in:
"baad" << "bxd" << "báad".

If <position> has multiple characters, and those characters have contractions, then multiple mappings must be taken into account. For example, suppose that "ch" is defined as a contracting mapping as above, and then a character "x" is positioned relative to "chh" in the following example:

0078 > 0063 0068 0068 % x > chh
The result is that "x" will map to two collation elements:
CE(x) = [L1(c)+2, minL2, minL3] [L1(h), L2(h), L3(h)].

Adjustments

At the end of the process of applying all of the entries to the table, the table needs to be adjusted in two ways:

if any weights had been adjusted to eliminate level separators, then they will need to be readjusted (see Reducing Sort Key Lengths).
any derived weights need to be readjusted (see Weight Derivation).

Note:

The order of the entries is very significant. For example,

0062 > 0061 % c > a 0064 > 0061 % d > a

results in the ordering a < d < c, since the d is inserted immediately after the a. This is not the same as

0064 > 0061 % d > a 0062 > 0061 % c > a

which results in the ordering a < c < d.

Note:	The order of the entries is very significant. For example, 0062 > 0061 % c > a 0064 > 0061 % d > a results in the ordering a < d < c, since the d is inserted immediately after the a. This is not the same as 0064 > 0061 % d > a 0062 > 0061 % c > a which results in the ordering a < c < d.

More detailed examples are also available in SampleTailorings.html.

Processing French Accents

French accents are represented by special Level 2 ignorables, which are in general called called backwards ignorables. Any other Level 2 ignorables are called forward ignorables.

If any backwards ignorables occur, they need to be reordered by doing the following processing after step 2 of the main algorithm:

Search through the collation element array until you find the first backwards ignorable.
Continue iterating to find the longest run of collation elements that are not forward ignorable. Backup to the last backward ignorable.
Reverse the order of the backwards ignorables within that run, maintaining the position of the collation elements that are not backward ignorables.
Continue searching.

Example:

In the following example, we show characters corresponding to the modified order of the collation elements. For clarity, characters are shown rather than the equivalent collation element array.

normalized string: c a` e" i~^ o u y' b

equivalent reordered: c a' e i o~^ u" y` b

*Example:*

In the following example, we show characters corresponding to the modified order of the collation elements. For clarity, characters are shown rather than the equivalent collation element array.
normalized string:	c a` e" i~^ o u y' b
equivalent reordered:	c a' e i o~^ u" y` b

The accents applying to the "a" are swapped with the ones applying to the "y", the ones applying to the "e" are swapped with the ones applying to the "u", and so on. Note that the order of accents applied to a particular character ("~^") are not reversed.

Preprocessing

In addition to tailoring, some implementation may choose to preprocess the text for special purposes. Once such preprocessing is done, the standard algorithm can be applied.

Examples include:

mapping "McBeth" to "MacBeth"
mapping "St." to "Street" or "Saint", depending on the context
padding digits with zeros to approximate numeric order
dropping articles, such as a or the
using extra information, such as pronunciation data for Han characters

Such preprocessing is outside of the scope of this document.

Implementation Notes

As noted above for efficiency, implementations may vary from this logical algorithm so long as they produce the same result. For example:

Reducing Sort Key Lengths

Level separators are not needed between two levels in the sort key, if the weights are properly chosen. For example, if all L3 weights are less than all L2 weights, then no level separator is needed between them. The actual numerical values chosen for weight values in the Default Data Table meet this criterion, so that by default no level separators are required between Level 1 and Level 2 or between Level 2 and Level 3 to produce proper sort keys. For example, here is a sort key with these level separators removed.

String	Sort Key
càb (0)	`0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 0000 0063 0061 0300 0062`
càb (1)	`0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002 0000 0063 0061 0300 0062`

The L2 and L3 weights commonly are small values. Where that condition occurs for all possible values, they can then be represented as single 8-bit quantities. Here is the above example with both these changes (and grouping by bytes).

String	Sort Key
càb (0)	`07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 02 00 02 00 00 00 63 00 61 03 00 00 62`
càb (1,2)	`07 06 06 D9 06 EE 20 20 21 20 02 02 02 02 00 00 00 63 00 61 03 00 00 62`

The sort key can be represented as an array of different quantities depending on the machine architecture. For example, comparisons as arrays of 32-bit quantities may be much faster on some machines. If this is done, the orginal has to be padded with zeros as appropriate.

String	Sort Key
càb (0)	`07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 01 00 02 00 00 00 63 00 61 03 00 00 62`
càb (1,2)	`070606D9 06EE2020 21200202 01020000 00630061 03000062`

Large Weight Values

If a collation sequence requires more than 65,535 weight values (or 65,024 values where zero bytes are avoided), this can still be accomodated by using multiple collation elements for a single character. For example, suppose that 50,000 UTF-16 (surrogate) private use characters are assigned in a particular implementation, and that these are to be sorted after all other characters. Simply assign them dual collation elements of the form [FFFF.FFFF.FFFF], [yyyy.zzzz.wwww]. They will then sort properly with respect to each other and to the rest of the characters.

Reducing Table Sizes

The default collation table has secondary weights that are greater than 00FF. This is the result of the derivation described in Weight Derivation. However, these values can be compacted to a range of values that don't exceed 00FF. Whenever collation elements have different primary weights, the ordering of their secondary weights is immaterial. Thus all of the secondaries that share a single primary can be renumbered to a contiguous range without affecting the resulting order.

For example, for the primary value 0820 (for the letter O), there are 31 distinct secondary values ranging from 0020 to 012D. These can be renumbered to the contiguous range from 0020 to 003F, which is less than 00FF.
Although the secondary and tertiary weights for the default collation table can both fit within one byte, of course, any particular tailored table could conceivably end up with secondary or tertiary weights that exceed what can be contained in a single byte. However, the same technique used for large weight values can also be used for implementations that do not want to handle more than 00FF values for a particular weight.

For example, Java collation only allows 8-bit quantities in level 2 and level 3. However, characters can be given L2 or L3 weights with greater values by using a series of two collation elements. For example, with characters requiring 2000 weights at L2, then 248 characters can be given single keys, while 1792 are given 2 collation keys (presumably the 248 are chosen to be the higher frequency characters).
An implementation can thus safely pack all of the data for a collation element into a single 32-bit quantity: 16 for the primary, 8 for the secondary and 8 for the tertiary.
Since the collation element(s) of a decomposable character in the standard table can be algorithmically generated from the decomposition, no collation elements need to be stored for decomposable characters: the collation elements can be generated on the fly. The collation elements for the Han characters (unless tailored) can also be algorithmically derived, no collation elements need to be stored for them either.

This means that only a fraction of the total number of Unicode characters needs to have an explicit collation element associated with them.
Then applying techniques such as the Two-Stage table approach described in Section 5.7 of The Unicode Standard, Version 2.0, the mapping table from characters to collation elements can both fast and small. This has very important implications for implementations.

Avoiding Zero Bytes

If the resulting sort key is to be a C-string, then zero bytes must be avoided. This can be done by:
- using the value 01 01 for the level separator instead of 00 00.
- preprocessing the weight values to avoid zero bytes, such as remapping as follows:
  - x ==> 0101 + (x / 255)*256 + (x % 255)
- Where the values are limited to 8-bit quantities (as discussed above), zero bytes are even more easily avoided by just using 01 as the level separator (where one is necessary), and mapping weights by x ==> 01 + x.
This transformation can be done to the table data, so that it does not have to be done at runtime. The normalized Unicode string, however, needs to be transformed at runtime to a sequence that does not contain null bytes (but sorts the same). A convenient way to do this is to just convert to UTF-8; other transformations can also be used.

Avoiding Normalization

Implementations that do not handle separate combining marks can map precomposed characters (such as "à") to collation elements with different Level 2 weights for the different accents. This can give the same results as long as no separate combining marks are in the string.
If combining marks and precomposed characters are both present in the data, an implementation can either prenormalize the data or can do extra processing to ensure that equivalent sequences are weighted the same when constructing sort keys.
Step 3 "Form a sort key for each string" specifies appending the normalized Unicode string to the end of the sort key. Some implementations may choose to use a fourth-level weights to get the equivalent results. This also has an advantage where zero bytes are not allowed, since these fourth-level weights can be preprocessed.

Case Comparisons

In some languages, it is common to sort lowercase before uppercase; in other languages this is reversed. Often this is more dependent on the individual concerned, and is not standard across a single language. It is strongly recommended that implementations provide parameterization that allow uppercase to be sorted before lowercase, and provide information as to the standard (if any) for particular countries. This can easily be done to the Default Unicode Collation Element Table before tailoring by remapping the L3 weights. It can be done after tailoring by finding the case pairs and swapping the collation elements.

Incremental Comparison

Implementations do not actually have to produce full sort keys. Collation elements can be generated as needed from two strings, and compared with an algorithm that produces the same results as sort keys would have.
- Generally incremental comparison is more efficient than producing full sort keys if strings are only to be be compared once and if they are generally dissimilar, since differences are caught in the first few characters without having to process the entire string.
- Generally incremental comparison is less efficient than producing full sort keys if items are to be compared multiple times.

Searching

The collation elements can also be used for searching, so that a proper native-language match is produced.
Users of the algorithm may be allowed to modify the comparison levels, thus excluding differences at Level 4 (the normalized Unicode string), at Level 3, or at Level 2. This is especially useful for searching, but can also apply to comparison.
Excluding differences at Level 4 has the effect of ignoring special characters when searching. Excluding differences at Level 3 has the effect of ignoring case and compatibility format distinctions between letters when searching. Excluding differences at Level 2 has the effect of ignoring accentual distinctions when searching.

Catching Mismatches

Sort keys from two different tailored collations cannot be compared, since the weights may end up being rearranged arbitrarily. To catch this case, implementations can produce a hash value from the collation data, and prepend it to the sort key. Except in extremely rare circumstances, this will distinguish the sort keys. The implementation then has the opportunity to signal an error.

Sample

Java 1.1 implements a subset of the features described in this document. For those who are interested, the main differences are:

Java doesn't use a default table in the same format: instead it always uses a tailoring syntax. Here is a comparison for the positioning entries:

Unicode Collation Algorithm Tailoring Java

x > y & y < x

x >> y & y ; x

x >>> y & y , x

x = y & y = x

Unicode Collation Algorithm Tailoring	Java
x > y	& y < x
x >> y	& y ; x
x >>> y	& y , x
x = y	& y = x

Entries do not need to be separated by newlines. Characters can be specified directly, instead of using their hexadecimal Unicode values. In addition, wherever you have rules of the form "x < y & y < z", you can omit "& y", leaving just "x < y < z". This can be done successively, so the following are equivalent:

Unicode Collation Algorithm Tailoring	Java
0041 >>> 0061 % A >>> a 00E0 >> 0041 % à >> A 00C0 >>> 00E0 % À >>> à 0042 > 00C0 % b > À 0062 >>> 0042 % B >>> b	a, A ; à, À < b, B

Currently, variable weights must be specified explicitly, as must the ordering between upper and lowercase. French secondaries can only be turned on for all secondaries or for none. Rearrangement is not yet active. Bracketing positioning is not available: simple positioning is used instead.

Please send comments to Ken Whistler and Mark Davis

Revision	1.0
Authors	Mark Davis and Ken Whistler
Date	1997-03-30
This Version	http://www.unicode.org/unicode/reports/tr10/index.html
Previous Version	none
Latest Version	same as this version

input string:	`0E01 0E40 0E02 0E03`
normalized string:	`0E01 0E02 0E40 0E03`

Draft Unicode Technical Report #10 Unicode® Collation Algorithm