[Unicode]  Unicode Collation Algorithm
 

UCA Auxiliary Files

Version 6.2.0
2012-09-12

The files in this directory provide remapping and tailoring data for the UCA Version 6.2.0 DUCET weights, for use with CLDR. These files are large, and thus packaged in zip format to save download time.

Comments with DUCET-style weights in files other than allkeys_CLDR.txt use the weights defined in allkeys_CLDR.txt.


CLDR Tailoring

Starting from version 1.9, CLDR allowed the tailoring of the root locale, allowing it to be different from DUCET.

The ordering in the root locale is used by all other locales by default. However, there is separate collation tailoring also in root, with the keyword “ducet”, that tailors the modified DUCET back to the original, if required. Using that keyword, the locale ID “und-u-co-ducet” allows access to the original DUCET.

The root locale ordering is tailored in the following ways:

Grouping classes of characters

As of Version 6.1.0, the DUCET puts characters into the following ordering:

(There are a few exceptions to this general ordering.)

The CLDR root locale modifies the DUCET tailoring by ordering the common characters more strictly by category:

What the regrouping allows is for users to parametrically reorder the groups. For example, users can reorder numbers after all scripts, or reorder Greek before Latin.

The relative order within each of these groups still matches the DUCET. Symbols, punctuation, and numbers that are grouped with a particular script stay with that script. The differences between CLDR and the DUCET order are:

  1. CLDR groups the numbers together after currency symbols, instead of splitting them with some before and some after. Thus the following are put after currencies and just before all the other numbers.

    U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE
    ...
    U+1D371 ( 𝍱 ) [No] COUNTING ROD TENS DIGIT NINE

  2. CLDR handles a few other characters differently
    1. U+10A7F ( 𐩿 ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is put with punctuation, not symbols
    2. U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc] RIAL SIGN are put with currency signs, not with R and REH.

Non-variable symbols

There are multiple Variable-Weighting options in the UCA for symbols and punctuation, including non-ignorable and shifted. With the shifted option, almost all symbols and punctuation are ignored—except at a fourth level. The CLDR root locale ordering is modified so that symbols are not affected by the shifted option. That is, by default, symbols are not “variable” in CLDR. So shifted only causes whitespace and punctuation to be ignored, but not symbols (like ♥). The DUCET behavior can be specified with a locale ID using the "vt" keyword, to set the Variable section to include all of the symbols below it, or be set parametrically where implementations allow access.

See also:

Additional contractions for Tibetan

Ten contractions are added for Tibetan: Two to fulfill well-formedness condition 5, and eight more to preserve the default order for Tibetan. For details see UTS #10, Section 3.6.4, Well-Formedness of the DUCET.

Tailored noncharacter weights

The code point U+FFFF is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent.

The code point U+FFFE produces a CE with special minimal weights on all levels, regardless of alternate handling. This allows for Merging Sort Keys within code point space. For example, when sorting names in a database, a sortable string can be formed with last_name + '\uFFFE' + first_name. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name.

In CLDR, so as to maintain the highest and lowest status, these values are not further tailorable, and nothing can tailor to them. That is, neither can occur in a collation rule: for example, the following rules are illegal:

& \uFFFF < x

& x <\uFFFF


File Formats

The file formats may change between versions of the UCA. The formats for Version 6.0.0 and beyond are as follows. As usual, text after a # is a comment.

FractionalUCA_summary.txt

The lines are all comments, giving an overview of the weight structure. Lines are tab delimited, for use with spreadsheets. Most are in the format:

# position fractional-weight Script General-category Codepoint Name

where position is "First" or "Last"

Example

# First 03 05 Zyyy Cc U+0009 <CHARACTER TABULATION>
# Last 03 15 Zyyy Zp U+2029 PARAGRAPH SEPARATOR

The topmost variable weight is marked with the following format:

[variable top = 0C 32 04] # END OF VARIABLE SECTION!!!

The purpose of other lines is given by comments.

FractionalUCA.txt

The format is illustrated by the following sample lines, with commentary afterwards.

[UCA version = 6.0.0]

Provides the version number of the UCA table.

0000; [,,] # Zyyy Cc [0000.0000.0000] * <NULL>

Provides a weight line. The first element (before the ";") is a hex codepoint sequence. The second field is a sequence of collation elements. Each collation element has 3 parts separated by commas: the primary weight, secondary weight, and tertiary weight. The tertiary weight actually consists of two components: the top two bits (0xC0) are used for the case level, and should be masked off where a case level is not used.

A weight is either empty (meaning a zero or ignorable weight) or is a sequence of one or more bytes. The bytes are interpreted as a "fraction", meaning that the ordering is 04 < 05 05 < 06. The weights are constructed so that no weight is an initial subsequence of another: that is, having both the weights 05 and 05 05 is illegal. The above line consists of all ignorable weights.

The vertical bar (“|”) character is used to indicate context, as in:

006C | 00B7; [, DB A9, 05]
This example indicates that if U+00B7 appears immediately after U+006C, it is given the corresponding collation element instead. This syntax is equivalent to the following contraction, but is more efficient.
006C 00B7; CE(006C) [, DB A9, 05]
Single-byte primary weights are given to particularly frequent characters, such as space, digits, and a-z. Most characters are given two-byte weights, while relatively infrequent characters are given three-byte weights. For example:

...
0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION>
...
1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG
...
0031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE

The assignment of 2 vs 3 bytes does not reflect importance, or exact frequency.

# SPECIAL MAX/MIN COLLATION ELEMENTS

FFFE; [02, 02, 02] # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges

The two tailored noncharacters have their own weights.

# SPECIAL FINAL VALUES for Script Reordering

FDD0 0042; [05 FE, 05, 05] # Special final value for reordering token
FDD0 0043; [0C FE, 05, 05] # Special final value for reordering token

There are special values assigned to code point sequences FDD0+X. These sequences are simply used to communicate special values, and can be eliminated. For the reordering values, the purpose is to make sure that there is a "high" weight at the end of each reordering group.

...
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE

The DUCET has some weights that don't correspond directly to a character. To allow for implementations to have a character associated with each weight (necessary for certain implementations of tailoring), this requires the construction of special sequences for those weights.

Next in the first a number of tables are defined. The function of each of the tables is summarized afterwards.

# VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...

This table summarizes ranges of important groups of characters for implementations.

# Top Byte => Reordering Tokens
[top_byte 00 TERMINATOR ] # [0] TERMINATOR=1
[top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1
[top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1
[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...

This table maps from the first bytes of the fractional weights to a reordering token. The format is "[top_byte " byte-value reordering-token "COMPRESS"? "]". The "COMPRESS" value is present when there is only one byte in the reordering token, and primary-weight compression can be applied. Most reordering tokens are script values; others are special-purpose values, such as PUNCTUATION.

# Reordering Tokens => Top Bytes
[reorderingTokens Arab 61=910 62=910 ]
[reorderingTokens Armi 7A=22 ]
[reorderingTokens Armn 5F=82 ]
[reorderingTokens Avst 7A=54 ]
...

This table is an inverse mapping from reordering token to top byte(s). In terms like "61=910", the first value is the top byte, while the second is informational, indicating the number of primaries assigned with that top byte.

# General Categories => Top Byte
[categories Cc 03{SPACE}=6 ]
[categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...

This table is informational, providing the top bytes, scripts, and primaries associated with each general category value.

# FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

The final table gives certain hard-coded byte values. The "trail" area is provided for implementation of the "trailing weights" as described in the UCA.

allkeys_CLDR.txt

When different from the DUCET table, represents the tailoring used in CLDR as described under CLDR Tailoring.

The format is similar to that of allkeys.txt, although there may be some differences in whitespace.

UCA_Rules.txt

The format for this file uses the CLDR "short" format for collation tailoring. Some examples of the format are provided below.

< 𝍱 # 5.0 [No] [1499.0020.0002] U+1D371 COUNTING ROD TENS DIGIT NINE
< 0 # 1.1 [Nd] [149A.0020.0002] U+0030 DIGIT ZERO

The ASCII ZERO is primary-greater than U+1D371 COUNTING ROD TENS DIGIT

<<< 0 # 1.1 [Nd] [149A.0020.0003] U+FF10 FULLWIDTH DIGIT ZERO

The fullwidth ZERO is tertiary-after the ASCII ZERO.

<<< 🄁 / ',' # 5.2 [Nd/No] [149A.0020.0004] [011F.0020.0004] U+1F101 DIGIT ZERO COMMA / 002C

The U+1F101 DIGIT ZERO COMMA is tertiary-greater than the fullwidth ZERO followed by U+002C (comma)

= 𝟘 # 3.1 [Nd] [149A.0020.0005] U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO

The 05] U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO is primary, secondary, and tertiary equal to U+1F101 DIGIT ZERO COMMA

For more information about this format for collation tailoring, see LDML.

UCA_Rules.xml

Provides the same rule set in CLDR XML format. For more information, see LDML.

UCA_Rules_NoCE.txt

Omits the (remapped) DUCET value, for more effective comparison across versions.


Implicit Fractional Weight Generation

In the UCA, implicit weights are given to characters such as CJK ideographs and unassigned code points. The implicit fractional weights in FractionalUCA.txt have a different generation algorithm. It is important to match the exact algorithm so that the weights for compatibility decomposable characters match.

The fractional weights for implicit collation elements are between 3 and 4 bytes long.

The algorithm for implicit fractional weight generation is expressed here in pseudocode:

int getImplicit(int cp) {
 cp += swapCJK(cp) + 1;
 int last0 = cp - min4Boundary;
 if (last0 < 0) {
   int last1 = cp / final3Count;
   last0 = cp % final3Count;
   int last2 = last1 / medialCount;
   last1 %= medialCount;
   last0 = minTrail + last0*final3Multiplier; // spread out, leaving gap at start
   last1 = minTrail + last1; // offset
   last2 = min3Primary + last2; // offset
   return (last2 << 24) + (last1 << 16) + (last0 << 8);
 } else {
   int last1 = last0 / final4Count;
   last0 %= final4Count;
   int last2 = last1 / medialCount;
   last1 %= medialCount;
   int last3 = last2 / medialCount;
   last2 %= medialCount;
   last0 = minTrail + last0*final4Multiplier; // spread out, leaving gap at start 
   last1 = minTrail + last1; // offset
   last2 = minTrail + last2; // offset
   last3 = min4Primary + last3; // offset
   return (last3 << 24) + (last2 << 16) + (last1 << 8) + last0;
 }
}
static public int divideAndRoundUp(int a, int b) {
   return 1 + (a-1)/b;
}
// The constants for the boundaries of CJK characters vary from release to release. // They are of the form CJK*BASE and CJK*LIMIT, and may be expanded to new blocks as well.
int 
CJK_BASE = 0x4E00,
CJK_LIMIT = 0x9FCC+1,
CJK_COMPAT_USED_BASE = 0xFA0E,
CJK_COMPAT_USED_LIMIT = 0xFA2F+1,
CJK_A_BASE = 0x3400,
CJK_A_LIMIT = 0x4DB5+1,
CJK_B_BASE = 0x20000,
CJK_B_LIMIT = 0x2A6D6+1,
CJK_C_BASE = 0x2A700,
CJK_C_LIMIT = 0x2B734+1,
CJK_D_BASE = 0x2B740,
CJK_D_LIMIT = 0x2B81D+1,
MAX_INPUT = 0x220001,
NON_CJK_OFFSET = 0x110000,
min3Primary = 0xE0,
max4Primary = 0xE4,
minTrail = 0x04,
maxTrail = 0xFE,
gap3 = 1,
primaries3count = 1,
final3Multiplier = gap3 + 1,
final3Count = (maxTrail - minTrail + 1) / final3Multiplier,
max3Trail = minTrail + (final3Count - 1) * final3Multiplier,
medialCount = (maxTrail - minTrail + 1),
threeByteCount = medialCount * final3Count,
primariesAvailable = max4Primary - min3Primary + 1,
primaries4count = primariesAvailable - primaries3count,
min3ByteCoverage = primaries3count * threeByteCount,
min4Primary = min3Primary + primaries3count,
min4Boundary = min3ByteCoverage,
 
totalNeeded = MAX_INPUT - min4Boundary,
neededPerPrimaryByte = divideAndRoundUp(totalNeeded, primaries4count),
neededPerFinalByte = divideAndRoundUp(neededPerPrimaryByte, medialCount * medialCount),
gap4 = (maxTrail - minTrail - 1) / neededPerFinalByte,
final4Multiplier = gap4 + 1,
final4Count = neededPerFinalByte;
int swapCJK(int i) {
   if (i >= CJK_BASE) {
   if (i < CJK_LIMIT)              return i - CJK_BASE;
   if (i < CJK_COMPAT_USED_BASE)   return i + NON_CJK_OFFSET;
   if (i < CJK_COMPAT_USED_LIMIT)  return i - CJK_COMPAT_USED_BASE 
                                          + (CJK_LIMIT - CJK_BASE);
   if (i < CJK_B_BASE)             return i + NON_CJK_OFFSET;
   if (i < CJK_B_LIMIT)            return i; // non-BMP-CJK
   if (i < CJK_C_BASE)             return i + NON_CJK_OFFSET;
   if (i < CJK_C_LIMIT)            return i; // non-BMP-CJK
   if (i < CJK_D_BASE)             return i + NON_CJK_OFFSET;
   if (i < CJK_D_LIMIT)            return i; // non-BMP-CJK
     return i + NON_CJK_OFFSET;  // non-CJK
   }
   if (i < CJK_A_BASE)             return i + NON_CJK_OFFSET;
   if (i < CJK_A_LIMIT)            return i - CJK_A_BASE
                                          + (CJK_LIMIT - CJK_BASE) 
                                          + (CJK_COMPAT_USED_LIMIT - CJK_COMPAT_USED_BASE);
     return i + NON_CJK_OFFSET; // non-CJK
   }