CLDR 1.9 Collation Changes

Public Review Issue #175: CLDR 1.9 Collation Changes

The CLDR committee is making Unicode locale-sensitive collation a major focus for the next release, CLDR 1.9, and welcomes feedback on the planned changes. To comment on any of these actions, please add a comment to the relevant ticket, or file a new ticket at http://unicode.org/cldr/trac/newticket. The current list of CLDR tickets is at http://unicode.org/cldr/trac/report/30; more tickets may be added to this list over time. The planned changes include:

Modifying the tailoring for many languages. These tailoring changes include:
- Changing the default collation order for particular languages to use a different variant in the CLDR data
- Removing an unused collation variant
- Modifying the collation sequence for the language in other ways
Basing Pinyin and radical-stroke collations on Unihan data. Draft rules are in http://www.unicode.org/review/pr-175/ and may be updated during the public review period. They include collations for pinyin, stroke, and radical-stroke; a pinyin transliteration is also included for comparison. Some data sources besides Unihan are also used.
Removing “backwards secondaries” from default French collation. Users will still be able to set this option parametrically or via locale keywords (such as “fr-u-kb-true”) when using French (or other languages); the only change is that this option will no longer be the default for French.
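To illustrate what the “backwards secondaries” option changes, here is a minimal Python sketch. It is a toy model, not the UCA: it assumes that base letters stand in for primary weights and that the code points of combining marks stand in for secondary weights. The function name french_toy_key and its backwards parameter are hypothetical.

```python
import unicodedata

def french_toy_key(s, backwards=False):
    """Toy two-level sort key (assumption: not real UCA weights).

    Primary level: base letters, case-folded.
    Secondary level: the combining marks attached to each base letter.
    With backwards=True the secondary level is compared from the end
    of the string, which is what "backwards secondaries" means.
    """
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if secondary:
                # Attach the accent to the preceding base letter.
                secondary[-1] = secondary[-1] + (ord(ch),)
        else:
            primary.append(ch.lower())
            secondary.append(())
    if backwards:
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "coté", "côte", "cote"]

# Forward secondaries: the leftmost accent difference decides.
forward = sorted(words, key=french_toy_key)
# Backwards secondaries: the rightmost accent difference decides.
backward = sorted(words, key=lambda w: french_toy_key(w, backwards=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

In this sketch the two settings differ only in where “coté” and “côte” fall relative to each other; the primary (letter) order is unaffected.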
Scripts and certain other categories of characters (whitespace, currency symbols, punctuation, most numbers, other symbols) will be parametrically reorderable. For example, the rules for Greek would be able to specify that the sorting order is:
- punctuation < Greek letters < numbers < currency symbols < Latin letters < other scripts and characters.
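The Greek example above can be sketched in Python as follows. This is only an illustration of the idea of reordering whole groups; it assumes a coarse classification based on Unicode general categories and character names, and it uses code points as a stand-in for real collation weights within a group. The names GREEK_ORDER, group_of, and reorder_key are hypothetical and are not CLDR rule syntax.

```python
import unicodedata

# Group order from the Greek example (labels are illustrative).
GREEK_ORDER = ["punctuation", "Greek", "digits", "currency", "Latin", "other"]

def group_of(ch):
    """Coarsely classify a character into one of the reorderable groups."""
    cat = unicodedata.category(ch)
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Nd":
        return "digits"
    if cat == "Sc":
        return "currency"
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "Greek"
    if name.startswith("LATIN"):
        return "Latin"
    return "other"

def reorder_key(s, order=GREEK_ORDER):
    """Sort key: group rank first, then code point within the group."""
    rank = {group: i for i, group in enumerate(order)}
    return [(rank[group_of(ch)], ord(ch)) for ch in s]

items = ["a", "α", "1", "$", "!"]
print(sorted(items, key=reorder_key))  # ['!', 'α', '1', '$', 'a']
```

Passing a different list as the order argument gives a different group ordering without touching the order of characters inside each group, which is the sense in which the reordering is parametric.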
Collation rules will also allow an “import” statement, allowing for the default European Ordering Rules to be used as a basis for languages of the European Union.
The code point U+FFFF will be tailored to have a weight higher than all other characters, and further tailoring of U+FFFF will be disallowed for other collation variants. This allows reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”.
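The range idiom can be sketched in Python as below. Note the assumption: Python compares strings by code point rather than by collation weight, so the sketch only mirrors the idea for text within the Basic Multilingual Plane, where U+FFFF is the highest code point. With a real collator, the bounds would be compared using collation keys instead. The helper name prefix_range is hypothetical.

```python
from bisect import bisect_left, bisect_right

def prefix_range(sorted_items, prefix):
    """Return all items X with prefix <= X <= prefix + U+FFFF.

    Assumes sorted_items is sorted with the same comparison that is
    used here (plain code point order in this sketch).
    """
    lo = bisect_left(sorted_items, prefix)
    hi = bisect_right(sorted_items, prefix + "\uffff")
    return sorted_items[lo:hi]

names = sorted(["Seidel", "Schulz", "Sachs", "Schäfer", "Scott", "Schmidt"])
print(prefix_range(names, "Sch"))  # ['Schmidt', 'Schulz', 'Schäfer']
```

The upper bound works because no string that starts with “Sch” can sort after “Sch” followed by the highest-weighted character, which is the property the planned tailoring of U+FFFF is meant to guarantee.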
CLDR is planning to use a tailored UCA DUCET (Default Unicode Collation Element Table) in the root locale. This will be inherited by all other locales by default. However, there will be a separate collation also in root, with the keyword “ducet”. Using that keyword, the locale ID “und-u-co-ducet” will allow access to the original DUCET. The root locale ordering will be modified in the following ways:
- Punctuation will be grouped together, below symbols and above whitespace. The relative order of the punctuation matches the DUCET. This grouping only matters where a punctuation mark in one string is compared to a symbol in another, e.g., “I♥NY” vs. “I-NY”.
- The UCA offers two options for symbols and punctuation: non-ignorable or shifted. With the shifted option, symbols and punctuation are ignored, except at a fourth level. The default setting for CLDR will be modified so that symbols are not affected by the shifted option: shifted will cause only controls, spaces, and punctuation to be ignored, not symbols (like ♥). The old behavior can be specified with a locale ID such as “fr-u-vt-1D371”, which sets the Variable section to include all of the symbols below it, or it can be set parametrically where implementations allow access. See also:
  - http://www.unicode.org/reports/tr35/tr35-16.html#Key_Type_Definitions under “Collation parameters”
  - http://www.unicode.org/charts/collation/
- In the DUCET, almost all currency symbols are grouped together before numbers. For the tailored UCA DUCET, the two exceptions to this pattern are also moved into this group:
  - U+20A8 ( ₨ ) RUPEE SIGN
  - U+FDFC ( ﷼ ) RIAL SIGN
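
The change to the shifted option described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions: it looks only at the primary level (the fourth level is not modeled), it approximates "controls, spaces, punctuation" and "symbols" using Unicode general categories rather than DUCET weights, and the names is_variable and shifted_primary_key are hypothetical.

```python
import unicodedata

def is_variable(ch, symbols_variable):
    """Approximate whether a character is ignored under "shifted"."""
    cat = unicodedata.category(ch)
    if cat == "Cc" or cat.startswith("Z") or cat.startswith("P"):
        return True  # controls, spaces, punctuation
    if symbols_variable and cat in ("Sm", "Sk", "So"):
        return True  # general symbols, old behavior only
    return False

def shifted_primary_key(s, symbols_variable=False):
    """Primary-level key with variable characters dropped (toy model)."""
    return [ch.lower() for ch in s if not is_variable(ch, symbols_variable)]

# Planned CLDR default: the hyphen is ignored, the heart is not.
print(shifted_primary_key("I-NY") == shifted_primary_key("INY"))   # True
print(shifted_primary_key("I♥NY") == shifted_primary_key("INY"))   # False

# Old behavior: symbols are ignored as well.
print(shifted_primary_key("I♥NY", symbols_variable=True)
      == shifted_primary_key("INY", symbols_variable=True))        # True
```

In this model, “I♥NY” and “I-NY” compare equal at the primary level only under the old behavior; under the planned default the symbol still contributes to the comparison.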