CLDR Ticket #5962 (closed task: fixed)

Opened 2 years ago

Last modified 20 months ago

specify CLDR collation algorithm

Reported by: markus
Owned by: markus
Component: xxx-spec
Version: svn
Load:
Data Locale:
Phase:
Review: mark
Weeks: 0.1
Data Xpath:


We should have a section in the LDML spec that defines the CLDR collation algorithm.

CLDR mostly uses the Unicode Collation Algorithm (UCA), but it adds the prefix (context before) mechanism. The mechanism is mentioned in the tailoring section, but we need to specify how it fits into the collation algorithm.

Prefix matching itself should be a simple longest-match algorithm (op|c wins over p|c). We should recommend or require that both the prefix and the prefixed character-or-string have an NFC boundary before them. (In op|ch, both o and c should be starters (ccc=0) and have NFC_QC=Yes.) This prevents issues with canonical reordering, and avoids the possibility of discontiguous prefix matching (unlike discontiguous contractions, which are required for UCA). Prefix matching is thus always contiguous.
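For illustration, the longest-match rule could be sketched as below. This is only a minimal sketch: the function name `longest_prefix` and the representation of prefixes as a plain set of strings are assumptions made for the example, and the NFC-boundary check on the prefix is not shown.

```python
def longest_prefix(text, index, prefixes):
    """Return the longest prefix that ends exactly at text[index], or "".

    Hypothetical helper: `prefixes` is a set of prefix strings taken from
    the p|c style mappings. Matching is contiguous only.
    """
    best = ""
    for p in prefixes:
        # str.endswith(p, 0, index) tests whether text[:index] ends with p.
        if len(p) > len(best) and text.endswith(p, 0, index):
            best = p
    return best


# With prefixes "p" and "op" both defined for c, "op" wins in "...opc...".
prefixes_for_c = {"p", "op"}
match_long = longest_prefix("xopc", 3, prefixes_for_c)
match_short = longest_prefix("xpc", 2, prefixes_for_c)
match_none = longest_prefix("c", 0, prefixes_for_c)
```

Here `match_long` is "op", `match_short` is "p", and `match_none` is the empty string.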

We need to define how prefixes interact with contractions. I propose that mappings with prefixes have precedence, and that prefixes should be matched first. This keeps them reasonably implementable: when we have a mapping with both a prefix and a contraction suffix (as in Japanese: ぐ|ゞ), the matching needs to go in both directions. The contraction might involve discontiguous matching, which needs complex text iteration and handling of skipped combining marks (or rewriting of the text, as in the UCA). Prefix matching should come first because it is contiguous and therefore simple. Once the prefix is matched, we can return to the original text index (right after the prefix) and look at all of the contractions for the prefix.

If there is a mapping for p|c where c is a single character, and we collate text "...pc...", then the p|c mapping should win over any contractions that start with c but do not have the prefix.

Suppose we have the following mappings:

1 p → CE(p)
2 h → CE(h)
3 c → CE(c)
4 ch → CE(d)
5 p|c → CE(u)
6 p|ci → CE(v)

This should collate text like this:

  • pc → CE(p)CE(u)
  • pci → CE(p)CE(v)
  • pch → CE(p)CE(u)CE(h)

However, if the mapping p|c → CE(u) is missing, then text pch should map to CE(p)CE(d).
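The example above can be reproduced with a small sketch. It rests on several assumptions that are not part of the proposal: mappings are held in a dict keyed by (prefix, string), CEs are represented by their labels as strings, only contiguous matching is done, and a character without any mapping gets a placeholder label. The name `collate` is hypothetical. When no string matches under the longest prefix, the sketch falls back to shorter prefixes and finally to the mappings without a prefix.

```python
def collate(text, mappings):
    """Map text to a list of CE labels, trying prefixed mappings first."""
    out = []
    i = 0
    while i < len(text):
        # Prefixes (including the empty one) that end right before i,
        # longest first, so that prefixed mappings take precedence.
        prefixes = sorted(
            {p for (p, s) in mappings if text.endswith(p, 0, i)},
            key=len, reverse=True)
        for p in prefixes:
            # Strings mapped under this prefix that start at i, longest first.
            strings = sorted(
                (s for (q, s) in mappings if q == p and text.startswith(s, i)),
                key=len, reverse=True)
            if strings:
                out.append(mappings[(p, strings[0])])
                i += len(strings[0])  # the prefix itself is not consumed
                break
        else:
            # Assumption: placeholder for a character with no mapping.
            out.append("CE(" + text[i] + ")")
            i += 1
    return out


MAPPINGS = {
    ("", "p"): "CE(p)",
    ("", "h"): "CE(h)",
    ("", "c"): "CE(c)",
    ("", "ch"): "CE(d)",
    ("p", "c"): "CE(u)",
    ("p", "ci"): "CE(v)",
}
# The same table without the p|c mapping.
WITHOUT_PC = {k: v for k, v in MAPPINGS.items() if k != ("p", "c")}

result_pc = collate("pc", MAPPINGS)
result_pci = collate("pci", MAPPINGS)
result_pch = collate("pch", MAPPINGS)
result_pch_fallback = collate("pch", WITHOUT_PC)
```

With the full table, the sketch yields the three results listed above. With p|c removed, "pch" falls back to the ch contraction and yields CE(p)CE(d).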

We should say something about prefix matching in text that was subject to earlier discontiguous contraction matching. An implementation that rewrites the text, as in the UCA, will get different results in prefix matching on the rewritten text compared to an implementation that performs discontiguous contraction matching by other (more efficient) means. I suggest we document this, and say that this would occur only with unusual combinations of contractions, prefix rules, and input text.

Mention examples where prefixes are used: In the root collator for the middle dot preceded by "l", and in Japanese for length and iteration marks.


Change History

comment:1 Changed 2 years ago by markus

Discontiguous contraction vs. prefix match: If there is a (weird) contraction of <304f, 0308> and text <304f, 3099, 0308, 309d, 3099>, and the implementation rewrites the text to match the weird contraction, then the U+0308 is removed and the prefix of ぐ|ゞ would match. If an implementation does not rewrite the text, then that prefix would not match.
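A rough way to see the difference, assuming that "rewriting" amounts to deleting U+0308 from its position before the later prefix lookup (a simplification of what a UCA-style implementation does). The helper `prefix_matches` is hypothetical and only checks contiguous text immediately before the first ゞ.

```python
import unicodedata

# The text from the comment above: <304f, 3099, 0308, 309d, 3099>
text = "\u304f\u3099\u0308\u309d\u3099"

# NFD forms of the prefix and the prefixed string in the ぐ|ゞ mapping.
prefix = unicodedata.normalize("NFD", "\u3050")  # ぐ
target = unicodedata.normalize("NFD", "\u309e")  # ゞ


def prefix_matches(s):
    """True if `prefix` sits contiguously right before the first `target`."""
    i = s.find(target)
    return i >= 0 and s.endswith(prefix, 0, i)


# Without rewriting, U+0308 stands between the prefix and the iteration mark.
unrewritten = prefix_matches(text)
# With U+0308 removed, the prefix is directly adjacent.
rewritten = prefix_matches(text.replace("\u0308", "", 1))

# Combining classes of the two marks involved in the discontiguous match.
ccc_3099 = unicodedata.combining("\u3099")
ccc_0308 = unicodedata.combining("\u0308")
```

In this sketch `unrewritten` is False and `rewritten` is True, which is the discrepancy described above.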

comment:2 Changed 2 years ago by emmons

  • Owner changed from anybody to markus
  • Priority changed from assess to major
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 24final

comment:3 Changed 2 years ago by markus


U+FFFE maps to the lowest weight on all levels, or equivalent. This requires code for some levels (case, quaternary, identical), not just data. Its primary weight is not "variable": U+FFFE must not become ignorable under alternate handling.

comment:4 Changed 2 years ago by markus

  • Status changed from assigned to accepted

In my ICU collv2 branch I implemented the algorithm as proposed, with a prefix match taking precedence over a contraction match. Note that this also means that the match that occurs earlier in the text takes precedence over a later match. (Although prefixes still do not behave quite like contractions because prefixes do not "consume" text.)

I wrote this test case:

** test: no mapping p|c, falls back to contraction ch, CLDR ticket 5962
@ rules
&d=ch &v=p|ci
* compare
<1 pc
<3 pC
<1 pcH
<1 pcI
<1 pd
=  pch
<3 pD
<1 pv
=  pci
<3 pV

comment:5 Changed 23 months ago by markus

  • Review set to mark

comment:6 Changed 20 months ago by mark

  • Status changed from accepted to closed
  • Resolution set to fixed
