CLDR Ticket #5962 (closed task: fixed)
specify CLDR collation algorithm
Reported by: markus | Owned by: markus
We should have a section in the LDML spec that defines the CLDR collation algorithm.
CLDR mostly uses the Unicode Collation Algorithm (UCA), but it adds the prefix (context before) mechanism. The prefix mechanism is mentioned in the tailoring section, but we need to specify how it fits into the collation algorithm.
Prefix matching itself should be a simple longest-match algorithm (op|c wins over p|c). We should recommend or require that both the prefix and the prefixed character-or-string have an NFC boundary before them. (In op|ch, both o and c should be starters (ccc=0) with NFC_QC=Yes.) This prevents issues with canonical reordering and avoids the possibility of discontiguous prefix matching (unlike discontiguous contractions, which are required for the UCA). Prefix matching is thus always contiguous.
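The longest-match rule for prefixes can be illustrated with a minimal Python sketch. The function name `longest_prefix` and the representation of prefixes as a plain set of strings are assumptions made for illustration only; they are not part of CLDR or of any implementation. The sketch relies on prefix matching being contiguous, as proposed above.

```python
def longest_prefix(text, index, prefixes):
    """Return the longest prefix that ends right before text[index], or "" if none.

    `prefixes` is a hypothetical set of prefix strings taken from the rules.
    Matching is contiguous: the prefix must be the text immediately before index.
    """
    best = ""
    for p in prefixes:
        # str.endswith(p, 0, index) tests whether text[0:index] ends with p.
        if len(p) > len(best) and text.endswith(p, 0, index):
            best = p
    return best
```

For example, with the prefixes "p" and "op", the call `longest_prefix("opc", 2, {"p", "op"})` selects "op", which is why op|c wins over p|c.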
We need to define how prefixes interact with contractions. I propose that mappings with prefixes have precedence, and that prefixes should be matched first. This keeps them reasonably implementable: when we have a mapping with both a prefix and a contraction suffix (as in Japanese: ぐ|ゞ), the matching needs to go in both directions. The contraction might involve discontiguous matching, which needs complex text iteration and handling of skipped combining marks (or rewriting of the text, as in the UCA). Prefix matching should come first because it is contiguous and therefore simple. Once the prefix is matched, we can return to the original text index (right after the prefix) and look at all of the contractions for that prefix.
If there is a mapping for p|c where c is a single character, and we collate text "...pc...", then the p|c mapping should win over any contractions that start with c but do not have the prefix.
Consider that we have the following mappings:
1. p → CE(p)
2. h → CE(h)
3. c → CE(c)
4. ch → CE(d)
5. p|c → CE(u)
6. p|ci → CE(v)
This should collate text like this:
- pc → CE(p)CE(u)
- pci → CE(p)CE(v)
- pch → CE(p)CE(u)CE(h)
However, if the mapping p|c → CE(u) is missing, then text pch should map to CE(p)CE(d).
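The expected results above can be reproduced with a simplified Python sketch of the proposed precedence. Everything here is an assumption for illustration: the `RULES` table keyed by (prefix, string), the function name `collation_elements`, and the use of string labels in place of real collation elements. The sketch handles only contiguous matching and ignores discontiguous contractions and normalization.

```python
# Hypothetical rule table: (prefix, string) -> collation element label.
# An empty prefix "" means the mapping has no prefix.
RULES = {
    ("", "p"): "CE(p)",
    ("", "h"): "CE(h)",
    ("", "c"): "CE(c)",
    ("", "ch"): "CE(d)",
    ("p", "c"): "CE(u)",
    ("p", "ci"): "CE(v)",
}


def collation_elements(text, rules):
    """Map text to CE labels, matching prefixes first, then contractions."""
    result = []
    i = 0
    while i < len(text):
        # Prefixes that match contiguously right before index i, longest first.
        # The empty prefix always matches, so it is tried last.
        prefixes = sorted(
            {p for (p, _) in rules if text.endswith(p, 0, i)},
            key=len,
            reverse=True,
        )
        for prefix in prefixes:
            # All strings mapped under this prefix that match at index i.
            candidates = [
                s for (p, s) in rules if p == prefix and text.startswith(s, i)
            ]
            if candidates:
                best = max(candidates, key=len)  # longest contraction wins
                result.append(rules[(prefix, best)])
                i += len(best)
                break
        else:
            raise ValueError(f"no mapping for {text[i]!r} at index {i}")
    return result
```

Calling `collation_elements("pch", RULES)` takes p|c at index 1 and never considers the unprefixed contraction ch. If the ("p", "c") entry is removed from the table, the p prefix still matches, but no string under it matches "ch", so the sketch falls back to the unprefixed mappings and picks ch → CE(d).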
We should say something about prefix matching in text that was subject to earlier discontiguous contraction matching. An implementation that rewrites the text, as in the UCA, will get different prefix-matching results on the rewritten text than an implementation that performs discontiguous contraction matching by other (more efficient) means. I suggest we document this, and say that the difference would occur only with unusual combinations of contractions, prefix rules, and input text.
We should also mention examples where prefixes are used: in the root collator for the middle dot preceded by "l", and in Japanese for length and iteration marks.
- Owner changed from anybody to markus
- Priority changed from assess to major
- Status changed from new to assigned
- Milestone changed from UNSCH to 24final