CLDR Ticket #5909 (closed defect: fixed)
bad semantics of reset on expansion
Reported by: | markus | Owned by: | markus |
---|---|---|---|
Component: | xxx-spec | Data Locale: | |
Phase: | | Review: | emmons |
Weeks: | 0.2 | Data Xpath: | |
Xref: |
Description (last modified by markus)
Lots of details in IcuBug:9593, some more in IcuBug:9415.
Problem summary
- Only the first CE of the first reset character/contraction is propagated.
- After that first unit, the remainder is propagated as text, not as CEs.
- The remainder text is propagated only up to the first primary difference.
Proposal
- Propagate all of the CEs of the reset to the tailored items.
- Propagate them as CEs, not as text. This will fix &l· for example. It is also simpler.
- Propagate across primary differences too. If the first CE is variable, as in &⒇=u, then the tailored item still sorts like the reset position under "ignore punctuation", rather than becoming primary-different.
I was unsure about what to do with primary differences, and proposed to document different approaches and their effects.
The CLDR-TC agreed to this in today's meeting.
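As a rough illustration of the proposal for the &⒇=u case, the sketch below copies all CEs of the reset onto the tailored item and then builds primary keys with and without "ignore punctuation". Everything in it is a toy assumption: the weights, the `VARIABLE_TOP` threshold, the helper names, and the simplification that ⒇ expands to the CEs of "(20)". It is not ICU or CLDR code.

```python
# Toy model: a CE is (primary, secondary, tertiary). All weights are made up.
TOY_CES = {
    "(": [(0x05, 0x20, 0x02)],  # treated as variable (punctuation)
    "2": [(0x30, 0x20, 0x02)],
    "0": [(0x2E, 0x20, 0x02)],
    ")": [(0x06, 0x20, 0x02)],  # treated as variable (punctuation)
}
VARIABLE_TOP = 0x0F  # hypothetical: primaries at or below this are variable


def ces_for(text):
    """Concatenate the toy CEs of each character in text."""
    out = []
    for ch in text:
        out.extend(TOY_CES[ch])
    return out


def tailor_equal(reset_text):
    """&reset=item under the proposal: the item gets ALL CEs of the reset,
    stored as CEs rather than as text."""
    return list(ces_for(reset_text))


def primary_key(ces, ignore_punctuation=False):
    """Primary weights only; variable CEs are dropped when ignoring punctuation."""
    key = []
    for primary, _secondary, _tertiary in ces:
        if primary == 0:
            continue
        if ignore_punctuation and primary <= VARIABLE_TOP:
            continue
        key.append(primary)
    return key


reset_ces = ces_for("(20)")    # stand-in for the expansion of the reset
u_ces = tailor_equal("(20)")   # tailored item

print(primary_key(u_ces))
print(primary_key(u_ces, ignore_punctuation=True))
```

In this toy model the tailored item has the same key as the reset position in both modes, which is the behavior the third bullet asks for.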
Further proposal
With the current method of modifying the first CE, both &ae<x and &æ<x make x sort primary-after af, but intuitively one would expect the order ae, æ, x, af.
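The effect of modifying the first CE can be reproduced with a small toy model. The primary weights, the "+1" weight allocation, and the function names below are assumptions made for illustration only; a real implementation allocates a weight between existing ones rather than adding one.

```python
# Hypothetical primary weights; only their relative order matters here.
TOY_PRIMARY = {"a": 0x29, "b": 0x2B, "e": 0x31, "f": 0x33}


def primary_key(text):
    """Primary sort key of untailored text in the toy model."""
    return [TOY_PRIMARY[ch] for ch in text]


def tailor_first_ce(reset_text):
    """Current method, simplified: bump the primary of the FIRST CE of the
    reset and ignore the rest for the primary difference."""
    return [TOY_PRIMARY[reset_text[0]] + 1]


keys = {
    "ae": primary_key("ae"),
    "af": primary_key("af"),
    "b": primary_key("b"),
    "x": tailor_first_ce("ae"),  # &ae<x
}
order = sorted(keys, key=keys.get)
print(order)
```

In this model x lands after af (and before b), not between ae and af.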
This seems even worse with Hangul syllables. Tailoring x primary-after an LVT syllable makes x sort primary-after any string that starts with that syllable's Leading consonant rather than between that syllable and the next one:
&각<x
  02: 각 78 0a 34 61 01 07 01 07 00
  03: 갂 78 0a 34 63 01 07 01 07 00
  04: 갃 78 0a 34 65 01 07 01 07 00
  05: 갛 78 0a 34 95 01 07 01 07 00
  06: 개 78 0a 36 01 06 01 06 00
  07: 기 78 0a 5c 01 06 01 06 00
  01: x 78 0b 01 05 01 05 00
  08: 까 78 0c 34 01 06 01 06 00
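The ordering in the listing above can be checked by comparing the sort keys bytewise. The bytes below are copied from the ticket; the variable names are only for this sketch.

```python
# Sort keys as given in the ticket, keyed by the string they belong to.
SORT_KEYS = {
    "각": "78 0a 34 61 01 07 01 07 00",
    "갂": "78 0a 34 63 01 07 01 07 00",
    "갃": "78 0a 34 65 01 07 01 07 00",
    "갛": "78 0a 34 95 01 07 01 07 00",
    "개": "78 0a 36 01 06 01 06 00",
    "기": "78 0a 5c 01 06 01 06 00",
    "x": "78 0b 01 05 01 05 00",
    "까": "78 0c 34 01 06 01 06 00",
}

# bytes objects compare lexicographically, which is how sort keys are compared.
order = sorted(SORT_KEYS, key=lambda s: bytes.fromhex(SORT_KEYS[s]))
print(" ".join(order))
```

The tailored x comes out after 기, that is, after every listed syllable sharing the leading consonant of 각, rather than directly after 각.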
It looks like we should modify the last CE of at least matching strength.
- For a primary difference, modify the last primary CE. (Not the secondary CE in ä.)
- For a secondary difference, modify the last secondary or primary CE. (This would tailor the trailing secondary CE in ä. Otherwise &ä<<x would make x secondary-greater than any a-with-diacritic. The order should be ä, x, ã.)
- For a tertiary difference, modify the last tertiary, secondary or primary CE.
- For a quaternary difference (future syntax), modify the last quaternary, tertiary, secondary or primary CE. (Note: ICU will not support quaternary CEs.)
If there is no such CE, then modify what there is, maybe the first CE in this case. There will be limitations:
- Once we add a [first space] boundary CE, it will not be possible to tailor primary-after an ignorable.
- With an implementation that does not support quaternary CEs (like ICU), it will not be possible to tailor quaternary-after a completely ignorable.
CEs after the one being modified can be removed: They are swamped by the newly tailored difference. (&ä<x needs only one CE for x which is primary-greater than CE(a).)
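A minimal sketch of the rule above, under stated assumptions: CEs are toy (primary, secondary, tertiary) tuples with made-up weights, a CE's level is that of its first non-zero weight, and "modify" is simplified to adding 1 to the weight at the tailoring strength (a real implementation would allocate a weight between neighbors). The function names are hypothetical.

```python
PRIMARY, SECONDARY, TERTIARY = 1, 2, 3
COMMON = 0x05  # hypothetical common secondary/tertiary weight


def ce_level(ce):
    """Level of a CE: the strongest level with a non-zero weight."""
    primary, secondary, tertiary = ce
    if primary:
        return PRIMARY
    if secondary:
        return SECONDARY
    if tertiary:
        return TERTIARY
    return None  # completely ignorable


def index_to_modify(ces, strength):
    """Index of the last CE of at least matching strength.
    Falls back to the first CE if there is none."""
    for i in range(len(ces) - 1, -1, -1):
        level = ce_level(ces[i])
        if level is not None and level <= strength:
            return i
    return 0


def tailor_after(reset_ces, strength):
    """CEs for an item tailored after the reset at the given strength.
    CEs following the modified one are dropped."""
    if not reset_ces:
        raise ValueError("reset has no CEs")
    i = index_to_modify(reset_ces, strength)
    primary, secondary, tertiary = reset_ces[i]
    if strength == PRIMARY:
        new_ce = (primary + 1, COMMON, COMMON)
    elif strength == SECONDARY:
        new_ce = (primary, secondary + 1, COMMON)
    else:
        new_ce = (primary, secondary, tertiary + 1)
    return reset_ces[:i] + [new_ce]


# Toy CEs: primary CE for a, followed by a secondary CE for the diaeresis.
A_UMLAUT = [(0x29, COMMON, COMMON), (0x00, 0x2D, COMMON)]
# Toy CEs for "ae": two primary CEs.
AE = [(0x29, COMMON, COMMON), (0x31, COMMON, COMMON)]

print(tailor_after(A_UMLAUT, PRIMARY))    # &ä<x: one CE, primary-greater than CE(a)
print(tailor_after(A_UMLAUT, SECONDARY))  # &ä<<x: trailing secondary CE modified
print(tailor_after(AE, PRIMARY))          # &ae<x: CE(a) kept, CE(e) modified
```

With these toy values, &ae<x keeps CE(a) and modifies CE(e), so x would sort after ae but before af, where f has a larger primary than the modified weight.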
With the above, &⑽<x puts x primary-between ⑽ and "(10[" and ⑾, but with "ignore punctuation" it becomes equal to ⑽. If we modify the last non-variable (at-least-digit-group) CE, then x becomes primary-greater than "(10[" but preserves its primary difference from ⑽ even under "ignore punctuation".
I think we should optimize for the alternate=non-ignorable case, and document that "ignore punctuation" wipes out tailoring differences like &⑽<x just like it removes differences from occurrences of normal space and punctuation characters.