
CLDR Ticket #5909(closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

bad semantics of reset on expansion

Reported by: markus
Owned by: markus
Component: xxx-spec
Data Locale:
Phase:
Review: emmons
Weeks: 0.2
Data Xpath:

Description (last modified by markus)

Lots of details in IcuBug:9593, some more in IcuBug:9415.

Problem summary

  • Propagation of only the first CE of the first reset character/contraction.
  • Propagation of the remainder text, not CEs, after that first unit.
  • The remainder text only propagates up to the first primary difference.

Proposal

  • Propagate all of the CEs of the reset to the tailored items.
  • Propagate them as CEs, not as text. This will fix &l· for example. It is also simpler.
  • Propagate across primary differences too. If the first CE is variable as in &⒇=u, then the tailored item still sorts like the reset position with "ignore punctuation", rather than making it primary different.

I was unsure about what to do with primary differences, and proposed to document different approaches and their effects.

The CLDR-TC agreed to this in today's meeting.

Further proposal

With the current method of modifying the first CE, both &ae<x and &æ<x make x sort primary-after af, but intuitively one would expect the order ae, æ, x, af.
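The ordering problem can be reproduced with a toy model. The sketch below is not ICU code: the weights, the helper names (`ce`, `sort_key`, `bump_primary`, `order_with`), and the CE layout assumed for æ (a, then a secondary marker, then e) are all assumptions made for illustration, and "+1" stands in for whatever gap weight a real builder would allocate.

```python
# Hypothetical weights, chosen only to make the ordering visible.
A, E, F = 100, 140, 150   # primary weights for a, e, f
COMMON = 5                # common secondary/tertiary weight

def ce(primary, secondary=COMMON, tertiary=COMMON):
    """A collation element as a (primary, secondary, tertiary) tuple."""
    return (primary, secondary, tertiary)

def sort_key(ces):
    """Level-by-level key: non-zero primaries, then secondaries, then tertiaries."""
    return tuple(tuple(c[level] for c in ces if c[level] != 0) for level in range(3))

def bump_primary(ces, index):
    """Copy of ces whose CE at index has a slightly larger primary; other CEs are kept."""
    p, s, t = ces[index]
    return ces[:index] + [(p + 1, s, t)] + ces[index + 1:]

strings = {
    "ae": [ce(A), ce(E)],
    "æ": [ce(A), ce(0, 9), ce(E)],   # assumed: a + secondary marker + e
    "af": [ce(A), ce(F)],
}

def order_with(x_ces):
    """Sort ae, æ, af and x, where x uses the given tailored CEs."""
    items = dict(strings, x=x_ces)
    return sorted(items, key=lambda name: sort_key(items[name]))

first = order_with(bump_primary(strings["ae"], 0))  # current: modify the first CE
last = order_with(bump_primary(strings["ae"], 1))   # alternative: modify the last primary CE
print(first)  # ['ae', 'æ', 'af', 'x']
print(last)   # ['ae', 'æ', 'x', 'af']
```

In this model, modifying the first CE pushes x past every string that starts with a, while modifying the last primary CE yields the intuitive order.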

This seems even worse with Hangul syllables. Tailoring x primary-after an LVT syllable makes x sort primary-after any string that starts with that syllable's leading consonant, rather than between that syllable and the next one:


02: 각
78 0a 34 61 01 07 01 07 00
03: 갂
78 0a 34 63 01 07 01 07 00
04: 갃
78 0a 34 65 01 07 01 07 00
05: 갛
78 0a 34 95 01 07 01 07 00
06: 개
78 0a 36 01 06 01 06 00
07: 기
78 0a 5c 01 06 01 06 00
01: x
78 0b 01 05 01 05 00
08: 까
78 0c 34 01 06 01 06 00

It looks like we should modify the last CE of at least matching strength.

  • For a primary difference, modify the last primary CE. (Not the secondary CE in ä.)
  • For a secondary difference, modify the last secondary or primary CE. (This would tailor the trailing secondary CE in ä. Otherwise &ä<<x would make x secondary-greater than any a-with-diacritic. The order should be ä, x, ã.)
  • For a tertiary difference, modify the last tertiary, secondary or primary CE.
  • For a quaternary difference (future syntax), modify the last quaternary, tertiary, secondary or primary CE. (Note: ICU will not support quaternary CEs.)
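The selection rule in the list above can be sketched as a small function. This is a minimal illustration, assuming CEs are (primary, secondary, tertiary) tuples in which a zero weight means "ignorable at that level"; the names `ce_strength` and `index_to_modify` and the weights are hypothetical, and the quaternary level is left out.

```python
PRIMARY, SECONDARY, TERTIARY = 0, 1, 2

def ce_strength(ce):
    """Strongest level at which the CE is non-zero, or None if completely ignorable."""
    for level, weight in enumerate(ce):
        if weight != 0:
            return level
    return None

def index_to_modify(ces, difference):
    """Index of the last CE at least as strong as the requested difference, or None."""
    for i in range(len(ces) - 1, -1, -1):
        strength = ce_strength(ces[i])
        if strength is not None and strength <= difference:
            return i
    return None  # no such CE; this sketch leaves the fallback to the caller

# ä modeled as a primary CE for a followed by a secondary CE for the diaeresis.
a_umlaut = [(100, 5, 5), (0, 9, 5)]
print(index_to_modify(a_umlaut, PRIMARY))    # 0
print(index_to_modify(a_umlaut, SECONDARY))  # 1
print(index_to_modify(a_umlaut, TERTIARY))   # 1
```

For ä, a primary difference selects the CE for a, while a secondary or tertiary difference selects the trailing diaeresis CE.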

If there is no such CE, then modify what there is, maybe the first CE in this case. There will be limitations:

  • Once we add a [first space] boundary CE, it will not be possible to tailor primary-after an ignorable.
  • With an implementation that does not support quaternary CEs (like ICU), it will not be possible to tailor quaternary-after a completely ignorable.

CEs after the one being modified can be removed: They are swamped by the newly tailored difference. (&ä<x needs only one CE for x which is primary-greater than CE(a).)
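A sketch of the truncation, under the same kind of toy model: CEs are (primary, secondary, tertiary) tuples, "+1" stands in for the next free weight, and resetting the weaker levels to a common weight is an assumption of this sketch rather than something the ticket specifies. The function name `tailor_after` and all weights are hypothetical.

```python
PRIMARY, SECONDARY, TERTIARY = 0, 1, 2
COMMON = 5  # hypothetical common secondary/tertiary weight

def tailor_after(reset_ces, level):
    """Build CEs for an item tailored after the reset at the given level."""
    for i in range(len(reset_ces) - 1, -1, -1):
        # A CE qualifies if it is non-zero at `level` or at a stronger level.
        if any(reset_ces[i][:level + 1]):
            break
    else:
        raise ValueError("no CE of at least this strength; fallback not modeled here")
    weights = list(reset_ces[i])
    weights[level] += 1                   # stand-in for "the next free weight"
    for weaker in range(level + 1, 3):
        weights[weaker] = COMMON          # assumption: weaker levels restart at common
    # CEs after index i are dropped: the new difference swamps them.
    return reset_ces[:i] + [tuple(weights)]

a_umlaut = [(100, 5, 5), (0, 9, 5)]       # a + diaeresis
print(tailor_after(a_umlaut, PRIMARY))    # [(101, 5, 5)]
print(tailor_after(a_umlaut, SECONDARY))  # [(100, 5, 5), (0, 10, 5)]
```

With &ä<x the result is a single CE that is primary-greater than CE(a); with &ä<<x the CE for a is kept and only the diaeresis CE is modified.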

With the above, &⑽<x puts x primary-between ⑽ and "(10[" (and therefore before ⑾), but with "ignore punctuation" x becomes equal to ⑽. If we modify the last non-variable (at-least-digit-group) CE instead, then x becomes primary-greater than "(10[" but preserves its primary difference from ⑽ even under "ignore punctuation".

I think we should optimize for the alternate=non-ignorable case, and document that "ignore punctuation" wipes out tailoring differences like &⑽<x just like it removes differences from occurrences of normal space and punctuation characters.
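The &⑽<x trade-off can be illustrated with a primary-only toy model. Everything here is an assumption made for illustration: the weights, the variable range, the name `primary_key`, and the simplification that "ignore punctuation" just drops variable primaries (other levels are not modeled).

```python
VARIABLE_TOP = 50                       # assumed: primaries up to 50 are variable
LPAREN, RPAREN, LBRACKET = 20, 21, 25   # hypothetical variable primaries
ZERO, ONE = 100, 110                    # hypothetical digit primaries

def primary_key(primaries, ignore_punctuation):
    """Primary-level key; variable primaries are dropped when punctuation is ignored."""
    if ignore_punctuation:
        return tuple(p for p in primaries if p > VARIABLE_TOP)
    return tuple(primaries)

circled_10 = [LPAREN, ONE, ZERO, RPAREN]          # ⑽ expands like "(10)"
paren_10_bracket = [LPAREN, ONE, ZERO, LBRACKET]  # the string "(10["
x_last_ce = [LPAREN, ONE, ZERO, RPAREN + 1]       # modify the last primary CE
x_last_nonvar = [LPAREN, ONE, ZERO + 1]           # modify the last non-variable CE, drop the rest

for name, x in [("last CE", x_last_ce), ("last non-variable CE", x_last_nonvar)]:
    between = (primary_key(circled_10, False) < primary_key(x, False)
               < primary_key(paren_10_bracket, False))
    distinct = primary_key(x, True) != primary_key(circled_10, True)
    print(name, "| between ⑽ and '(10[':", between,
          "| distinct when ignoring punctuation:", distinct)
```

In this model, modifying the last CE keeps x between ⑽ and "(10[" but loses the difference once punctuation is ignored; modifying the last non-variable CE does the opposite.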


Change History

comment:1 Changed 4 years ago by markus

  • Description modified

comment:2 Changed 3 years ago by markus

  • Status changed from new to accepted
  • Review set to emmons

Also used this to roll in other previously-agreed tailoring semantics fixes. See the commit comments for details.

comment:3 Changed 3 years ago by emmons

  • Status changed from accepted to closed
  • Resolution set to fixed
