Common Locale Data Repository: Bug Tracking

CLDR Ticket #6850 (closed enhancement: fixed)

Opened 2 years ago

Last modified 2 years ago

collation algorithm: fall back to shorter prefixes

Reported by: markus
Owned by: markus
Component: xxx-spec
Data Locale:
Phase:
Review: pedberg
Weeks: 0.1
Data Xpath:

Description (last modified by pedberg)

LDML 24 says to fall back from mappings with the longest matching prefix directly to mappings with no prefix.

Richard Wordingham pointed out that this would yield different results for NFD input vs. composite characters:

Consider just having two extra mappings, for op|č and p|ç. Then we have CE(opç) = CE(o)CE(p)CE(c)CE(\u0327), as 'c' has prefixes 'op' and 'p', and 'op' is a matching prefix. However, if one looks for mappings starting with the character 'ç', the only prefix one sees is 'p', and one would incorrectly derive CE(opç) = CE(o)CE(p)CE(p|ç).

Mark also pointed out that adding a mapping with a longer prefix could hide mappings with shorter prefixes, which would be counter-intuitive.

We should modify the fallback to go from mappings with the longest matching prefix to mappings with the next-longest prefix, and so on, ultimately to mappings with no prefix.

I have implemented this in my ICU "collv2" branch (ICU r34698).
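
The difference between the two fallback rules can be sketched with a toy model. This is only an illustration under stated assumptions, not the ICU "collv2" code: the `MAPPINGS` table, the `lookup` and `collation_elements` helpers, and the string labels standing in for collation elements are all hypothetical, and only contiguous matching is modelled.

```python
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Hypothetical toy tailoring from the description: op|č and p|ç.
# Keys are (prefix, string) in NFD; values are labels, not real CEs.
MAPPINGS = {
    ("op", nfd("\u010d")): "CE(op|\u010d)",  # č = c + U+030C
    ("p", nfd("\u00e7")): "CE(p|\u00e7)",    # ç = c + U+0327
}


def lookup(text, i, chained):
    """Return (label, length) for the mapping chosen at text[i]."""
    # Prefixes of mappings that start with text[i] and match the text before i.
    prefixes = sorted(
        {p for (p, s) in MAPPINGS if s[0] == text[i] and text[:i].endswith(p)},
        key=len,
        reverse=True,
    )
    if not chained:
        # LDML 24 rule: only the longest matching prefix, then no prefix.
        prefixes = prefixes[:1]
    for p in prefixes:
        matches = [s for (q, s) in MAPPINGS if q == p and text.startswith(s, i)]
        if matches:
            s = max(matches, key=len)
            return MAPPINGS[(p, s)], len(s)
    # No prefix: every code point has a mapping of its own.
    return f"CE(U+{ord(text[i]):04X})", 1


def collation_elements(text, chained):
    text = nfd(text)
    out, i = [], 0
    while i < len(text):
        label, n = lookup(text, i, chained)
        out.append(label)
        i += n
    return out


old = collation_elements("op\u00e7", chained=False)
new = collation_elements("op\u00e7", chained=True)
print(old)
print(new)
```

In this model, `chained=False` tries only the prefix 'op', finds no match for c + U+0327 there, and falls straight to the unprefixed mappings, so 'c' and the cedilla come out separately. With `chained=True` the lookup continues to the shorter prefix 'p' and selects the p|ç mapping, which is the same result one gets when looking up the composite character.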


Change History

comment:1 Changed 2 years ago by mark

We need more clarity as to the exact ordering, with examples.

It sounds like the order of preference is the following (among any that are included in the rules):


comment:2 Changed 2 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 25final

comment:3 Changed 2 years ago by markus

  • Cc mark added

Richard pointed out another issue:

One should also check that with mappings for p|e, p|ê and op|ê, but not for op|e, the collation elements for opệ come out as CE(o)CE(p)CE(p|ê)CE(\u0323), and not as CE(o)CE(p)CE(op|ê)CE(\u0323).

I replied:

Yes, I think you are right, ... The trick is that for "no prefix" there is always a mapping for every code point. Discontiguous contractions can always start after a single initial code point.

When there are mappings with prefixes, there is not always a mapping for the originating code point, and when there is not, discontiguous contractions cannot continue after it.

This was a bug in my "collv2" code, and I fixed it in ICU r34719 (which also includes a couple of other changes). See the test cases there in collationtest.txt.
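
Richard's example can be traced with a small sketch. It is a simplified model under assumptions, not the ICU implementation: `MAPPINGS`, `match_under_prefix`, `collation_elements` and the `strict` flag are hypothetical names, the labels stand in for real collation elements, and the non-strict branch is only a guess at the kind of behavior the comment warns against.

```python
import unicodedata

E_CIRC = "e\u0302"  # NFD of ê

# Hypothetical toy tailoring: p|e, p|ê and op|ê, but no op|e.
MAPPINGS = {
    ("p", "e"): "CE(p|e)",
    ("p", E_CIRC): "CE(p|\u00ea)",
    ("op", E_CIRC): "CE(op|\u00ea)",
}


def match_under_prefix(rest, prefix, strict):
    """Return (label, used indexes) for a match at rest[0], or None."""
    cands = [s for (p, s) in MAPPINGS if p == prefix and rest.startswith(s)]
    if cands:
        s = max(cands, key=len)
    elif strict:
        return None  # no mapping for the originating code point
    else:
        s = rest[0]  # non-strict: start anyway (illustrates the problem)
    used = set(range(len(s)))
    max_skipped = 0  # highest combining class among skipped marks
    for j in range(len(s), len(rest)):
        ccc = unicodedata.combining(rest[j])
        if ccc == 0:
            break  # a starter blocks everything after it
        if ccc > max_skipped and (prefix, s + rest[j]) in MAPPINGS:
            s += rest[j]  # unblocked mark extends the contraction
            used.add(j)
        else:
            max_skipped = max(max_skipped, ccc)
    if (prefix, s) not in MAPPINGS:
        return None
    return MAPPINGS[(prefix, s)], used


def collation_elements(text, strict=True):
    chars = list(unicodedata.normalize("NFD", text))
    out, done = [], ""  # done: consumed text, used for prefix matching
    while chars:
        rest = "".join(chars)
        prefixes = sorted({p for (p, _) in MAPPINGS if done.endswith(p)},
                          key=len, reverse=True)
        for p in prefixes:  # longest prefix first, then shorter ones
            m = match_under_prefix(rest, p, strict)
            if m:
                break
        else:
            m = (f"CE(U+{ord(rest[0]):04X})", {0})  # no-prefix mapping
        label, used = m
        out.append(label)
        done += "".join(rest[k] for k in sorted(used))
        chars = [c for k, c in enumerate(chars) if k not in used]
    return out


print(collation_elements("op\u1ec7"))                # ends with p|ê, then U+0323
print(collation_elements("op\u1ec7", strict=False))  # ends with op|ê, then U+0323
```

With `strict=True`, the prefix 'op' is skipped because there is no op|e mapping to start from, so the discontiguous match runs under 'p' and yields the p|ê mapping followed by U+0323. With `strict=False`, the model starts a discontiguous match under 'op' without a mapping for 'e' and ends up with op|ê, the result the comment says to avoid.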

comment:4 Changed 2 years ago by markus

  • Status changed from assigned to reviewing
  • Review set to pedberg

comment:5 Changed 2 years ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed
  • Description modified
