[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #7202 (closed: fixed)

Opened 5 years ago

Last modified 4 years ago

simplify U+FFFE collation

Reported by: markus
Owned by: markus
Component: xxx-spec
Data Locale:
Phase: rc
Review: mark
Weeks: 0.1
Data Xpath:



LDML Part 5 (Collation), section 1.1.1 "U+FFFE", currently says:

U+FFFE maps to a CE with special minimal weights on all levels, including case, quaternary and identical levels — which may require special code for those levels. Its primary weight is not "variable": U+FFFE must not become ignorable in alternate handling. This allows for Merging Sort Keys within code point space. For example, when sorting names in a database, a sortable string can be formed with last_name + '\uFFFE' + first_name. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name.
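The name-sorting example in the paragraph above can be illustrated with a minimal sketch. It assumes a toy, primary-level-only key function (the hypothetical `toy_primary_key`) in which U+FFFE gets the lowest weight and every other character gets a weight derived from its lowercase code point. These are not real CLDR weights; the sketch only shows why a separator that sorts below every character keeps the fields apart.

```python
SEP = "\uFFFE"

def toy_primary_key(text):
    """Toy primary-level key (assumption, not real CLDR data):
    U+FFFE gets the lowest weight 2, every other character gets
    0x100 + its lowercase code point."""
    return [2 if ch == SEP else 0x100 + ord(ch) for ch in text.lower()]

# (last_name, first_name) records
names = [("Smithe", "Al"), ("Smith", "Zed")]

# Plain concatenation compares the "Z" of "Zed" with the "e" of "Smithe".
naive = sorted(names, key=lambda n: toy_primary_key(n[0] + n[1]))

# With U+FFFE between the fields, "Smith" ends before "Smithe" continues,
# so the last names are compared on their own.
merged = sorted(names, key=lambda n: toy_primary_key(n[0] + SEP + n[1]))

print(naive)
print(merged)
```

In this toy model the plain concatenation puts "Smithe, Al" first, while the U+FFFE-separated form puts "Smith, Zed" first, which is the order expected when sorting by last name and then first name.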

For backwards secondary level sorting, text segments separated by U+FFFE are processed in forward segment order, and within each segment the secondary weights are compared backwards. This is so that such combined strings are processed consistently with concatenating their sort keys.
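The segment-wise handling of backwards secondary weights described above could look roughly like the following sketch. It assumes a simplified model in which each collation element is a `(primary, secondary)` pair, the CE for U+FFFE is recognized by a hypothetical `MERGE_PRIMARY` value, and ignorable weights are not modeled; the function name and weights are illustrative only.

```python
MERGE_PRIMARY = 2  # assumed primary weight of the U+FFFE CE in this toy model

def backwards_secondaries(ces):
    """Collect secondary weights for a backwards-secondary comparison.

    Segments separated by the U+FFFE CE stay in forward order;
    the secondary weights inside each segment are reversed.
    """
    out, segment = [], []
    for primary, secondary in ces:
        if primary == MERGE_PRIMARY:
            out.extend(reversed(segment))  # flush the finished segment, reversed
            out.append(secondary)          # keep the separator's own weight in place
            segment = []
        else:
            segment.append(secondary)
    out.extend(reversed(segment))          # flush the last segment
    return out

# Two segments with toy weights, joined by the U+FFFE CE (2, 2).
ces = [(0x30, 5), (0x32, 7), (MERGE_PRIMARY, 2), (0x40, 9), (0x42, 5)]
print(backwards_secondaries(ces))
```

For the sample input this yields `[7, 5, 2, 5, 9]`, which matches reversing each segment separately and then joining the results. Reversing the whole string at once would give `[5, 9, 2, 7, 5]` and would put the second segment's weights ahead of the first.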

We should loosen the first sentence to "U+FFFE maps to a CE with a special minimal primary weight."

Rationale: With a unique primary weight, the other weights will never be compared with weights of any primary CE. As long as all CEs, including the one for U+FFFE, are UCA-well-formed, the other weights are at most compared to those of ignorables, which works by design.

Note that we forbid tailoring to U+FFFE, so its primary weight is unique.

We could add discussion of a trade-off:

With unique, low weights on all levels it is possible to achieve sortkey(str1 + "\uFFFE" + str2) == mergeSortkeys(sortkey(str1), sortkey(str2)).

When that is not necessary, then code can be a little simpler (no special handling for U+FFFE except for backwards-secondary), sort keys can be a little shorter (when using compressible common non-primary weights for U+FFFE), and another low weight can be used in tailorings.
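The trade-off can be made concrete with a small sketch. It assumes a toy model in which a sort key is a tuple of three weight lists (primary, secondary, tertiary), there are no ignorable CEs, and a hypothetical `merge_sort_keys` joins two keys level by level with the low weight 02 in between. The names and weights are illustrative and are not taken from any real implementation.

```python
MERGE = 2   # low weight used between merged levels (toy assumption)

FFFE_LOW = (2, 2, 2)      # low weights on all levels
FFFE_COMMON = (2, 5, 5)   # low primary, "common" secondary/tertiary

def sort_key(ces):
    """Toy sort key: one list of weights per level."""
    return tuple([ce[level] for ce in ces] for level in range(3))

def merge_sort_keys(key1, key2):
    """Join two toy sort keys level by level, separated by MERGE."""
    return tuple(a + [MERGE] + b for a, b in zip(key1, key2))

A = [(0x30, 5, 5)]
B = [(0x30, 7, 5)]                 # same primary as A, different secondary
C = [(0x30, 5, 5), (0x32, 5, 5)]

merged = merge_sort_keys(sort_key(C), sort_key(B))
equal_with_low = sort_key(C + [FFFE_LOW] + B) == merged
equal_with_common = sort_key(C + [FFFE_COMMON] + B) == merged

# The relative order of records is the same with either mapping.
records = [(C, A), (B, A), (A, B), (A, A)]
order_low = sorted(records, key=lambda r: sort_key(r[0] + [FFFE_LOW] + r[1]))
order_common = sorted(records, key=lambda r: sort_key(r[0] + [FFFE_COMMON] + r[1]))

print(equal_with_low, equal_with_common, order_low == order_common)
```

In this model the equality with the merged key holds only for the all-low mapping, while both mappings produce the same record order, because the unique low primary lines up the separator CEs before any secondary or tertiary weight is compared.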

We may change the FractionalUCA.txt mapping from FFFE; [02, 02, 02] to FFFE; [02, 05, 05]. (05 is the "common" secondary/tertiary weight.) Implementations could change it from one to the other as needed.

Kudos again to Richard Wordingham.


Change History

comment:1 Changed 5 years ago by markus

  • Xref set to 7179

comment:2 Changed 5 years ago by markus

We probably do need a special low value in the identical level (which does not work with collation elements, so it's not a "weight").

comment:3 Changed 5 years ago by markus

Note: In FractionalUCA.txt, CE(U+FFFE) has low secondary/tertiary weights:

FFFE;	[02, 02, 02]	# Special LOWEST primary, for merge/interleaving

However, in allkeys_CLDR.txt, CE(U+FFFE) has "common" secondary/tertiary weights:

FFFE  ; [.0001.0020.0005.FFFE] # <noncharacter-FFFE>

This inconsistency means that we might be free to change it to be consistently either one or the other...

comment:4 Changed 5 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 26final

comment:5 Changed 5 years ago by markus

  • Milestone changed from 26 to 27rc

comment:6 Changed 5 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:7 Changed 4 years ago by markus

  • Review set to mark

The discussion took place on the cldr-users mailing list, in the thread “Non-primary Weights of U+FFFE” (2014-03-30), with Richard Wordingham.

comment:9 Changed 4 years ago by markus

  • Status changed from assigned to reviewing

I simplified U+FFFE collation in ICU along these lines; see IcuBug:10829.

comment:10 Changed 4 years ago by mark

  • Status changed from reviewing to closed
  • Resolution set to fixed
