[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #7176 (closed enhancement: fixed)

Opened 3 years ago

Last modified 3 years ago

push unihan into the root collation

Reported by: markus
Owned by: markus
Component: collation
Data Locale: zh
Phase: rc
Review: pedberg
Weeks: 0.5
Data Xpath:


The default order of Han characters (Unified_Ideograph) is by Han block, then by code point. Within each block, Han characters are assigned in order by radical, then by stroke count. In other words, this sort order was the same as "unihan" order while only the original Unihan block existed. Now most Han characters are in newer blocks, so the root order still needs no mapping data but is not very useful.
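To illustrate the difference, here is a minimal sketch (not ICU or CLDR code). It assumes a simplified model of the default order (original Unihan block first, then everything else by code point) and uses a small hand-copied sample of radical and residual-stroke values, which should be checked against the Unihan kRSUnicode data. The names `RS_SAMPLE`, `implicit_key`, and `radical_stroke_key` are made up for this example.

```python
# Illustrative only: a simplified model of the two orders discussed above.

# Hypothetical sample of (radical number, residual strokes) per character,
# hand-copied; verify against Unihan kRSUnicode before relying on it.
RS_SAMPLE = {
    "\u4E00": (1, 0),  # U+4E00, original block
    "\u4E01": (1, 1),  # U+4E01, original block
    "\u3400": (1, 4),  # U+3400, Extension A
    "\u4E59": (5, 0),  # U+4E59, original block
}


def implicit_key(ch):
    """Original Unihan block first, then other blocks, each by code point."""
    cp = ord(ch)
    in_original_block = 0x4E00 <= cp <= 0x9FFF
    return (0 if in_original_block else 1, cp)


def radical_stroke_key(ch):
    """Radical, then residual strokes, then code point as a tie-breaker."""
    radical, strokes = RS_SAMPLE[ch]
    return (radical, strokes, ord(ch))


chars = list(RS_SAMPLE)
by_implicit = sorted(chars, key=implicit_key)
by_radical_stroke = sorted(chars, key=radical_stroke_key)
```

In this sample, U+3400 sorts last in the block-based order only because it lives in a newer block. In radical-stroke order it moves next to the other radical-1 characters, ahead of U+4E59.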

Users normally look for Han characters by pronunciation, but they fall back to looking for the radical if they do not know the pronunciation. Han character dictionaries are ordered by radicals, or at least have an index by radical.

Suggestion: Change the CLDR root collation order of Han characters to be the radical-stroke order.

CLDR has type="unihan" tailorings for each of the ja, ko, zh locales. Each of these three unihan tailorings includes a permutation of the currently 74,617 Unified_Ideograph characters (Unicode 6.3). These tailorings are large: for example, they are not built into ICU because they would add 1.72MB to the built collation data, an increase of 66% over the 2.62MB for everything else (after fixing IcuBug:10810). Even when storing only the ICU binary collation data (without the rule strings), the increase is still 962kB, or +43% over the 2.16MB for all other tailorings. (These numbers are from the ICU 53 release candidate.)

If we push the radical-stroke order into the root, then the root data size would increase but the unihan tailorings would be very small (just the unihan index markers, and non-Han rules). I would add the radical-stroke data into FractionalUCA.txt (with special syntax that lists the Han characters as themselves, and before other mappings) but not into UCA_Rules.txt.

(In ICU, this change would increase the total collation data size by 11% when unihan tailorings are excluded, and reduce it by 33% when they are included.)
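As a rough cross-check of these two percentages, the sketch below redoes the arithmetic from the sizes quoted in this ticket. It assumes a root increase of roughly 300kB (the 32-bits-per-character figure given elsewhere in this ticket) and that the unihan tailorings shrink to a negligible size after the change. The variable names are made up for this example.

```python
# Figures quoted in this ticket (ICU 53 release candidate), in MB.
other_data = 2.62         # all built collation data except unihan tailorings
unihan_tailorings = 1.72  # cost of building in the three unihan tailorings
root_increase = 0.30      # roughly 300kB at 32 bits per character

# Assumption: after the change the unihan tailorings are negligibly small.
increase_without = root_increase / other_data

before_with = other_data + unihan_tailorings
after_with = other_data + root_increase
reduction_with = (before_with - after_with) / before_with

print(f"{increase_without:.0%} larger when unihan tailorings are excluded")
print(f"{reduction_with:.0%} smaller when unihan tailorings are included")
```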

It would be possible to use a smaller permutation table to reduce the root data size increase from 300kB (32 bits per character) to 225kB (24 bits) or 188kB (20 bits), at the cost of some more code and data structure complexity. (In ICU I would rather add an ICU genuca tool option to revert to the DUCET implicit-weight order, for when this root size increase is not desired.)
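The table sizes can be estimated with a short calculation. The sketch below assumes the Unified_Ideograph ranges of Unicode 6.3 as listed in the code (worth verifying against the Unicode Character Database) and a densely packed table with one entry per character. `han_count` and `table_bytes` are hypothetical helper names.

```python
# Unified_Ideograph ranges as of Unicode 6.3 (assumed for this sketch).
UNIFIED_IDEOGRAPH_RANGES = [
    (0x3400, 0x4DB5),    # Extension A
    (0x4E00, 0x9FCC),    # original Unihan block
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
]
# Twelve unified ideographs in the CJK Compatibility Ideographs block.
COMPAT_UNIFIED = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]


def han_count():
    """Number of Unified_Ideograph characters covered by the ranges above."""
    in_ranges = sum(hi - lo + 1 for lo, hi in UNIFIED_IDEOGRAPH_RANGES)
    return in_ranges + len(COMPAT_UNIFIED)


def table_bytes(count, bits_per_char):
    """Bytes needed for a densely packed table, rounded up to a whole byte."""
    return (count * bits_per_char + 7) // 8


n = han_count()
sizes = {bits: table_bytes(n, bits) for bits in (32, 24, 20)}
for bits, size in sizes.items():
    print(f"{bits} bits per character: {size} bytes")
```

The computed sizes come out slightly below the 300kB, 225kB, and 188kB quoted above, which is consistent with those figures being rounded for about 75,000 entries.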


Change History

comment:1 Changed 3 years ago by pedberg

Ooh, I like this idea!

comment:2 Changed 3 years ago by markus

For the UCA conformance tests: We would need to

  • either implement radical-stroke order in the Unicode tools
    • that could complicate the generation of FractionalUCA.txt (which maps back from implicit primaries to Han code points) unless we run its generation with different data, with just implicit Han
  • or restrict the Han characters in the test to the original Unihan block

At a glance, it looks like only a few Han characters occur in the test cases.
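The second option could be sketched as a simple filter over the test data. This is an assumption-laden sketch, not the Unicode tools' code: it assumes each test line holds space-separated hexadecimal code points, optionally followed by a `;` or `#` comment, and it uses the Unicode 6.3 extension ranges listed in the code. All function names are hypothetical.

```python
# Han ranges outside the original Unihan block (Unicode 6.3, assumed).
NEWER_HAN_RANGES = [
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
]


def parse_code_points(line):
    """Parse 'XXXX XXXX ...' hex code points, ignoring a trailing comment."""
    data = line.split("#", 1)[0].split(";", 1)[0]
    return [int(token, 16) for token in data.split()]


def uses_newer_han(line):
    """True if any code point on the line is Han outside the original block."""
    return any(
        lo <= cp <= hi
        for cp in parse_code_points(line)
        for lo, hi in NEWER_HAN_RANGES
    )


def restrict_to_original_block(lines):
    """Keep only test lines that contain no Han from the newer blocks."""
    return [line for line in lines if not uses_newer_han(line)]


sample = ["4E00 0021", "3400 0021", "20000 0062 # comment", "0061 0062"]
kept = restrict_to_original_block(sample)
```

For the sample lines, only the first and last survive. Lines containing Extension A or Extension B characters are dropped.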

comment:3 Changed 3 years ago by emmons

  • Owner changed from anybody to markus
  • Priority changed from assess to medium
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 26rc

comment:4 Changed 3 years ago by markus

IcuBug:11042 "root collation with unihan (radical-stroke) order"

comment:5 Changed 3 years ago by markus

  • Status changed from assigned to reviewing
  • Review set to pedberg

comment:6 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 26rc to 26

comment:7 Changed 3 years ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed
