CLDR Ticket #7176 (closed enhancement: fixed)
push unihan into the root collation
Reported by: markus | Owned by: markus
The default order of Han characters (Unified_Ideograph) is by Han block, then by code point. Within each block, code points are assigned in order by radical, then by stroke count. In other words, this sort order matched the "unihan" order when only the original Unihan block existed. Now that most Han characters are in newer blocks, the root order still has the advantage of needing no mapping data, but it is not very useful.
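The difference between the two orders can be sketched in a few lines of Python. This is only an illustration, not CLDR or ICU code: the implicit-weight formula follows the UCA scheme for Han characters (base FB40 for the original and compatibility blocks, FB80 for other Han blocks), the block ranges are simplified, and the four radical-stroke values are hand-entered sample data that should be checked against the Unihan database. The names `implicit_key` and `radical_stroke_key` are hypothetical, and breaking ties by implicit order is an assumption of this sketch.

```python
# Hand-entered sample data (assumption): character -> (radical, residual strokes).
SAMPLE_RADICAL_STROKE = {
    "\u4E00": (1, 0),  # original block
    "\u4E01": (1, 1),  # original block
    "\u4E28": (2, 0),  # original block
    "\u3400": (1, 4),  # Extension A, a newer block
}

def implicit_key(ch):
    """UCA-style implicit weights; assumes ch is a Unified_Ideograph."""
    cp = ord(ch)
    # Simplified block test: CJK Unified Ideographs or CJK Compatibility Ideographs.
    in_core = 0x4E00 <= cp <= 0x9FFF or 0xF900 <= cp <= 0xFAFF
    base = 0xFB40 if in_core else 0xFB80
    return (base + (cp >> 15), (cp & 0x7FFF) | 0x8000)

def radical_stroke_key(ch):
    """Radical first, then stroke count; ties fall back to implicit order."""
    radical, strokes = SAMPLE_RADICAL_STROKE[ch]
    return (radical, strokes, implicit_key(ch))

chars = list(SAMPLE_RADICAL_STROKE)
by_implicit = sorted(chars, key=implicit_key)
by_radical_stroke = sorted(chars, key=radical_stroke_key)

print([f"U+{ord(c):04X}" for c in by_implicit])
print([f"U+{ord(c):04X}" for c in by_radical_stroke])
```

In the implicit order, the Extension A character sorts after every character of the original block, even though it has radical 1. In the radical-stroke order it moves ahead of the radical 2 character.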
Users normally look for Han characters by pronunciation, but they fall back to looking for the radical if they do not know the pronunciation. Han character dictionaries are ordered by radicals, or at least have an index by radical.
Suggestion: Change the CLDR root collation order of Han characters to be the radical-stroke order.
CLDR has type="unihan" tailorings for each of the ja, ko, and zh locales. Each of these three unihan tailorings includes a permutation of all 74,617 current Unified_Ideograph characters (as of Unicode 6.3). These tailorings are large. For example, they are not built into ICU because they would add 1.72MB to the built collation data there, an increase of 66% over the 2.62MB for everything else (after fixing IcuBug:10810). Even when storing only the ICU binary collation data (without the rule strings), the increase is still 962kB, or +43% over the 2.16MB for all other tailorings. (These numbers are from the ICU 53 release candidate.)
If we push the radical-stroke order into the root, then the root data size would increase but the unihan tailorings would be very small (just the unihan index markers, and non-Han rules). I would add the radical-stroke data into FractionalUCA.txt (with special syntax that lists the Han characters as themselves, and before other mappings) but not into UCA_Rules.txt.
(In ICU, this would increase the total collation data size by 11% for builds that exclude the unihan tailorings, and reduce it by 33% for builds that include them.)
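These percentages can be roughly reproduced from the figures in this ticket. The sketch below makes two assumptions that the ticket does not state: that 1MB is 1024kB, and that the unihan tailorings shrink to a negligible size once the radical-stroke order is in the root. It uses the 300kB root increase quoted in this ticket for 32 bits per character.

```python
# Figures quoted in this ticket (ICU 53 release candidate).
OTHER_MB = 2.62                 # built collation data without unihan tailorings
UNIHAN_MB = 1.72                # added by the three unihan tailorings
ROOT_INCREASE_MB = 300 / 1024   # 300kB root increase; assumes 1MB = 1024kB

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Cost of building the unihan tailorings into ICU today.
tailoring_overhead = percent(UNIHAN_MB, OTHER_MB)

# Builds without unihan tailorings: the root grows, nothing shrinks.
increase_when_excluded = percent(ROOT_INCREASE_MB, OTHER_MB)

# Builds with unihan tailorings: assume they become negligibly small.
before = OTHER_MB + UNIHAN_MB
after = OTHER_MB + ROOT_INCREASE_MB
reduction_when_included = percent(before - after, before)

print(round(tailoring_overhead), round(increase_when_excluded),
      round(reduction_when_included))
```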
It would be possible to use a smaller permutation table to reduce the root data size increase from 300kB (32 bits per character) to 225kB (24 bits) or 188kB (20 bits), at the cost of some more code and data structure complexity. (In ICU I would rather add an ICU genuca tool option to revert to the DUCET implicit-weight order, for when this root size increase is not desired.)
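As a rough check on these sizes, and to show where the extra complexity comes from, here is a minimal sketch. The ticket does not say what each table entry holds or how ICU would lay the table out, so `pack20` and `unpack20` are a hypothetical layout (two 20-bit entries per 5 bytes), not the actual data structure. The byte counts come out slightly below the rounded kB figures above.

```python
HAN_COUNT = 74_617  # Unified_Ideograph characters in Unicode 6.3

def table_bytes(entries, bits_per_entry):
    """Bytes needed for a densely packed table, rounded up to whole bytes."""
    return (entries * bits_per_entry + 7) // 8

def pack20(values):
    """Pack 20-bit values, two per 5-byte group (hypothetical layout)."""
    out = bytearray()
    for i in range(0, len(values), 2):
        a = values[i]
        b = values[i + 1] if i + 1 < len(values) else 0  # pad an odd tail
        if not (0 <= a < 1 << 20 and 0 <= b < 1 << 20):
            raise ValueError("value does not fit in 20 bits")
        out += ((a << 20) | b).to_bytes(5, "big")
    return bytes(out)

def unpack20(data, index):
    """Read back entry `index` from a table written by pack20."""
    pair = index // 2
    word = int.from_bytes(data[5 * pair:5 * pair + 5], "big")
    return (word >> 20) & 0xFFFFF if index % 2 == 0 else word & 0xFFFFF

for bits in (32, 24, 20):
    print(bits, "bits per character:", table_bytes(HAN_COUNT, bits), "bytes")

sample = [0, 1, 0xFFFFF, 74_616, 12_345]
packed = pack20(sample)
print([unpack20(packed, i) for i in range(len(sample))] == sample)
```

With 32-bit entries a lookup is a plain array index. The 20-bit layout needs shifting and masking on every access, which is the kind of added code complexity mentioned above.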
- Owner changed from anybody to markus
- Priority changed from assess to medium
- Status changed from new to assigned
- Milestone changed from UNSCH to 26rc
- Status changed from assigned to reviewing
- Review set to pedberg