Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #7175 (accepted data)

Opened 3 years ago

Last modified 18 months ago

more compact Han collation tailorings

Reported by: markus
Owned by: markus
Component: collation
Data Locale: zh
Phase: rc
Review:
Weeks: 0.2
Data Xpath:


We should make better use of the compact syntax in the collation tailorings, especially in the large Han tailorings. These are machine-generated, so it should not be hard to fix the tool.

For example, the unihan tailorings look like this:

<'\uFDD0'⼂ # INDEX 3
<*丶 # 3.0
<*丷𪜊 # 3.1
<*丸义𠁼𠁽 # 3.2
<*丹为𠁿 # 3.3
<*主丼𠂀𠂁 # 3.4
<*𠂂 # 3.5
<*𪜋 # 3.7
<*举 # 3.8
<*𠂃 # 3.9
<*𠂄 # 3.12
<*𠂅 # 3.15

There should be one compact range per index marker (the index markers are contractions, so the compact syntax has to be broken around them). Instead, there is an extra <* for each new stroke count, just so that we can add the comments. We also use <* even when only a single code point follows; where that is still the case after merging, we should just use <. The data should look like this instead:

<'\uFDD0'⼂ # INDEX 3
<*丶丷𪜊丸义𠁼𠁽丹为𠁿主丼𠂀𠂁𠂂𪜋举𠂃𠂄𠂅

This would save dozens of kB in rule strings, especially when they are stored in UTF-16 (as in ICU), where each <* takes 4 bytes.
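
The merge described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual generator tool: the function names (strip_comment, compact_rules, utf16_bytes) are made up here, and the sketch assumes that every <* line lists plain code points (no "-" ranges, no quoting), as in the excerpts in this ticket.

```python
def strip_comment(line):
    """Return the rule part of a line, without a trailing '# ...' comment."""
    return line.split("#", 1)[0].strip()


def compact_rules(lines):
    """Merge consecutive primary-level rules into one run per index marker.

    Assumption: '<*' lines contain only plain code points.
    """
    out = []
    run = []  # code points collected since the last non-mergeable line

    def flush():
        if run:
            # '<*' only pays off for two or more code points.
            prefix = "<" if len(run) == 1 else "<*"
            out.append(prefix + "".join(run))
            run.clear()

    for line in lines:
        rule = strip_comment(line)
        if not rule:
            continue  # blank or comment-only line
        if rule.startswith("<*"):
            run.extend(rule[2:])  # Python strings iterate by code point
        elif rule.startswith("<") and len(rule) == 2:
            run.append(rule[1])  # '<' followed by exactly one code point
        else:
            flush()
            out.append(line.strip())  # e.g. an index-marker contraction
    flush()
    return out


def utf16_bytes(lines):
    """Size of the rules in UTF-16, ignoring comments and line breaks."""
    return sum(len(strip_comment(l).encode("utf-16-le")) for l in lines)


# The unihan excerpt from this ticket.
SAMPLE = [
    r"<'\uFDD0'⼂ # INDEX 3",
    "<*丶 # 3.0",
    "<*丷𪜊 # 3.1",
    "<*丸义𠁼𠁽 # 3.2",
    "<*丹为𠁿 # 3.3",
    "<*主丼𠂀𠂁 # 3.4",
    "<*𠂂 # 3.5",
    "<*𪜋 # 3.7",
    "<*举 # 3.8",
    "<*𠂃 # 3.9",
    "<*𠂄 # 3.12",
    "<*𠂅 # 3.15",
]

compacted = compact_rules(SAMPLE)
print("\n".join(compacted))
print(utf16_bytes(SAMPLE) - utf16_bytes(compacted), "bytes saved in UTF-16")
```

For this excerpt, eleven <* lines collapse into one, so ten <* prefixes (4 bytes each in UTF-16) are dropped. The per-stroke-count comments are lost in the process; whether that is acceptable is a decision for the tool.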

Similarly in pinyin:

<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚 # ā
<*嗄 # á
<*啊 # a
<*𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄 # āi
<*𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣 # ái

should be

<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚嗄啊𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣

The same applies to zhuyin, though it is less bad there.


Change History

comment:1 Changed 3 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 26rc

comment:2 Changed 3 years ago by markus

  • Milestone changed from 26rc to 27rc

comment:3 Changed 3 years ago by markus

The part about the unihan generation is obsolete: In ticket:7176 I moved the unihan radical-stroke order into the root collation order. There are no unihan order mappings in the CJK tailorings any more.

The other generated CJK tailorings should still be generated in a more compact form.

comment:4 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:5 Changed 2 years ago by markus

  • Milestone changed from 27 to 28

comment:6 Changed 2 years ago by markus

  • Type changed from defect to data

comment:7 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:8 Changed 20 months ago by markus

  • Milestone changed from 28 to 29

comment:9 Changed 18 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming

