CLDR Ticket #7175 (accepted data)
more compact Han collation tailorings
Reported by: markus
Owned by: markus
We should make better use of the compact syntax in the collation tailorings, especially in the large Han tailorings. These are machine-generated. It should not be hard to fix the tool.
For example, the unihan tailorings look like this:
<'\uFDD0'⼂ # INDEX 3
<*丶 # 3.0
<*丷𪜊 # 3.1
<*丸义𠁼𠁽 # 3.2
<*丹为𠁿 # 3.3
<*主丼𠂀𠂁 # 3.4
<*𠂂 # 3.5
<*𪜋 # 3.7
<*举 # 3.8
<*𠂃 # 3.9
<*𠂄 # 3.12
<*𠂅 # 3.15
There should be one compact range per index marker (we need to break the compact syntax around the index markers' contractions). Instead, there is an extra <* for each new stroke count, just so that we can add the comments. Also, we use <* even when only a single code point follows; where a range still contains only one code point after merging, we should use plain <. The data should look like this instead:
<'\uFDD0'⼂ # INDEX 3
<*丶丷𪜊丸义𠁼𠁽丹为𠁿主丼𠂀𠂁𠂂𪜋举𠂃𠂄𠂅
This would save dozens of kB in rule strings, especially when they are stored in UTF-16 (as in ICU), where each <* takes 4 bytes.
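As a rough check of the per-index saving, here is a minimal Python sketch that compares the UTF-16 size of the radical-3 example before and after merging. It assumes comments and whitespace are not part of the stored rule string; the helper name utf16_bytes is made up for this illustration.

```python
# Per-stroke-count runs from the radical-3 example (comments dropped).
runs = ["丶", "丷𪜊", "丸义𠁼𠁽", "丹为𠁿", "主丼𠂀𠂁",
        "𠂂", "𪜋", "举", "𠂃", "𠂄", "𠂅"]

def utf16_bytes(s):
    """Size of s in bytes when stored as UTF-16 (no BOM)."""
    return len(s.encode("utf-16-le"))

# Current form: one "<*" per stroke count.
before = "".join("<*" + r for r in runs)
# Proposed form: a single "<*" for the whole index.
after = "<*" + "".join(runs)

saved = utf16_bytes(before) - utf16_bytes(after)
print(len(runs), "runs ->", saved, "bytes saved")
```

Each dropped <* is two UTF-16 code units, so merging n runs into one saves 4 * (n - 1) bytes for that index, before counting anything else.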
The pinyin tailorings are similar:
<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚 # ā
<*嗄 # á
<*啊 # a
<*𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄 # āi
<*𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣 # ái
<*...
They should look like this instead:
<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚嗄啊𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣...
The zhuyin tailorings have the same problem, though to a lesser degree.
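The rewrite proposed for these tailorings could be sketched as a post-processing pass like the one below. This is not the actual generator tool; the function name compact_rules is hypothetical. It assumes one rule per line, that '#' only ever starts a comment, and that every line not starting with <* (such as an index marker) must break the compact run.

```python
def compact_rules(lines):
    """Merge consecutive "<*" lines into one compact run per index marker."""
    out = []
    run = []  # characters collected from consecutive "<*" lines

    def flush():
        # Emit the pending run: plain "<" for one code point, "<*" otherwise.
        if run:
            chars = "".join(run)
            out.append(("<" if len(chars) == 1 else "<*") + chars)
            run.clear()

    for line in lines:
        # Drop the trailing comment (assumption: '#' is never quoted data).
        body = line.split("#", 1)[0].strip()
        if body.startswith("<*"):
            run.append(body[2:])
        else:
            # Index markers and any other rule end the current run
            # and are copied through unchanged.
            flush()
            out.append(line.rstrip())
    flush()
    return out

example = [
    r"<'\uFDD0'⼂ # INDEX 3",
    "<*丶 # 3.0",
    "<*丷𪜊 # 3.1",
    "<*丸义𠁼𠁽 # 3.2",
]
print("\n".join(compact_rules(example)))
```

Merging is safe here because every <* run expresses the same primary-difference chain, so concatenating adjacent runs keeps the order. The per-stroke-count and per-syllable comments are lost, which is the trade-off the ticket accepts.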
- Owner changed from anybody to markus
- Status changed from new to assigned
- Milestone changed from UNSCH to 26rc