CLDR Ticket #7175 (accepted data)

Opened 3 years ago

Last modified 2 years ago

more compact Han collation tailorings

Reported by: markus
Owned by: markus
Component: collation
Data Locale: zh
Phase: rc
Review:
Weeks: 0.2
Data Xpath:
Xref:

Description

We should make better use of the compact syntax in the collation tailorings, especially in the large Han tailorings. These are machine-generated, so it should not be hard to fix the generator tool.

For example, the unihan tailorings look like this:

<'\uFDD0'⼂ # INDEX 3
<*丶 # 3.0
<*丷𪜊 # 3.1
<*丸义𠁼𠁽 # 3.2
<*丹为𠁿 # 3.3
<*主丼𠂀𠂁 # 3.4
<*𠂂 # 3.5
<*𪜋 # 3.7
<*举 # 3.8
<*𠂃 # 3.9
<*𠂄 # 3.12
<*𠂅 # 3.15

There should be one compact range per index marker (we need to break the compact syntax around the index markers because they are contractions). Instead, the tool starts a new <* for each stroke count just so that it can attach a comment. It also uses <* even when only a single code point follows; where that is still the case after merging, it should use plain < instead. The data should look like this:

<'\uFDD0'⼂ # INDEX 3
<*丶丷𪜊丸义𠁼𠁽丹为𠁿主丼𠂀𠂁𠂂𪜋举𠂃𠂄𠂅
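The merge described above can be sketched in a few lines of Python. This is only an illustration, not the actual CLDR generator: the function name compact_rules is hypothetical, and the sketch assumes that every rule line is either a <* range or some other rule (such as an index marker), and that # only ever starts a trailing comment.

```python
def compact_rules(lines):
    """Merge consecutive '<*' lines into one compact range per run.

    Assumption: '#' only starts a trailing comment and never occurs
    inside the rule text itself.
    """
    out = []
    pending = []  # characters collected from the current run of '<*' lines

    def flush():
        if pending:
            chars = "".join(pending)
            # len() counts code points, so a supplementary character is 1.
            # Use plain '<' when only a single code point follows.
            out.append(("<*" if len(chars) > 1 else "<") + chars)
            pending.clear()

    for line in lines:
        rule = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not rule:
            continue  # blank or comment-only line
        if rule.startswith("<*"):
            pending.append(rule[2:])
        else:
            # Any other rule (e.g. an index marker) ends the current run
            # and is kept unchanged, including its comment.
            flush()
            out.append(line.rstrip())
    flush()
    return out


example = [
    r"<'\uFDD0'⼂ # INDEX 3",
    "<*丶 # 3.0",
    "<*丷𪜊 # 3.1",
    "<*丸义𠁼𠁽 # 3.2",
]
merged = compact_rules(example)
single = compact_rules(["<*举 # 3.8"])
```

Here merged keeps the index marker line and collapses the three <* lines into one range, and single shows the fallback to plain < for a lone code point. The per-stroke-count comments are dropped, which is the trade-off this ticket accepts.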

This would save dozens of kB in the rule strings, especially when they are stored in UTF-16 (as in ICU), where each <* takes 4 bytes.
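As a rough check of the 4-bytes-per-<* figure, the following sketch compares the UTF-16 size of a few rule lines before and after merging. The helper name utf16_bytes is hypothetical, comments and line separators are ignored, and the sample is a short excerpt, so this illustrates the per-operator cost rather than the total savings.

```python
def utf16_bytes(rule_lines):
    """Total size of the rule text in UTF-16 bytes (no BOM, no separators)."""
    return sum(len(line.encode("utf-16-le")) for line in rule_lines)


# Three '<*' lines from the unihan example, comments already removed.
before = ["<*丶", "<*丷𪜊", "<*丸义𠁼𠁽"]
# The same characters in one compact range.
after = ["<*丶丷𪜊丸义𠁼𠁽"]

# Two '<*' operators disappear; each is 2 UTF-16 code units = 4 bytes.
saved = utf16_bytes(before) - utf16_bytes(after)
```

Since the characters are identical on both sides, the difference comes only from the removed <* operators.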

The pinyin tailoring has the same problem:

<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚 # ā
<*嗄 # á
<*啊 # a
<*𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄 # āi
<*𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣 # ái
<*...

This should be:

<'\uFDD0'A # INDEX A
<*阿呵𥥩锕𠼞𨉚嗄啊𡉓哎哀唉𠳳埃娭挨欸㶼𡟓𢰇溾嗳𤸖銰锿噯鎄𠊎𫘤啀捱皑溰䠹嘊敱敳㱯𤸳皚𦩴癌𧪚騃𩪂𩮖䶣...

The zhuyin tailoring has the same problem, to a lesser degree.

Change History

comment:1 Changed 3 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 26rc

comment:2 Changed 3 years ago by markus

  • Milestone changed from 26rc to 27rc

comment:3 Changed 3 years ago by markus

The part about the unihan generation is obsolete: In ticket:7176 I moved the unihan radical-stroke order into the root collation order. There are no unihan order mappings in the CJK tailorings any more.

The other generated CJK tailorings should still be generated in a more compact form.

comment:4 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:5 Changed 3 years ago by markus

  • Milestone changed from 27 to 28

comment:6 Changed 2 years ago by markus

  • Type changed from defect to data

comment:7 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:8 Changed 2 years ago by markus

  • Milestone changed from 28 to 29

comment:9 Changed 2 years ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
