CLDR Ticket #8378 (accepted data)

Opened 2 years ago

Last modified 16 months ago

Fix GenerateUnihanCollators (old compensating code)

Reported by: mark
Owned by: mark
Component: collation
Data Locale:
Phase: dsub
Review:
Weeks:
Data Xpath:
Xref:

Description

GenerateUnihanCollators generates the data for several tailorings. It has a bunch of code that tries to compensate for earlier bad Unihan data. That may not be necessary anymore (and may possibly interfere with newer corrected data).

Review the code to see whether those hacks can be removed.

Change History

comment:1 Changed 2 years ago by mark

Split from ticket:7224

comment:2 Changed 2 years ago by mark

The data files (and corresponding internal code) to look at are the patch files:

bihua-chinese-sorting.txt
patchStrokeT.txt
patchPinyin.txt
patchStroke.txt

Part of what the code does, for those characters where data is missing (and those alone), is to synthesize the total stroke counts from the radical-stroke info, by adding the strokes of the radical to the remainder. While clearly an approximation, it is better than having no information at all. Where we have info, that is then overridden by the stroke info in the patch files.

Also:
CJK_Radicals.csv should use the newer Unicode file
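
The stroke-count fallback described above can be sketched roughly as follows. This is only an illustration in Python, not the actual GenerateUnihanCollators code (which is Java); the function name, the dictionary shapes, and the choice to apply patch values on top of every entry are assumptions made for the example.

```python
from typing import Dict, Tuple

# (radical number, residual stroke count), as in a radical-stroke entry like "38.3".
RadicalStroke = Tuple[int, int]


def synthesize_total_strokes(
    known_totals: Dict[str, int],
    radical_stroke: Dict[str, RadicalStroke],
    strokes_of_radical: Dict[int, int],
    patch_totals: Dict[str, int],
) -> Dict[str, int]:
    """Hypothetical helper: fill in missing total stroke counts, then apply patches."""
    totals = dict(known_totals)
    for ch, (radical, residual) in radical_stroke.items():
        if ch in totals:
            continue  # only synthesize where data is missing
        base = strokes_of_radical.get(radical)
        if base is None:
            continue  # no stroke count for this radical, so nothing to add
        totals[ch] = base + residual  # approximation: radical strokes + remainder
    # Assumption: patch-file values win wherever they exist.
    totals.update(patch_totals)
    return totals


# Small example: U+4E00 has a known total; U+597D (radical 38, residual 3) does not.
known = {"\u4e00": 1}
rs = {"\u4e00": (1, 0), "\u597d": (38, 3)}
radical_counts = {1: 1, 38: 3}

totals = synthesize_total_strokes(known, rs, radical_counts, patch_totals={})
# totals["\u597d"] is 6 (3 strokes for radical 38 plus 3 residual strokes)
```

Keeping the patch step last means that a corrected value in a patch file always replaces a synthesized approximation, which matches the order described in the comment.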

comment:3 Changed 2 years ago by emmons

  • Status changed from new to accepted
  • Component changed from unknown to collation
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 28
  • Owner changed from anybody to markus
  • Type set to data

comment:4 Changed 2 years ago by markus

  • Owner changed from markus to mark

Mark agreed to take this one, since he has been working on this code already.

comment:5 Changed 2 years ago by mark

  • Milestone changed from 28 to 29

comment:6 Changed 23 months ago by emmons

  • Milestone changed from 29 to upcoming

comment:7 Changed 16 months ago by markus

For Unicode 9, in http://www.unicode.org/utility/trac/changeset/1047 I changed GenerateUnihanCollators to get most of the radical-stroke data from org.unicode.text.UCA.RadicalStroke -- to get it working again and to reduce duplicate parsing code.

The old CJK_Radicals.csv seems to have data for all of the 2E80..2EFF CJK Radicals Supplement block, which seems to have been used for "closure" of the old data structure, so I am still using it for fallbacks via the radicalMap. Only some of those mappings can be gleaned from UCD CJKRadicals.txt; otherwise I could have pushed most of the fallback handling down into UCA.RadicalStroke.
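
As a rough illustration of that fallback order (not the code from the changeset, which is Java), a lookup might consult mappings parsed from CJKRadicals.txt first and only then a map built from the old CSV. The function names, the fallback dictionary, and the sample entries below are assumptions for the sketch; the CSV layout is not shown in this ticket, so the fallback is given as an already-built dictionary, and the sample values should be checked against the real files.

```python
from typing import Dict, Optional


def parse_cjk_radicals(text: str) -> Dict[str, str]:
    """Map radical characters to radical numbers from lines shaped like
    'radical number; radical code point; ideograph code point'."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            continue
        number, radical_cp, _ideograph_cp = fields
        if radical_cp:
            mapping[chr(int(radical_cp, 16))] = number
    return mapping


def radical_number(
    ch: str, primary: Dict[str, str], fallback: Dict[str, str]
) -> Optional[str]:
    """Hypothetical lookup: prefer UCD-derived data, then the old CSV-derived map."""
    if ch in primary:
        return primary[ch]
    return fallback.get(ch)


# Illustrative sample lines in the CJKRadicals.txt field layout.
sample = """
# radical number; radical character; unified ideograph
1; 2F00; 4E00
90'; 2EA6; 4E2C
"""
primary = parse_cjk_radicals(sample)

# Hypothetical stand-in for the radicalMap built from CJK_Radicals.csv.
fallback = {"\u2e81": "27"}

from_ucd = radical_number("\u2f00", primary, fallback)
from_csv = radical_number("\u2e81", primary, fallback)
unmapped = radical_number("\u2e80", primary, fallback)
```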

comment:8 Changed 16 months ago by mark

I reviewed the code, and it looks good to me.
