Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #9925 (closed data: fixed)

Opened 18 months ago

Last modified 18 months ago

Incorrect translit/sorting for 賈

Reported by: mark
Owned by: mark
Component: translit
Data Locale: zh
Phase: dsub
Review: pedberg
Weeks:
Data Xpath:


Our tests revealed a problem with the character 賈 in both transliteration and sorting. I'll describe the transliteration problem, because the pinyin collation follows from it.

In v30, the romanization was changed from jiǎ to gǔ.


BTW, 賈 does not occur in Han-Latin-Names.xml, in either version.

I looked at the data, and we should be using the Unicode 9.0 kMandarin value for 8CC8, which is jiǎ.

Unicode 9.0
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ

That was unchanged from

Unicode 8.0
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ

So something very odd is happening.
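
The lookup above can be sketched in a few lines of Python. This is only an illustration: the sample lines are copied from this ticket, the helper name parse_unihan is hypothetical, and it is not the actual generator tool. It assumes one "U+XXXX property value" entry per line and a kHanyuPinyin value with a single location group.

```python
# Minimal sketch (not the CLDR tool): read Unihan-style lines and compare
# the kMandarin reading with the kHanyuPinyin readings for one character.
# The sample data is copied from this ticket; parse_unihan is a hypothetical helper.

UNIHAN_LINES = """\
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ
"""


def parse_unihan(text):
    """Return {code point: {property: value}} from 'U+XXXX prop value' lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp_field, prop, value = line.split(None, 2)
        code_point = int(cp_field[2:], 16)  # drop the leading "U+"
        table.setdefault(code_point, {})[prop] = value
    return table


table = parse_unihan(UNIHAN_LINES)
props = table[ord("賈")]  # U+8CC8

# kMandarin may carry more than one space-separated value; take the first.
mandarin = props["kMandarin"].split()[0]

# Assumes a single "location:readings" group, as in the sample above.
hanyu_pinyin = props["kHanyuPinyin"].split(":", 1)[1].split(",")

print("kMandarin:", mandarin)
print("kHanyuPinyin:", hanyu_pinyin)
```

With the sample data, the kMandarin reading is jiǎ, while gǔ appears only in the kHanyuPinyin list.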

This is generated by a tool, as described on: https://sites.google.com/site/unicodetools/unihan


Change History

comment:1 Changed 18 months ago by mark

Probably affects 賈贾

comment:2 Changed 18 months ago by mark

  • Priority changed from assess to major
  • Type changed from unknown to data
  • Milestone changed from UNSCH to 30.0.3

comment:3 Changed 18 months ago by mark

  • Cc pedberg added

comment:4 Changed 18 months ago by mark

  • Cc nrunge@…, fabalbon@…, markus.icu@… added
  • Owner changed from anybody to mark
  • Status changed from new to accepted

comment:5 Changed 18 months ago by mark

The goal is to see why we got a change for 賈:
v29 => jiǎ
v30 => gǔ

The data for collation and translit is generated by a tool: http://cldr.unicode.org/development/development-process/design-proposals/unihan-data

When I run GenerateUnihanCollators, I get the following result:


Note that the result does contain 賈 (and 贾).

Comparing the versions, I see:

v30:  [䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;
tool: [䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;
v29:  [䑝仮假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;

In trunk, the character 賈 is not in jiǎ, but rather in gǔ.
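
The comparison above can be reproduced mechanically with a set difference. The following is a rough sketch, not part of the CLDR tooling: the three rule strings are copied from this comment, parse_rule is a hypothetical helper, and it assumes the bracketed set contains only literal characters (no ranges or escapes).

```python
import re

# Rule lines copied from this ticket; the dictionary keys are just labels.
RULES = {
    "v30": "[䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;",
    "tool": "[䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;",
    "v29": "[䑝仮假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;",
}

# Matches "[chars]→reading;" where chars are literal characters only (assumption).
RULE_RE = re.compile(r"^\[(?P<chars>[^\]]*)\]→(?P<reading>[^;]+);$")


def parse_rule(rule):
    """Split a '[...]→reading;' rule into (set of characters, reading)."""
    match = RULE_RE.match(rule.strip())
    if match is None:
        raise ValueError(f"unrecognized rule: {rule!r}")
    return set(match.group("chars")), match.group("reading")


char_sets = {name: parse_rule(rule)[0] for name, rule in RULES.items()}

# Characters that the rerun tool and v29 map to jiǎ but v30 does not.
missing_vs_tool = char_sets["tool"] - char_sets["v30"]
missing_vs_v29 = char_sets["v29"] - char_sets["v30"]

print("tool - v30:", "".join(sorted(missing_vs_tool)))
print("v29  - v30:", "".join(sorted(missing_vs_v29)))
```

The same check applies to any other pair of rule lines that share a reading.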

Now, other characters change, but change is to be expected, since the tool is run against a new version of Unicode.

Our tests have found another problem:
ICU57 敦化 => Dunhua
ICU58 敦化 => Duihua
Here's the data; it is also fixed if we rerun the tool.

v30:  [䃦䔻䪃吨噸墩墪惇撉撴橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;
tool: [䃦䔻䪃吨噸墩墪惇撉撴敦橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;
v29:  [䃦䔻䪃吨噸墩墪惇撉撴敦橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;

v30 has 敦 in

What's hard to determine is why the tool, when run as part of the v30 release process, produced such different results from what it generates now.

comment:6 Changed 18 months ago by markus

  • Data Locale set to zh
  • Component changed from unknown to translit

comment:7 Changed 18 months ago by mark

  • Status changed from accepted to reviewing
  • Review set to pedberg

comment:8 Changed 18 months ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed

The pinyin collation is already using the updated Mandarin values; only the Han-Latin transform was not.

It is as if the tools were correctly run during CLDR 30, but only the updated collation data was checked in, and perhaps the updated Han-Latin data was inadvertently left uncommitted.

