Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #9925 (closed data: fixed)

Opened 4 months ago

Last modified 4 months ago

Incorrect translit/sorting for 賈

Reported by: mark
Owned by: mark
Component: translit
Data Locale: zh
Phase: dsub
Review: pedberg
Weeks:
Data Xpath:
Xref:

Description

Our tests revealed a problem for the character 賈 in translit and sorting. I'll describe the translit problem, because the pinyin collation follows that.

In v30, the romanization was changed from jiǎ to gǔ.

Han-Latin.xml
v29:
[䑝仮假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;
v30:
[⻣㒴㚉㯏㾶䀇䀜䀦䀰䐨䵻䶜傦古唃啒嘏夃尳愲扢榖榾汩淈濲瀔牯皷皼盬瞽穀糓縎罟羖股脵臌蓇薣蛊蛌蠱詁诂谷賈贾轂逧鈷钴餶馉骨鹄鹘鼓鼔𠑹𠻧𡷓𡽂𢝳𣖫𣦩𣦭𣨍𣨺𣫀𣱫𤅱𤚱𥐬𥠳𥮝𥵠𦈔𦍩𦾫𧟣𧣡𧵎𨪷𨵐𩙏𩲱𪇗𪕷]→gǔ;

BTW, 賈 does not occur in Han-Latin-Names.xml, in either version.

I looked at the data, and we should be using the Unicode 9.0 kMandarin value for 8CC8, which is jiǎ.

Unicode 9.0
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ

That was unchanged from

Unicode 8.0
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ

So something very odd is happening.
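For anyone who wants to repeat the kMandarin check above, here is a minimal sketch. It assumes the Unihan readings are available as plain text lines of the form "U+XXXX field value"; the sample lines are copied from this ticket, and the helper name read_field is made up for illustration. It is not the code the generation tool uses.

```python
# Sample lines copied from the ticket. Real Unihan files are tab-separated,
# so splitting on any whitespace handles both layouts.
SAMPLE = """\
# comment lines start with '#'
U+8CC8 kHanyuPinyin 63637.070:gǔ,jià,jiǎ
U+8CC8 kMandarin jiǎ
"""

def read_field(lines, field):
    """Return {character: value} for one Unihan field (hypothetical helper)."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a "code field value" record
        code, name, value = parts
        if name == field and code.startswith("U+"):
            result[chr(int(code[2:], 16))] = value
    return result

mandarin = read_field(SAMPLE.splitlines(), "kMandarin")
print(mandarin["賈"])
```

Running the same function over the 8.0 and 9.0 data and comparing the two dictionaries would confirm that the kMandarin value did not change between versions.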

This is generated by a tool, as described on: https://sites.google.com/site/unicodetools/unihan

Change History

comment:1 Changed 4 months ago by mark

Probably affects 賈贾

comment:2 Changed 4 months ago by mark

  • Priority changed from assess to major
  • Type changed from unknown to data
  • Milestone changed from UNSCH to 30.0.3

comment:3 Changed 4 months ago by mark

  • Cc pedberg added

comment:4 Changed 4 months ago by mark

  • Cc nrunge@…, fabalbon@…, markus.icu@… added
  • Owner changed from anybody to mark
  • Status changed from new to accepted

comment:5 Changed 4 months ago by mark

The goal is to see why we got a change for 賈:
v29 => jiǎ
v30 => gǔ

The data for collation and translit is generated by a tool: http://cldr.unicode.org/development/development-process/design-proposals/unihan-data

When I run GenerateUnihanCollators, I get the following result:

[䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;

Note that this set does contain 賈 (and 贾).

When I look at the multiple versions I see:

v30:  [䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;
tool: [䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;
v29:  [䑝仮假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾𣦉𤖰𤗜𥑐𩌍𩨹𩲣𪆲]→jiǎ;

In trunk, the character 賈 is not in jiǎ, but rather in gǔ.
[⻣㒴㚉㯏㾶䀇䀜䀦䀰䐨䵻䶜傦古唃啒嘏夃尳愲扢榖榾汩淈濲瀔牯皷皼盬瞽穀糓縎罟羖股脵臌蓇薣蛊蛌蠱詁诂谷賈贾轂逧鈷钴餶馉骨鹄鹘鼓鼔𠑹𠻧𡷓𡽂𢝳𣖫𣦩𣦭𣨍𣨺𣫀𣱫𤅱𤚱𥐬𥠳𥮝𥵠𦈔𦍩𦾫𧟣𧣡𧵎𨪷𨵐𩙏𩲱𪇗𪕷]→gǔ;

Now, other characters change, but change is to be expected, since the tool is run against a new version of Unicode.
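The jiǎ comparison above can be done mechanically with a set difference. This is a rough sketch only: the rule lines are the ones quoted in this comment with the supplementary-plane characters left out for brevity, and the parser assumes each rule is a plain list of literal characters in brackets (no ranges or escapes), which may not hold for every rule in Han-Latin.xml. The name parse_rule is made up for this example.

```python
import re

# jiǎ rules from this ticket, abbreviated to their BMP characters.
V29 = "[䑝仮假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾]→jiǎ;"
V30 = "[䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛鉀钾]→jiǎ;"
TOOL = "[䑝假叚婽岬徦斚斝椵榎槚檟玾甲瘕胛賈贾鉀钾]→jiǎ;"

RULE = re.compile(r"^\[(.+)\]→(.+);$")

def parse_rule(line):
    """Split '[chars]→reading;' into (set of characters, reading)."""
    match = RULE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple set rule: {line!r}")
    return set(match.group(1)), match.group(2)

v29_chars, _ = parse_rule(V29)
v30_chars, _ = parse_rule(V30)
tool_chars, _ = parse_rule(TOOL)

missing_from_v30 = v29_chars - v30_chars    # dropped between v29 and v30
restored_by_tool = tool_chars - v30_chars   # comes back when the tool is rerun
still_missing = v29_chars - tool_chars      # changed even after a rerun

print(sorted(missing_from_v30))
print(sorted(restored_by_tool))
print(sorted(still_missing))
```

On this data the rerun restores 賈 and 贾, while 仮 stays out of the jiǎ set, which is the kind of ordinary change expected from a new Unicode version.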

Our tests have found another problem:
ICU57 敦化 => Dunhua
ICU58 敦化 => Duihua
Here's the data; it is also fixed if we rerun the tool.

v30:  [䃦䔻䪃吨噸墩墪惇撉撴橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;
tool: [䃦䔻䪃吨噸墩墪惇撉撴敦橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;
v29:  [䃦䔻䪃吨噸墩墪惇撉撴敦橔犜獤礅蜳蹲蹾驐𡼖𤭞𥂦𦼿𧝗𩞤]→dūn;

v30 has 敦 in
[㙂㟋㠚㬣㳔䇏䨴䨺䬈䯟兊兌兑对対對怼憝憞懟敦濧瀩碓祋綐薱襨譈譵鐓镦队陮隊𠏮𠜑𠫨𡁨𡷋𡼻𣝉𤄛𤮩𥹲𦡷𦶏𨹅𩄮𩅆𩅥𩅲𩈁𩊭𩐌𪒛𪒡]→duì;
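To find cases like 敦 without relying on a test string such as 敦化, one option is to build a character-to-reading map from each version's rules and list the characters whose reading differs. A minimal sketch, with the same caveats as before: the rules are the ones quoted here with supplementary-plane characters omitted, the parser only understands simple '[chars]→reading;' lines, and the function names are hypothetical.

```python
import re

RULE = re.compile(r"^\[(.+)\]→(.+);$")

# Rules from this ticket, abbreviated to their BMP characters.
V29_RULES = [
    "[䃦䔻䪃吨噸墩墪惇撉撴敦橔犜獤礅蜳蹲蹾驐]→dūn;",
]
V30_RULES = [
    "[䃦䔻䪃吨噸墩墪惇撉撴橔犜獤礅蜳蹲蹾驐]→dūn;",
    "[㙂㟋㠚㬣㳔䇏䨴䨺䬈䯟兊兌兑对対對怼憝憞懟敦濧瀩碓祋綐薱襨譈譵鐓镦队陮隊]→duì;",
]

def reading_map(rule_lines):
    """Map each character to the reading of the first rule that lists it."""
    readings = {}
    for line in rule_lines:
        match = RULE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a simple set rule
        chars, reading = match.groups()
        for ch in chars:
            # Assumption: earlier rules take precedence, so keep the first hit.
            readings.setdefault(ch, reading)
    return readings

def changed_readings(old, new):
    """Characters present in both maps whose reading differs."""
    return {ch: (old[ch], new[ch])
            for ch in old.keys() & new.keys()
            if old[ch] != new[ch]}

changes = changed_readings(reading_map(V29_RULES), reading_map(V30_RULES))
for ch, (before, after) in sorted(changes.items()):
    print(f"{ch}: {before} -> {after}")
```

Run over the complete rule files of two releases, a report like this would list every character that moved between readings, which could then be checked against the kMandarin changes for that Unicode version.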

What's hard to determine is why the tool, when run as part of the v30 release process, produced such different results from what it generates now.

comment:6 Changed 4 months ago by markus

  • Data Locale set to zh
  • Component changed from unknown to translit

comment:7 Changed 4 months ago by mark

  • Status changed from accepted to reviewing
  • Review set to pedberg

comment:8 Changed 4 months ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed

The pinyin collation is already using the updated Mandarin values; only the Han-Latin transform was not.

It is as if the tools were correctly run during CLDR 30, but only the updated collation data was checked in, and perhaps the updated Han-Latin data was inadvertently left uncommitted.
