CLDR Ticket #6572 (closed enhancement: fixed)

Opened 5 years ago

Last modified 4 years ago

Correct pinyin collation for Chinese compounds

Reported by: pedberg
Owned by: pedberg
Component: collation
Data Locale:
Phase:
Review: markus
Weeks:
Data Xpath:




In various Chinese compounds of two or more characters, one of the characters has a different pinyin reading in the compound than the primary reading used for the character alone (many characters have several readings, and the reading in the compound is one of the alternates). cldrbug 6565 proposes handling this in the Han-Latin transform; this bug is about handling such compounds in the pinyin collators. The examples from cldrbug 6565 are:

藏 } 文 →zàng; # 藏 is zàng if followed by 文 (wén) - language Tibetan
重 } 庆 →chóng; # 重 is chóng if followed by 庆 (qìng) - city Chongqing
沈 } 阳 →shěn; # 沈 is shěn if followed by 阳 (yáng) - city Shenyang

This can be handled in the collator by a combination of contraction + expansion, e.g. for the first example above:

&銺<藏文/文

(銺 already has the reading zàng in the collator, and is the character with that reading that would immediately precede 藏.)
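
To make the contraction + expansion idea concrete, here is a minimal sketch of a toy collator in Python, using the rule &銺<藏文/文 from the additions listed in this ticket. It is not how CLDR or ICU build collation elements: the PRIMARY weights are made up, and the standalone weight of 藏 is simply assumed to sort before the zàng characters. The contraction maps the pair 藏文 to a new weight just after 銺, and the expansion appends the normal weight of 文 so that the second character still takes part in the comparison.

```python
# Toy model only: every weight below is hypothetical, chosen for illustration.
PRIMARY = {
    "藏": 200,  # assumed standalone weight of 藏 (sorts before the zàng block)
    "銺": 900,  # assumed weight of 銺, reading zàng
    "臓": 910,  # assumed weight of 臓, the next zàng character
    "文": 700,  # assumed weight of 文 (wén)
}

# &銺<藏文/文: the contraction 藏文 gets a new primary just after 銺,
# followed by the expansion, i.e. the ordinary weight of 文.
CONTRACTIONS = {"藏文": [PRIMARY["銺"] + 1, PRIMARY["文"]]}


def sort_key(text):
    """Return a list of toy primary weights, honoring two-character contractions."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            key.extend(CONTRACTIONS[pair])
            i += 2
        else:
            key.append(PRIMARY[text[i]])
            i += 1
    return key


words = ["臓", "藏文", "銺", "藏"]
print(sorted(words, key=sort_key))
```

In this model 藏 alone keeps its standalone position, while 藏文 sorts after 銺 and before 臓, which is the effect the rule is meant to achieve.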


Change History

comment:1 Changed 5 years ago by pedberg

  • Owner changed from anybody to pedberg
  • Priority changed from assess to medium
  • Status changed from new to accepted
  • Milestone changed from UNSCH to 24rc

The preceding character may be different for the short & long versions of the collations.

File ticket to add tooling to generate these automatically.

comment:2 Changed 5 years ago by pedberg

  • Xref changed from 6565 to 6565, 6589
  • Review set to markus

Given the new readings of each of the relevant characters when followed by the second character of the compound, the trick is to figure out (based on stroke count, then radical) where the character sorts in the set of other characters with the same reading. Here are the additions for the pinyin short collator:

&虫<重庆/庆 # Here 重 collates as chóng/9stk/rad166, between 虫 6stk/rad142, 崇 11stk/rad46
&弞<沈阳/阳 # Here 沈 collates as shěn/7stk/rad85, between 弞 7stk/rad57, 审 8stk/rad40
&銺<藏文/文 # Here 藏 collates as zàng/17stk/rad140, between 銺 15stk/rad167, 臓 18stk/rad130

For the long collator, the first entry above has a different preceding character (it sorts between two non-BMP characters).

Filed cldrbug 6589 to update the Han collator generation tools to support this in the future.
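
The placement step described in this comment (order by stroke count, then radical, within the set of characters sharing the new reading) can be sketched in Python as below. This is only an illustration, not the actual generation tooling: Han, order and contraction_rule are hypothetical names, and the candidate lists contain just the neighboring characters quoted in this ticket rather than the full set for each reading. Any tie-breaking beyond stroke count and radical is not modeled.

```python
from typing import List, NamedTuple


class Han(NamedTuple):
    char: str
    strokes: int   # total stroke count
    radical: int   # radical number


def order(h: Han) -> tuple:
    """Sort order within one reading: stroke count first, then radical."""
    return (h.strokes, h.radical)


def contraction_rule(first: Han, second: str, same_reading: List[Han]) -> str:
    """Build a rule placing the compound right after the closest earlier character."""
    earlier = [h for h in same_reading if order(h) < order(first)]
    if not earlier:
        raise ValueError("no preceding character with this reading")
    pred = max(earlier, key=order)
    # Contraction first+second, reset at pred, with second as the expansion.
    return f"&{pred.char}<{first.char}{second}/{second}"


# Stroke counts and radicals as quoted in the ticket (short collator).
CASES = [
    (Han("重", 9, 166), "庆", [Han("崇", 11, 46), Han("虫", 6, 142)]),
    (Han("沈", 7, 85), "阳", [Han("审", 8, 40), Han("弞", 7, 57)]),
    (Han("藏", 17, 140), "文", [Han("臓", 18, 130), Han("銺", 15, 167)]),
]

rules = [contraction_rule(first, second, others) for first, second, others in CASES]
for rule in rules:
    print(rule)
```

With the long collator's larger character set, the same function would pick a different predecessor wherever a closer character exists, which matches the note about the first entry.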

comment:3 Changed 5 years ago by markus

  • Status changed from accepted to closed
  • Resolution set to fixed

comment:4 Changed 4 years ago by emmons

  • Milestone 24rc deleted


