CLDR-UTC proposal on Unihan

To: UTC

From: Mark Davis, Peter Edberg, John Jenkins [tbd]

Subject: Additions to Unihan needed for CLDR.

Date: 2011-01-28

Certain fields in the Unihan data are of major importance for internationalization library implementations: notably the total stroke counts and the pinyin readings, which are needed as a basis for collation and other services. Unfortunately, the current data fields are not well suited to use in implementations, and they produce many results that do not match common user expectations. This conclusion is based on bug reports from the field and on review by native speakers.

The following presents a proposal from the CLDR committee for improving the Unihan data by adding new fields and changing the contents of some fields.

Define the kMandarin field to contain the most customary pinyin reading for the character. When there are two values, the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When the preferred reading is the same for both, the field contains only one value.

The preferred value is the one most commonly used in modern text, with some preference given to readings most likely to be in sorted lists.
This redefinition of kMandarin is possible because the kMandarin field has never had a specific definition in terms of other standards or works.
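As a rough illustration of how an implementation might consume the proposed kMandarin layout, the following Python sketch picks a reading by locale. The tab-separated line layout follows the existing Unihan data files; the helper names (`parse_unihan_line`, `preferred_reading`) are hypothetical, and the sample lines are illustrative values written for this example, not an extract of the actual database.

```python
def parse_unihan_line(line):
    """Split a Unihan-style line 'U+XXXX<TAB>field<TAB>value' into parts."""
    codepoint, field, value = line.rstrip("\n").split("\t", 2)
    char = chr(int(codepoint[2:], 16))  # drop the 'U+' prefix
    return char, field, value.split()   # multiple values are space-separated


def preferred_reading(readings, locale):
    """Pick the reading for a locale under the proposed kMandarin definition.

    First value: preferred for zh-Hans (CN).
    Second value, if present: preferred for zh-Hant (TW).
    A single value applies to both.
    """
    if len(readings) > 1 and locale.startswith("zh-Hant"):
        return readings[1]
    return readings[0]


# Illustrative sample lines (assumed values, not taken from the database).
SAMPLE = [
    "U+4E00\tkMandarin\tyī",
    "U+5783\tkMandarin\tlā lè",
]

readings = {}
for line in SAMPLE:
    char, field, values = parse_unihan_line(line)
    if field == "kMandarin":
        readings[char] = values

for locale in ("zh-Hans-CN", "zh-Hant-TW"):
    for char, values in readings.items():
        print(locale, char, preferred_reading(values, locale))
```

Keeping both readings in one field, ordered by locale, lets a library use a single lookup and fall back to the only value when the two locales agree.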

Define the kTotalStrokes field to hold the stroke count most appropriate for use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to hold the stroke count most appropriate for use with zh-Hans (CN). There are thus two different fields for the two different domains.

The stroke count for a given character is fairly standardized within China, but the order and number of strokes for that character may differ notably between China and the rest of the Chinese-speaking world.
The preferred value for each field is the one most commonly associated with the character in modern text using customary fonts, within that domain.
The kTotalStrokes field was defined to be the value "for the character as drawn in the Unicode charts". With multi-glyph charts, that definition is no longer relevant or correct, so the field can be meaningfully repurposed.
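The two-field arrangement could be used along the following lines. This is a minimal Python sketch, assuming the field names proposed above; the function names are hypothetical, the sample counts are illustrative, and the fallback to kTotalStrokes when no separate simplified count exists is an assumption of this example rather than part of the proposal.

```python
def stroke_field_for(locale):
    """Map a locale to the stroke-count field proposed for its domain."""
    if locale.startswith("zh-Hans"):
        return "kTotalSimplifiedStrokes"
    if locale.startswith("zh-Hant"):
        return "kTotalStrokes"
    raise ValueError(f"no stroke-count field defined for locale {locale!r}")


def stroke_count(fields, locale):
    """Return the stroke count for one character's fields, or None."""
    name = stroke_field_for(locale)
    # Assumption: fall back to kTotalStrokes when the locale-specific
    # field is absent for this character.
    value = fields.get(name, fields.get("kTotalStrokes"))
    return None if value is None else int(value)


# Illustrative per-character field data (assumed values).
DATA = {
    "一": {"kTotalStrokes": "1"},
    "骨": {"kTotalStrokes": "10", "kTotalSimplifiedStrokes": "9"},
}

for locale in ("zh-Hans-CN", "zh-Hant-TW"):
    for char, fields in DATA.items():
        print(locale, char, stroke_count(fields, locale))
```

A stroke-based collation could then use `stroke_count` as its primary sort key and select the field once per locale instead of per character.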

Communicate to WG2 and the IRG the importance of this information, and the need to supply it for all newly encoded Han characters.

The CLDR committee can provide initial data for these fields based on a review and comparison against other sources such as bihua and CNS. The data can then be improved over time.