To: UTC
From: Mark Davis, Peter Edberg, John Jenkins
[tbd]
Subject: Additions to Unihan needed for CLDR.
Date: 2011-01-28
Certain fields in the Unihan data are of major importance for
internationalization library implementations, notably the total strokes and
the pinyin readings, which are needed as a basis for collation and other
services. Unfortunately, the current data fields are not well suited to use in
implementations, and they produce many results that do not match common user
expectations. This conclusion is based on bug reports from the field and on
review by native speakers.
The following presents a proposal from the
CLDR committee for improving the Unihan data by adding new fields and
changing the contents of some fields.
- Define the kMandarin field to contain the most customary pinyin reading
for the character. When there are two values, the first is preferred for
zh-Hans (CN) and the second is preferred for zh-Hant (TW). If the two
values would be the same, only one value is given.
- The preferred value is the one most commonly used in modern text, with
some preference given to readings most likely to appear in sorted lists.
- This redefinition of kMandarin can be
done because the kMandarin field never had a specific definition in
terms of other standards or works.
- Define the kTotalStrokes field to hold the stroke count most appropriate
for use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to hold
the stroke count most appropriate for use with zh-Hans (CN). There are thus
two different fields for the two different domains.
- For each character, the stroke count within China is fairly standardized,
but the number and order of strokes for that character may differ notably
between China and the rest of the Chinese world.
- The preferred value for each field is
the one most commonly associated with the character in modern text using
customary fonts, within that domain.
- The kTotalStrokes field was defined to be the value "for the character as
drawn in the Unicode charts". But that definition is no longer relevant or
correct with multi-glyph charts, and the field can thus be meaningfully
repurposed.
- Communicate to WG2 and the IRG the importance of this information, and the
need to supply it for all newly encoded Han characters.
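
To illustrate how an implementation might consume the fields as proposed
above, the following is a minimal Python sketch, not a definitive
implementation. It assumes the usual tab-separated Unihan record layout
(code point, field name, value) and space-separated multiple values. The
function names and the sample record are hypothetical, and the sample
readings are illustrative rather than taken from the Unihan database.

```python
def parse_unihan_line(line):
    """Split one tab-separated Unihan record into (code point, field, value)."""
    code_point, field, value = line.rstrip("\n").split("\t", 2)
    return code_point, field, value


def preferred_reading(k_mandarin, locale):
    """Pick the reading for a locale under the proposed kMandarin definition.

    One value: shared by zh-Hans and zh-Hant.
    Two values: the first is for zh-Hans (CN), the second for zh-Hant (TW).
    """
    readings = k_mandarin.split()
    if not readings:
        raise ValueError("empty kMandarin value")
    if locale.startswith("zh-Hant") and len(readings) > 1:
        return readings[1]
    return readings[0]


def stroke_field(locale):
    """Name the stroke-count field to consult for a locale (as proposed)."""
    if locale.startswith("zh-Hant"):
        return "kTotalStrokes"
    return "kTotalSimplifiedStrokes"


# Illustrative record only; the values are assumptions made for this sketch.
sample = "U+5783\tkMandarin\tlā lè"
code_point, field, value = parse_unihan_line(sample)
print(code_point, preferred_reading(value, "zh-Hans"), stroke_field("zh-Hans"))
print(code_point, preferred_reading(value, "zh-Hant"), stroke_field("zh-Hant"))
```

Matching on the locale prefix is a simplification chosen for the sketch; a
real library would resolve the script through its own locale machinery.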
The CLDR committee can provide initial data
for these fields based on a review and comparison against other sources such
as bihua and CNS. The data can then be improved over time.
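
The comparison step described above could be sketched roughly as follows.
This is only an assumption about how such a review might be organized: each
source is modeled as a hypothetical mapping from code point to stroke count,
and the function reports the characters on which the sources disagree so
that they can be reviewed by hand. The source names come from this document;
the counts shown are placeholders, not values taken from either source.

```python
def find_disagreements(sources):
    """Return {code point: {source name: count}} where the sources differ.

    `sources` maps a source name to a dict of code point -> stroke count.
    Code points present in only one source are not reported.
    """
    all_code_points = set().union(*(table.keys() for table in sources.values()))
    disagreements = {}
    for code_point in sorted(all_code_points):
        counts = {
            name: table[code_point]
            for name, table in sources.items()
            if code_point in table
        }
        if len(set(counts.values())) > 1:
            disagreements[code_point] = counts
    return disagreements


# Placeholder inputs for illustration only.
candidate_sources = {
    "bihua": {"U+4E00": 1, "U+9AA8": 9},
    "CNS": {"U+4E00": 1, "U+9AA8": 10},
}
for code_point, counts in find_disagreements(candidate_sources).items():
    print(code_point, counts)
```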