Public Review Issue #131 - Han Exemplar characters

The Unicode Common Locale Data Repository (CLDR) contains exemplar characters for each locale/language. These are the characters customarily needed for the language in question. They are divided into two sets: main and auxiliary. For more information, see http://unicode.org/reports/tr35/#Character_Elements.
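
As a rough illustration of the two sets, the following sketch reads exemplarCharacters elements from a small LDML-style fragment. The fragment and its set contents are invented for illustration and are not actual CLDR data, and the helper name exemplar_sets is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative LDML-style fragment; the set contents are made up, not CLDR data.
SAMPLE_LDML = """
<ldml>
  <characters>
    <exemplarCharacters>[a-z]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[é ñ]</exemplarCharacters>
  </characters>
</ldml>
"""


def exemplar_sets(ldml_text):
    """Return {type: set pattern string}; an element without a type is 'main'."""
    root = ET.fromstring(ldml_text)
    result = {}
    for elem in root.findall("./characters/exemplarCharacters"):
        result[elem.get("type", "main")] = (elem.text or "").strip()
    return result


sets = exemplar_sets(SAMPLE_LDML)
print(sets["main"])       # pattern for the main set
print(sets["auxiliary"])  # pattern for the auxiliary set
```

The patterns are returned as strings; expanding them into individual characters would need a UnicodeSet parser, which is outside the scope of this sketch.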

For Han characters, the selection of characters is not as clear-cut. CLDR has been using a fairly small set of characters, but there is a request to include more of the commonly used characters. There are a number of possible sources that we could use to derive this set, and the CLDR Technical Committee would like feedback on this. Feedback can be filed by going to http://www.unicode.org/cldr/filing_bug_reports.html and following the link to "Locale Bugs".

The following options have been considered.

  1. Use charsets (in the case of Japanese, this would probably be JIS 208 + 212 + 213). This would be a large set and would contain many rarely used characters.
    • Alternate option: Only use JIS 208. (The current approach appears to be JIS 208, but only level 1.)
  2. Use the educational standards in each country/territory for primary+secondary requirements.
  3. Use the NIC sets (used for internationalized domain names).
  4. Use the characters that are supported by the commonly-used fonts on various platforms for these languages (e.g. the characters that are in the cmaps for TrueType fonts). This option would require some analysis.
  5. Use the IICore* subset intersected with the source for the locale in question (using ISO/IEC 10646 data).

These options would all produce Han character sets that would overlap to a considerable degree, particularly for the most common characters, but they would differ in details. The CLDR Technical Committee would like feedback on the best choice of these options and/or suggestions for other alternatives.
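
One way to evaluate the options is to compute what the candidate sets share and where each one differs. The sketch below is a minimal illustration using toy data: the candidate names and their contents are placeholders, not the actual sets any option would produce, and compare_candidates is a hypothetical helper.

```python
def compare_candidates(candidates):
    """Return (core, extras).

    core is the set of characters common to every candidate;
    extras maps each candidate name to its characters outside the core.
    """
    sets = list(candidates.values())
    core = set.intersection(*sets) if sets else set()
    extras = {name: chars - core for name, chars in candidates.items()}
    return core, extras


# Toy stand-ins for the real option sets (assumed data, for illustration only).
candidates = {
    "charset": set("一二三四五"),
    "education": set("一二三四"),
    "fonts": set("一二三五"),
}

core, extras = compare_candidates(candidates)
print(sorted(core))
print({name: sorted(chars) for name, chars in extras.items()})
```

With real data, the size of each extras entry would give a rough measure of how much an option adds beyond the characters that all options agree on.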

* Access to and Interpretation of IICORE.txt
The machine-readable form of the IICORE subset from ISO/IEC 10646 is available from: http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html. Download the zip file for ISO/IEC 10646:2003/Amd 1:2005(E) [c040755_ISO_IEC_10646_2003_Amd_1_2005(E)], and then extract IICORE.txt from the zip file. The format of the IICORE.txt file is described at the top of the text file. It can be parsed to determine the code points and sources for every character in the IICORE subset. The information about the "G", "T", "J", and "K" sources would be used to determine the locale-specific sub-repertoires of the IICORE subset.
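
The grouping step described above could be sketched as follows. This assumes the file has already been parsed, according to the format described in its header, into (code point, source letters) records; the column layout is deliberately not reproduced here. The mapping from source letters to locale identifiers and the sample records are assumptions made for illustration, and sub_repertoires is a hypothetical helper.

```python
from collections import defaultdict

# Assumed mapping from source letter to a locale; an illustration, not a CLDR rule.
SOURCE_TO_LOCALE = {"G": "zh_Hans", "T": "zh_Hant", "J": "ja", "K": "ko"}


def sub_repertoires(records):
    """Group code points by locale, given (code_point, source_letters) records."""
    by_locale = defaultdict(set)
    for code_point, sources in records:
        for source in sources:
            locale = SOURCE_TO_LOCALE.get(source)
            if locale is not None:  # skip sources that are not mapped above
                by_locale[locale].add(code_point)
    return dict(by_locale)


# Made-up records; real ones would come from parsing IICORE.txt per its header.
records = [(0x4E00, "GTJK"), (0x4E01, "GT"), (0x4E02, "J")]

result = sub_repertoires(records)
for locale in sorted(result):
    print(locale, [f"U+{cp:04X}" for cp in sorted(result[locale])])
```

Keeping the parsing separate from the grouping means the grouping logic does not depend on the exact layout of the file.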