From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Nov 14 2005 - 11:37:56 CST
We use NFC for the exemplar character set. Any significant character
sequence can be included as well. For example, one can have [a-h {ch}
i-z], which indicates that {ch} is treated as a unit. So if, say, x +
umlaut (combining diaeresis, U+0308) were a significant sequence, it
could be included in the set as {ẍ} or the equivalent {x\u0308}.
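To make the notation concrete, here is a rough Python sketch of how such a
set could be read. It is not the CLDR or ICU parser: parse_exemplars is a
hypothetical helper that handles only single characters, simple ranges like
a-h, and {sequence} items, and it expects escapes such as \u0308 to have
been resolved already (as Python string literals do). It applies NFC first,
which is why {ẍ} and {x\u0308} end up as the same item.

```python
import unicodedata

def parse_exemplars(pattern):
    """Parse a simplified exemplar pattern such as '[a-h {ch} i-z]'.

    Hypothetical helper: supports single characters, x-y ranges and
    {sequence} items only, not the full set syntax.
    """
    body = unicodedata.normalize("NFC", pattern.strip())
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("pattern must be enclosed in [ ]")
    body = body[1:-1]
    items = set()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1                      # whitespace only separates items
        elif ch == "{":
            end = body.index("}", i)    # a {...} item is kept as one unit
            items.add(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(ch), ord(body[i + 2])
            items.update(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            items.add(ch)
            i += 1
    return items

exemplars = parse_exemplars("[a-h {ch} i-z]")
print(sorted(exemplars, key=lambda s: (len(s), s)))
```

With this sketch, "[a-h {ch} i-z]" yields the 26 letters plus the single
item "ch".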
A few points.
- the order in the set is not significant, although we (just recently)
went to using collation order for clarity.
- the use of character sequences is primarily pedagogical if each of the
characters in that sequence is otherwise included. That is, if both c
and h are in the set, then including {ch} won't make a big difference
in usage.
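The second point can be illustrated with a small coverage check, again only
as a sketch. The function name uncovered, the greedy longest-match strategy,
and the lowercasing step are my assumptions for illustration, not how CLDR
consumers necessarily use the data. When c and h are both in the set, adding
{ch} does not change the result; it only matters when they are absent.

```python
import unicodedata

def uncovered(text, exemplars):
    """Return the characters of text not accounted for by exemplars.

    exemplars is a set of strings (single characters or sequences).
    Matching is greedy, longest item first (an assumption of this sketch).
    """
    text = unicodedata.normalize("NFC", text.lower())
    longest = max((len(e) for e in exemplars), default=1)
    missing = []
    i = 0
    while i < len(text):
        for size in range(longest, 0, -1):
            if text[i:i + size] in exemplars:
                i += size
                break
        else:
            # nothing matched at this position; ignore plain whitespace
            if not text[i].isspace():
                missing.append(text[i])
            i += 1
    return missing

letters = set("abcdefghijklmnopqrstuvwxyz")
print(uncovered("chata", letters))            # c and h present, no {ch}
print(uncovered("chata", letters | {"ch"}))   # same result with {ch} added
```

If c and h are removed from the set but {ch} is kept, "chata" is still
covered while a word with a bare c is not.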
Christopher Fynn wrote:
>
>
> Mark Davis wrote:
>
>> Logically speaking, the set of characters used by a language is
>> quite fuzzy, so there isn't really a black and white answer (see also
>> http://www.unicode.org/draft/reports/tr36/tr36.html#Language_Based_Security).
>>
>>
>> What we ended up doing in CLDR was having a core set of characters
>> for a language (the 'exemplarCharacters'), plus an additional set of
>> characters that would be seen in customary usage. For example, for
>> English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê
>> ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary
>> set. (http://unicode.org/cldr/data/common/main/en.xml)
>
>
> Mark
>
> Should the "exemplar characters" for a language include all the
> base+combining character *combinations* frequent in that language
> or - all the base characters and all the combining characters listed
> separately?
>
> - Chris
>
>
>> For the language in question, the latter is derived from dictionaries
>> and style guidelines for major publications in the language. We don't
>> have this in place for all languages yet, but will be expanding
>> coverage in the CLDR 1.4 release, so feedback is welcome.
>>
>> Mark
>>
>
>
>