From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Nov 14 2005 - 11:27:13 CST
Here is basically the situation right now.
1. If a character or sequence is only ever used in a very small number
of combinations, then we tend to list those separately. For example, if
the orthography has a-z plus é (which sorts after i), but doesn't use j
and w, then the main set would be:
[a-i é k-v x-z]
1a. If the sequence can't be represented as an NFC character, then it
needs {}. So for x-umlaut, one would use
[a-i é k-v x {ẍ} y-z]
(On input, it is always safe to use {} where there is any doubt. Thus
[abcde{e\u0301}{x\u0308}] resolves to [a-e é {ẍ}] .)
1b. Similarly, if the letter 'z' were only ever used in the combination
'tz', then we might have
[a-y {tz}]
(The language would probably have plain 'z' in the auxiliary set, for
use in foreign words.)
1c. There is some judgement involved in all this. For English one could
possibly have [a-p {qu} r-z] in the main set, with q in the auxiliary
set (for loan words like Qatar). However, it doesn't buy much, and we
just use [a-z].
2. If characters can be used productively in combination with a large
number of others (such as, say, Indic matras), then we don't enumerate
all the possible combinations; we just list the base characters and the
combining characters separately, such as:
[ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔
ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]
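
To make the {} rule in 1a and 1b concrete, here is a minimal sketch in
Python. It is not CLDR or ICU code; the helper name exemplar_item is
made up for illustration, and the n + U+0308 input is an example chosen
here, not one from the CLDR data. It normalizes a sequence to NFC and
adds braces only if the result is still more than one code point.

```python
import unicodedata


def exemplar_item(seq: str) -> str:
    """Format one character or sequence for an exemplar set.

    Hypothetical helper: normalize to NFC, then wrap in {} only if the
    result is still longer than a single code point.
    """
    nfc = unicodedata.normalize("NFC", seq)
    return nfc if len(nfc) == 1 else "{" + nfc + "}"


if __name__ == "__main__":
    samples = [
        "e\u0301",  # composes to U+00E9, so no braces are needed
        "n\u0308",  # no precomposed form, so it stays a sequence in {}
        "tz",       # a multi-letter sequence always needs {}
    ]
    for s in samples:
        print(ascii(s), "->", exemplar_item(s))
```

Always writing {} on input and letting normalization decide is the
safe direction, since a sequence that composes under NFC simply loses
its braces.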
Mark
Christopher Fynn wrote:
>
>
> Mark Davis wrote:
>
>> Logically speaking, the set of characters used by a language is
>> quite fuzzy, so there isn't really a black and white answer (see also
>> http://www.unicode.org/draft/reports/tr36/tr36.html#Language_Based_Security).
>>
>>
>> What we ended up doing in CLDR was having a core set of characters
>> for a language (the 'exemplarCharacters'), plus an additional set of
>> characters that would be seen in customary usage. For example, for
>> English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê
>> ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary
>> set. (http://unicode.org/cldr/data/common/main/en.xml)
>
>
> Mark
>
> Should the "exemplar characters" for a language include all the
> base+combining character *combinations* frequent in that language
> or - all the base characters and all the combining characters listed
> separately?
>
> - Chris
>
>
>> For the language in question, the latter is derived from dictionaries
>> and style guidelines for major publications in the language. We don't
>> have this in place for all languages yet, but will be expanding
>> coverage in the CLDR 1.4 release, so feedback is welcome.
>>
>> Mark
>>