Re: Exemplar Characters

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Nov 14 2005 - 11:27:13 CST

Next message: Mark Davis: "Re: Exemplar Characters"

Previous message: Philippe Verdy: "Re: Exemplar Characters (was: Re: three questions about alphabet files at Michael Everson site)"
In reply to: Christopher Fynn: "Exemplar Characters (was: Re: three questions about alphabet files at Michael Everson site)"
Next in thread: Philippe Verdy: "Re: Exemplar Characters"
Reply: Philippe Verdy: "Re: Exemplar Characters"
Reply: Antoine Leca: "Re: Exemplar Characters"
Maybe reply: Mark E. Shoulson: "Re: Exemplar Characters"
Maybe reply: Rick McGowan: "Re: Exemplar Characters"
Maybe reply: Kenneth Whistler: "Re: Exemplar Characters"
Maybe reply: Otto Stolz: "Re: Exemplar Characters"

Here is basically the situation right now.

1. If a character or sequence is only ever used in a very small number
of combinations, then we tend to list those separately. For example, if
the orthography has a-z plus é (which sorts after i), but doesn't use j
and w, then the main set would be:

[a-i é k-v x-z]

1a. If the sequence can't be represented as a single character in NFC,
then it needs {}. So for x-tilde, which has no precomposed form, one
would use

[a-i é k-v x {x̃} y-z]

(On input, it is always safe to use {} where there is any doubt. Thus
[abcde{e\u0301}{x\u0303}] resolves to [a-e é {x̃}] .)
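
One way to check whether a given sequence needs {} is to normalize it to NFC and count the code points that remain. The following is a minimal sketch in Python using the standard unicodedata module; the helper name needs_braces is made up for illustration and is not part of CLDR or ICU.

```python
import unicodedata


def needs_braces(text: str) -> bool:
    """Return True if `text` is still more than one code point after NFC.

    Hypothetical helper: a sequence that does not compose to a single
    code point has to be written as {...} in an exemplar set.
    """
    return len(unicodedata.normalize("NFC", text)) > 1


e_acute = "e\u0301"   # e + COMBINING ACUTE ACCENT, composes to U+00E9
x_tilde = "x\u0303"   # x + COMBINING TILDE, no precomposed form

for sample in (e_acute, x_tilde):
    composed = unicodedata.normalize("NFC", sample)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in composed)
    print(code_points, "needs {}" if needs_braces(sample) else "single character")
```

Whether a sequence composes depends on the Unicode data, so it is worth checking a candidate this way rather than assuming.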

1b. Similarly, if the letter 'z' were only ever used in the combination
'tz', then we might have

[a-y {tz}]

(The language would probably have plain 'z' in the auxiliary set, for
use in foreign words.)
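
To make the notation concrete, here is a rough sketch of how such a pattern could be expanded into a set of strings. It is deliberately simplified and is not ICU's UnicodeSet parser: it assumes only single characters, a-b ranges (by code point), {...} sequences and spaces, with no escapes or nested sets. The function name parse_exemplars is invented for this example.

```python
import unicodedata


def parse_exemplars(pattern: str) -> set[str]:
    """Expand a simple exemplar pattern such as "[a-y {tz}]" into strings."""
    body = unicodedata.normalize("NFC", pattern.strip())
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("pattern must be wrapped in [ ]")
    body = body[1:-1]

    items: set[str] = set()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            end = body.index("}", i)   # raises ValueError if unclosed
            items.add(body[i + 1:end])  # a multi-character element
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            first, last = ord(ch), ord(body[i + 2])
            items.update(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            items.add(ch)
            i += 1
    return items


main = parse_exemplars("[a-y {tz}]")
print("tz" in main, "z" in main)
```

With this pattern, 'tz' is an element of the set while plain 'z' is not, which is the distinction item 1b describes.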

1c. There is some judgement involved in all this. For English one could
possibly have [a-p {qu} r-z] in the main set, with q in the auxiliary
set (for loan words like Qatar). However, it doesn't buy much, and we
just use [a-z].
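
As an illustration of what the {qu} variant would buy, the sketch below checks whether a word can be split entirely into elements of a set. The function is_covered and the three sets are assumptions made for this example, not anything CLDR ships; input is assumed to be lowercase.

```python
def is_covered(word: str, exemplars: set[str]) -> bool:
    """True if `word` can be split entirely into elements of `exemplars`."""
    longest = max(map(len, exemplars), default=0)
    # reachable[i] is True when word[:i] can be segmented into exemplars.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for size in range(1, min(longest, end) + 1):
            if reachable[end - size] and word[end - size:end] in exemplars:
                reachable[end] = True
                break
    return reachable[len(word)]


letters = set("abcdefghijklmnopqrstuvwxyz")   # [a-z]
main_qu = (letters - {"q"}) | {"qu"}          # [a-p {qu} r-z]
auxiliary = {"q"}                             # plain q, for loan words

for word in ("queen", "qatar"):
    print(word,
          is_covered(word, main_qu),
          is_covered(word, main_qu | auxiliary),
          is_covered(word, letters))
```

The {qu} version accepts "queen" but rejects "qatar" until the auxiliary set is added, whereas plain [a-z] accepts both.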

2. If characters can be used productively in combination with a large
number of others (such as, say, Indic matras), then we don't enumerate all
the possible combinations; we just list them separately, such as:

[‌ ‍ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔
ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]
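
A small sketch of why listing the pieces separately is enough: a word is checked code point by code point, so no consonant+matra combination ever has to be enumerated. The set below is a reduced, illustrative subset built from code point ranges (an assumption for this example, not the CLDR data), and the helper names are made up.

```python
def code_point_range(first: int, last: int) -> set[str]:
    """All characters from `first` to `last`, inclusive."""
    return {chr(cp) for cp in range(first, last + 1)}


# Illustrative subset only; the real exemplar set is larger.
independent_vowels = code_point_range(0x0905, 0x0914)  # A .. AU
consonants = code_point_range(0x0915, 0x0939)          # KA .. HA
vowel_signs = code_point_range(0x093E, 0x094C)         # matras AA .. AU
marks = {"\u0901", "\u0902", "\u0903", "\u094D"}       # candrabindu, anusvara, visarga, virama

exemplars = independent_vowels | consonants | vowel_signs | marks


def uses_only(word: str, allowed: set[str]) -> bool:
    """True if every code point of `word` is in `allowed`."""
    return all(ch in allowed for ch in word)


word = "\u0939\u093F\u0928\u094D\u0926\u0940"  # "Hindi" written in Devanagari

print(uses_only(word, exemplars))
# Size of the separate listing versus consonant+matra pairs alone:
print(len(exemplars), len(consonants) * len(vowel_signs))
```

The second line of output compares the number of separately listed characters with the number of consonant+matra pairs, which is only one of the families of combinations that would otherwise need enumerating.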

Mark

Christopher Fynn wrote:

>
>
> Mark Davis wrote:
>
>> Logically speaking, the set of characters used by a language is
>> quite fuzzy, so there isn't really a black and white answer (see also
>> http://www.unicode.org/draft/reports/tr36/tr36.html#Language_Based_Security).
>>
>>
>> What we ended up doing in CLDR was having a core set of characters
>> for a language (the 'exemplarCharacters'), plus an additional set of
>> characters that would be seen in customary usage. For example, for
>> English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê
>> ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary
>> set. (http://unicode.org/cldr/data/common/main/en.xml)
>
>
> Mark
>
> Should the "exemplar characters" for a language include all the
> base+combining character *combinations* frequent in that language
> or - all the base characters and all the combining characters listed
> separately?
>
> - Chris
>
>
>> For the language in question, the latter is derived from dictionaries
>> and style guidelines for major publications in the language. We don't
>> have this in place for all languages yet, but will be expanding
>> coverage in the CLDR 1.4 release, so feedback is welcome.
>>
>> Mark
>>
>
>
>
>


This archive was generated by hypermail 2.1.5 : Mon Nov 14 2005 - 11:28:29 CST