Re: Exemplar Characters

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Nov 14 2005 - 11:27:13 CST


    Here is basically the situation right now.

    1. If a character or sequence is only ever used in a very small number
    of combinations, then we tend to list those separately. For example, if
    the orthography has a-z plus é (which sorts after i), but doesn't use j
    and w, then the main set would be:

    [a-i é k-v x-z]

    1a. If the sequence can't be represented as a single character in NFC,
    then it needs {}. So for x-umlaut, one would use

    [a-i é k-v x {ẍ} y-z]

    (On input, it is always safe to use {} where there is any doubt. Thus
    [abcde{e\u0308}{x\u0308}] resolves to [a-e ë {ẍ}] .)

    1b. Similarly, if the letter 'z' were only ever used in the combination
    'tz', then we might have

    [a-y {tz}]

    (The language would probably have plain 'z' in the auxiliary set, for
    use in foreign words.)

    1c. There is some judgement involved in all this. For English one could
    possibly have [a-p {qu} r-z] in the main set, with q in the auxiliary
    set (for loan words like Qatar). However, it doesn't buy much, and we
    just use [a-z].
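
    To make the notation in the examples above concrete, here is a rough
    Python sketch that expands such a set into its individual items. It is
    an illustration only: expand_exemplars is a hypothetical function that
    handles just the subset of the syntax used in this message (single
    characters, a-b ranges, {sequences}, and ignorable spaces), with no
    escapes or nested sets. It is not the ICU UnicodeSet parser.

```python
def expand_exemplars(pattern: str) -> list[str]:
    """Expand a simplified exemplar set such as "[a-y {tz}]" into a list.

    Assumed subset of the syntax: single characters, x-y ranges,
    {multi-character sequences}; whitespace is ignored.
    """
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed set")
    body = body[1:-1]

    items = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # A braced sequence is kept as one item.
            end = body.index("}", i)
            items.append(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            # A range covers every code point from start to end inclusive.
            start, stop = ord(ch), ord(body[i + 2])
            items.extend(chr(c) for c in range(start, stop + 1))
            i += 3
        else:
            items.append(ch)
            i += 1
    return items


if __name__ == "__main__":
    print(expand_exemplars("[a-i \u00e9 k-v x-z]"))
    print(expand_exemplars("[a-y {tz}]"))
```

    With this reading, [a-y {tz}] contains the sequence tz as a single
    item but no bare z, which matches the intent of 1b.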

    2. If characters can be used productively in combination with a large
    number of others (such as Indic matras), then we don't enumerate all
    the possible combinations; we just list the characters themselves, such as:

    [‌ ‍ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔
    ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]

    Mark

    Christopher Fynn wrote:

    >
    >
    > Mark Davis wrote:
    >
    >> Logically speaking, the set of characters used by a language is
    >> quite fuzzy, so there isn't really a black and white answer (see also
    >> http://www.unicode.org/draft/reports/tr36/tr36.html#Language_Based_Security).
    >>
    >>
    >> What we ended up doing in CLDR was having a core set of characters
    >> for a language (the 'exemplarCharacters'), plus an additional set of
    >> characters that would be seen in customary usage. For example, for
    >> English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê
    >> ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary
    >> set. (http://unicode.org/cldr/data/common/main/en.xml)
    >
    >
    > Mark
    >
    > Should the "exemplar characters" for a language include all the
    > base+combining character *combinations* frequent in that language
    > or - all the base characters and all the combining characters listed
    > separately?
    >
    > - Chris
    >
    >
    >> For the language in question, the latter is derived from dictionaries
    >> and style guidelines for major publications in the language. We don't
    >> have this in place for all languages yet, but will be expanding
    >> coverage in the CLDR 1.4 release, so feedback is welcome.
    >>
    >> Mark
    >>
    >
    >
    >
    >


