Re: Exemplar Characters

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Nov 14 2005 - 11:37:56 CST

    We use NFC for the exemplar character set. Any significant character
    sequence can be included as well. For example, one can have [a-h {ch}
    i-z], which indicates that {ch} is treated as a unit. So if, say, x +
    umlaut were a significant sequence, it could be included in the set as
    {ẍ} or the equivalent {x\u0308}.
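As a rough illustration of the notation above, here is a minimal Python sketch that reads a simplified exemplar pattern and stores every item in NFC. The function name parse_exemplars is hypothetical, and the parser only understands single characters, x-y ranges and {...} sequences. It is an assumption-laden toy, not the UnicodeSet parser that CLDR or ICU actually use.

```python
import unicodedata


def parse_exemplars(pattern: str) -> set[str]:
    """Parse a simplified exemplar pattern such as '[a-h {ch} i-z]'.

    Hypothetical helper: supports only single characters, 'x-y' ranges,
    and '{...}' sequences. Real UnicodeSet syntax has many more features.
    """
    # Normalize the whole pattern first, so every item ends up in NFC.
    body = unicodedata.normalize("NFC", pattern.strip())
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("pattern must be wrapped in [ ]")
    body = body[1:-1]

    items: set[str] = set()  # a set, because order is not significant
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # A braced sequence is kept as one multi-character item.
            end = body.index("}", i)
            items.add(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            # A range such as a-h expands to each code point in between.
            for cp in range(ord(ch), ord(body[i + 2]) + 1):
                items.add(chr(cp))
            i += 3
        else:
            items.add(ch)
            i += 1
    return items


czech_like = parse_exemplars("[a-h {ch} i-z]")
precomposed = parse_exemplars("[{\u1e8d}]")
decomposed = parse_exemplars("[{x\u0308}]")
```

With this sketch, czech_like holds the 26 letters plus the single item "ch", and the precomposed and decomposed spellings of x + umlaut produce the same set, because both normalize to U+1E8D under NFC.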

    A few points.
    - the order in the set is not significant, although we (just recently)
    switched to using collation order for clarity.
    - the use of character sequences is primarily pedagogical if each of the
    characters in that sequence is otherwise included. That is, if both c
    and h are in the set, then including {ch} won't make a big difference in
    the usage.
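The second point can be sketched in Python as a coverage check. The function is_covered is a hypothetical helper, and "covered" here means only that the NFC form of the text can be split entirely into exemplar items. That is one assumed reading of "usage", not a definition taken from CLDR.

```python
import unicodedata


def is_covered(text: str, exemplars: set[str]) -> bool:
    """Return True if text can be split entirely into exemplar items.

    Hypothetical helper; assumes the items in `exemplars` are already NFC.
    """
    text = unicodedata.normalize("NFC", text)
    # reachable[i] is True when text[:i] can be segmented into items.
    reachable = [True] + [False] * len(text)
    for i in range(len(text)):
        if not reachable[i]:
            continue
        for item in exemplars:
            if text.startswith(item, i):
                reachable[i + len(item)] = True
    return reachable[len(text)]


letters = set("abcdefghijklmnopqrstuvwxyz")
with_ch = letters | {"ch"}

# Both c and h are in the set, so adding {ch} does not change the result.
same_result = is_covered("chata", letters) == is_covered("chata", with_ch)

# A sequence does matter when its parts are not otherwise included.
needs_sequence = is_covered("x\u0308", {"\u1e8d"}) and not is_covered("x\u0308", {"x"})
```

Under these assumptions, {ch} is redundant for coverage once c and h are present. An item such as {ẍ} is not redundant when the combining umlaut does not appear in the set on its own.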

    Christopher Fynn wrote:

    >
    >
    > Mark Davis wrote:
    >
    >> Logically speaking, the set of characters used by a language is
    >> quite fuzzy, so there isn't really a black and white answer (see also
    >> http://www.unicode.org/draft/reports/tr36/tr36.html#Language_Based_Security).
    >>
    >>
    >> What we ended up doing in CLDR was having a core set of characters
    >> for a language (the 'exemplarCharacters'), plus an additional set of
    >> characters that would be seen in customary usage. For example, for
    >> English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê
    >> ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary
    >> set. (http://unicode.org/cldr/data/common/main/en.xml)
    >
    >
    > Mark
    >
    > Should the "exemplar characters" for a language include all the
    > base+combining character *combinations* frequent in that language
    > or - all the base characters and all the combining characters listed
    > separately?
    >
    > - Chris
    >
    >
    >> For the language in question, the latter is derived from dictionaries
    >> and style guidelines for major publications in the language. We don't
    >> have this in place for all languages yet, but will be expanding
    >> coverage in the CLDR 1.4 release, so feedback is welcome.
    >>
    >> Mark
    >>
    >
    >
    >


