Re: CLDR

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu May 18 2006 - 03:19:12 CDT


    On Tue, 16 May 2006, Richard Wordingham wrote:

    > The 'summary' at http://www.unicode.org/cldr/data_formats.html#Exemplar is
    > not completely covered by the text at
    > http://www.unicode.org/reports/tr35/#>. There's a lot of
    > highlighting, so I think the definition is experimental.

    In fact, the two descriptions (or definitions) are essentially different.
    On the other hand, they are both in a _draft_. They also present different
    _reasons_ for including these elements. _None_ of the reasons is
    compatible with the principle of restricting the sets to _letters_. In
    selecting an encoding, in "charset" (i.e., character encoding) conversion,
    and in collation, all characters matter. The same applies to what I
    would see as the really _useful_ potential uses of such sets, e.g. character
    recognition (in scanning), low-level input checks, choice of acceptable
    fonts, and construction of input methods that work with a small set of
    characters in environments where a normal keyboard cannot be used.

    Yet people who provide actual locale data, or are supposed to do so, spend
    their time and effort deciding which characters belong in these sets.
    This is rather frustrating.

    It was suggested in this thread that improvements to formulations be
    proposed. However, I think this is a matter of purpose and content, not
    wording. What is the idea behind including these sets in the first
    place? To me, it seems that there's basically just an abstract idea of
    describing what letters each language uses. That isn't enough, and it
    gives far too much latitude for interpretations.

    If the elements have no well-defined meaning and intended uses,
    they should simply be dropped (or obsoleted or deprecated or whatever)
    from LDML - instead of trying to invent content and use for some elements
    just because they are there. Then start discussing what would really be
    needed, or useful. This might lead to a reincarnation of the two sets, but
    hopefully with clear meanings, other names, and different definitions in
    locales. Or there might be more than two sets, reflecting the need for
    different sets for essentially different types of uses.

    This might ultimately mean, for example, 1) a rock-bottom minimum set of
    characters needed for writing a language, taking into account the effect
    of historical developments, used for purposes like input design and
    typically corresponding to what people have customarily used in E-mail in
    the language; and 2) a set of normal characters used in texts in the
    language, as judged by use in non-specialized texts in printed
    matter and covering rare but not very rare characters, for use in things
    like character recognition and font choices. These would not be "exemplar
    character sets" but common character sets. It seems that the latter would
    actually be more important and more difficult to define.

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

