From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu May 18 2006 - 03:19:12 CDT
On Tue, 16 May 2006, Richard Wordingham wrote:
> The 'summary' at http://www.unicode.org/cldr/data_formats.html#Exemplar is
> not completely covered by the text at
> http://www.unicode.org/reports/tr35/#
> highlighting, so I think the definition is experimental.
In fact, the two descriptions (or definitions) are essentially different.
On the other hand, they are both in a _draft_. They also present different
_reasons_ for including these elements. _None_ of the reasons is
compatible with the principle of restricting the sets to _letters_. In
selecting an encoding, in "charset" (i.e., character encoding) conversion,
and in collation, all characters matter. The same applies to what I
would see as really _useful_ potential uses of such sets, e.g. character
recognition (in scanning), low-level input checks, choice of acceptable
fonts, and construction of input methods that work with a small set of
characters in environments where a normal keyboard cannot be used.
Yet people who provide actual locale data, or are supposed to do so, spend
their time and effort deciding on the matter. This is rather
frustrating.
It was suggested in this thread that improvements to formulations be
proposed. However, I think this is a matter of purpose and content, not
wording. What is the idea behind including these sets in the first
place? To me, it seems that there's basically just an abstract idea of
describing what letters each language uses. That isn't enough, and it
gives far too much latitude for interpretations.
If the elements have no well-defined meaning and intended uses,
they should simply be dropped (or obsoleted or deprecated or whatever)
from LDML - instead of trying to invent content and use for some elements
just because they are there. Then start discussing what would really be
needed, or useful. This might lead to a reincarnation of the two sets, but
hopefully with clear meanings, other names, and different definitions in
locales. Or there might be more than two sets, reflecting the need for
different sets for essentially different types of uses.
This might ultimately mean, for example, 1) a rock-bottom minimum set of
characters needed for writing a language, taking into account the effect
of historical developments, used for purposes like input design and
typically corresponding to what people have customarily used in E-mail in
the language; and 2) a set of normal characters used in texts in the
language, as judged by their use in non-specialized printed texts
and covering rare but not very rare characters, for use in things
like character recognition and font choices. These would not be "exemplary
character sets" but common character sets. It seems that the latter would
actually be more important and more difficult to define.
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/