From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Apr 18 2005 - 03:42:48 CST
From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> How about the following idea of overcoming the difficulty?
> 1. Identify the characters with misleading official names.
> 2. Define better names for them in the "en" locale, and preferably
> in the "fr" locale as well.
> 3. Enhance CLDR with the feature of combining locales, in the sense
> that a user's locale choice can consist of a sequence of locales
> in order of preference. For example, a user's choice could mean
> "use the 'de' locale for anything defined there but the 'en'
> locale for things that aren't defined in the 'de' locale".
This is a feature I have long wanted to see implemented in Java, instead of
the overly basic "locale resolution" algorithm that just successively strips
the variant code, the region code and finally the language code (after
retrying with the system locale). Notably, there is still nothing to resolve
the script code correctly. (Where should it go in a Java locale code? The
best place would be within the language code, set off by a separator or by a
lettercase convention, or else in the bundle resource names.)
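To make the difference concrete, here is a minimal sketch of the combined
resolution discussed above: each preferred locale is first truncated in the
usual way, and only then does the lookup move on to the next locale in the
user's list. The class LocaleChain, its methods, and the "_"-separated tags
are hypothetical names chosen for illustration; none of this is an existing
Java or CLDR API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the JDK or CLDR.
final class LocaleChain {

    // Expand one tag into its truncation chain, most specific first,
    // e.g. "de_DE_POSIX" -> [de_DE_POSIX, de_DE, de].
    // Assumption: tag components are separated by '_'.
    static List<String> truncations(String tag) {
        List<String> out = new ArrayList<>();
        String current = tag;
        while (!current.isEmpty()) {
            out.add(current);
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        return out;
    }

    // Resolve a key through an ordered list of preferred locales.
    // 'data' maps a locale tag to that locale's key/value table.
    // Returns null when no preferred locale defines the key.
    static String lookup(Map<String, Map<String, String>> data,
                         List<String> preferred, String key) {
        for (String tag : preferred) {
            for (String candidate : truncations(tag)) {
                Map<String, String> table = data.get(candidate);
                if (table != null && table.containsKey(key)) {
                    return table.get(key);
                }
            }
        }
        return null;
    }
}
```

With a preference list of "de_DE" then "en", a key that only the "en" table
defines is still found, while anything defined for "de" wins over "en".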
> That way, when accessing a character with a misleading official name,
> the information shown to the user would consist of its localized name
> in the "en" locale (or maybe "fr" locale), unless a name has been defined
> for it in the user's preferred locale.
This is a suggestion I have already made here several times. The
Unicode-hosted CLDR is probably the best place to archive these localized
lists of character names. Yes, it is a huge task, but we are not starting
from nothing: the normative Unicode/ISO/IEC 10646 English and French names
are there, and we just need to correct the few "errors", misleading names or
inaccuracies for the English and French locales.
Also, the CLDR does not need to be part of the standard (the normative names
remain unchanged, so for example the normative English uppercase names would
be those adopted in a "C" locale, and would still be used in the Unicode
regexp "\N{name}" specifiers).
Because of that, the CLDR can be updated as often as needed to reflect the
"best practice" for each language. Reaching a consensus would be much less
critical for this project to advance.
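The layering described above can be sketched as follows: corrected or
localized names live in per-locale tables that can be revised freely, and the
unchanged normative name is only the last fallback. The class CharNames and
its methods are hypothetical; the only real API used is Java's
Character.getName(int), which returns the normative Unicode name (or null for
an unassigned code point). The sample correction for U+01A2 follows the
well-known case where the normative name says OI but the letter is gha.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; CLDR defines no such class.
final class CharNames {

    // locale tag -> (code point -> corrected or localized name)
    private final Map<String, Map<Integer, String>> overrides = new HashMap<>();

    // Register a better name for one character in one locale.
    void define(String locale, int codePoint, String name) {
        overrides.computeIfAbsent(locale, k -> new HashMap<>())
                 .put(codePoint, name);
    }

    // Try each preferred locale in order; if none defines a name, fall back
    // to the normative name, which plays the role of the "C" locale here.
    String displayName(int codePoint, List<String> preferredLocales) {
        for (String locale : preferredLocales) {
            Map<Integer, String> table = overrides.get(locale);
            if (table != null) {
                String name = table.get(codePoint);
                if (name != null) {
                    return name;
                }
            }
        }
        return Character.getName(codePoint); // may be null if unassigned
    }
}
```

After define("en", 0x01A2, "LATIN CAPITAL LETTER GHA"), a user preferring
"de" then "en" sees the corrected name for U+01A2, while a character with no
override, such as U+007A, still shows its normative name.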
After all, Unicode is also hosting another "huge" task with the Unihan
database (related to Han characters), which is still far from complete or
accurate. Some parts of the Unihan database may also become part of the new
localized name lists for Chinese, Japanese and Korean locales (with better
and much more useful descriptive character names than the normative
"English" Unicode character names, which amount to little more than the
hexadecimal code point). Several of the defined Unihan database fields might
be managed more easily in separate locales (for example a Chinese-Pinyin
locale for the Pinyin name), and this would ease the construction of input
method editors that allow sorting and selecting Han ideographs according to
user preferences...
So Unicode, ISO and Unihan already provide at least three localized working
name lists to start from, and "errors" reported in other languages could be
better reflected in the localized name lists as well, even if not all
characters are listed for all locales.
An additional source of information is the subset of "representative
characters" that form the correct alphabet of a language (already specified
in ICU):
- we could rapidly translate at least the names of these characters into
these native written languages;
- and probably into IPA phonetic notation as well, kept in separate locale
data for oral speech (enabling aural identification with speech
synthesizers, in character selectors or spelling text readers), because the
localized native name would often give only the letter itself (like the Z in
LATIN SMALL LETTER Z);
- IPA could also help translators provide accurate localized orthographies
for the names of other characters that are foreign to the target languages
(see for example the various orthographies of English or French names that
are sometimes used for Hebrew or Arabic letters...)