From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat Apr 23 2005 - 14:42:43 CST
At 06:56 AM 4/23/2005, Peter Kirk wrote:
>On 21/04/2005 16:22, Doug Ewell wrote:
>
>>If the move is on to encourage software vendors to develop their own
>>proprietary lists of "accurate" character names for character-map UIs
>>and such, instead of using the official, non-perfect Unicode character
>>names, ...
>
>Has anyone actually suggested this? In my opinion, non-standardised
>proprietary names are even worse than the official but sometimes
>inaccurate names. What we need is a list which is both correct (or at
>least correctible when errors are found!) and standardised. And I accept
>that CLDR rather than Unicode proper may be the best place to go for this.
As has been noted here before, the use that SC2 had for unique names was to
make sure that the characters in the 8859 series (and potentially other SC2
and ISO standards) could easily and uniquely be correlated to their 10646
counterparts. (Some of the early arguments about character names were
driven by the fact that 8859 and 10646 names were not identical).
If that is your primary purpose, then only a single, standard and immutable
list of character names will do. Multiple lists are merely a useless
annoyance, but multiple non-standard lists are worse than useless, as there
is the very real potential of names that cannot be correlated to their code
point unless one has access to a private list, or worse, multiple lists
using the same name for two different characters.
At the moment, most of this discussion is theoretical - there is a need for
people to surface some names for characters in user interfaces, but it is
not clear what the effective constraints on that process are; typos are
annoying to users, but not harmful; some of the character names, while
misleading, are not problematic enough to overcome a pretty clear
identification of the character via its representative glyph; users report
confusion even for some properly constructed character names.
But in the spirit of hypothesizing a solution, I would consider using an
alias mechanism in the way aliases are used for Property names the best
solution. For properties (and their values) there exist multiple aliases,
which are all considered unique.
This mechanism has been used to fix typos in the names of properties and
their values. For example, the Line_Break property value "Inseparable" had
originally been called "Inseperable". Instead of changing that name, the
correct name has become the preferred alias and the incorrect name has been
retained as an alias. (A similar thing was done for a block name:
"Cyrillic_Supplement" is now preferred over the incorrect
"Cyrillic_Supplementary"). The benefits of such a solution are:
1) users can use a 'correct' name to refer to a property and don't need to
use an 'incorrect' name
2) users are guaranteed that software will continue to understand the old
name, as all aliases are considered equivalent descriptions of the property
3) the UTC guarantees that all aliases from the same name space are unique
4) users can rely on the fact that no alias will ever be retired
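
The four guarantees above can be illustrated with a small sketch. This is
not how the UTC maintains its data files; the class name AliasRegistry, its
methods, and the simplified name matching are all assumptions made for
illustration. The only real data used is the Line_Break value with short
name "IN" and its two spellings.

```python
class AliasRegistry:
    """Toy model of a single alias name space (hypothetical, for illustration)."""

    def __init__(self):
        self._entity_by_alias = {}  # normalized alias -> entity id
        self._preferred = {}        # entity id -> preferred alias

    @staticmethod
    def _normalize(name):
        # Simplified matching (an assumption): ignore case, spaces,
        # hyphens and underscores.
        return "".join(ch for ch in name.lower() if ch not in " _-")

    def add(self, entity, alias, preferred=False):
        key = self._normalize(alias)
        owner = self._entity_by_alias.get(key)
        # Guarantee 3: an alias is unique within its name space.
        if owner is not None and owner != entity:
            raise ValueError(f"alias {alias!r} already names {owner!r}")
        self._entity_by_alias[key] = entity
        if preferred or entity not in self._preferred:
            self._preferred[entity] = alias

    def resolve(self, alias):
        # Guarantee 2: every alias, old or new, resolves to the same entity.
        return self._entity_by_alias[self._normalize(alias)]

    def preferred(self, entity):
        # Guarantee 1: users can be shown the corrected name.
        return self._preferred[entity]

    # Guarantee 4: there is deliberately no method to remove an alias.


line_break_values = AliasRegistry()
line_break_values.add("IN", "Inseperable")                  # original, misspelled
line_break_values.add("IN", "Inseparable", preferred=True)  # corrected spelling
```

Under these assumptions, both spellings resolve to the same value, the
corrected spelling is the one surfaced to users, and an attempt to reuse
either spelling for a different value is rejected.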
The current use of aliases for Unicode *character* names does not follow
any of these rules. They are merely alternate names that are known to be
used by some user community. However, if people other than Peter Kirk
consider the current situation in need of a formal solution, then this more
formal form of aliasing would be a way forward. It would have the benefit
of making all the naming, and name stability rules for entities related to
the Unicode Standard more uniform. At the same time, as long as one of the
aliases is formally identified as the alias corresponding to the 10646
character name, there is no direct synchronization issue. Unicode has
always provided additional information for characters.
How could this be done? One very limited way would be to add to the list of
Unicode1.0 character names. That would allow the use of a single alternate
formal alias for characters, which should be quite suitable for corrections
to the names with obvious errors. These would be printed with a special
convention (for example all uppercase). The existing use of informal
character name aliases (in lower or mixed case) would continue as before.
A more extensive approach would be to introduce a full-fledged
CharacterNameAliases.txt file, which would not put an arbitrary constraint
on the number of aliases. Even in this case, the aliases in the file should
be restricted to formal aliases only, which would tend to keep their number
between 1 and 2 for almost all characters (since the original name is
considered an alias as well, the numbers are 1 and 2, rather than 0 and 1).
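
To make the second approach more concrete, here is a rough sketch of how
such a file might be read. The file layout (code point, semicolon, alias,
with "#" comments) is purely an assumption modelled on other UCD data files,
as are the function name and the convention that the first entry for a code
point is the original name. The misspelled name for U+FE18 is the real
published name; the corrected spelling is shown as the kind of formal alias
being discussed.

```python
import io
from collections import defaultdict

# Hypothetical file contents; the format is an assumption, not a proposal.
SAMPLE = """\
# code point ; formal alias
FE18 ; PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
FE18 ; PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
0041 ; LATIN CAPITAL LETTER A
"""


def load_formal_aliases(lines):
    """Return (aliases per code point, code point per alias)."""
    aliases_by_cp = defaultdict(list)
    cp_by_alias = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp_field, alias = (field.strip() for field in line.split(";", 1))
        cp = int(cp_field, 16)
        # Formal aliases must be unique across all characters.
        if cp_by_alias.setdefault(alias, cp) != cp:
            raise ValueError(f"{alias!r} is used for two different characters")
        if alias not in aliases_by_cp[cp]:
            aliases_by_cp[cp].append(alias)
    return dict(aliases_by_cp), cp_by_alias


aliases_by_cp, cp_by_alias = load_formal_aliases(io.StringIO(SAMPLE))
```

In this sample, U+FE18 ends up with two formal aliases and U+0041 with one,
which matches the expectation of between 1 and 2 entries per character.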
This is pretty far from an actual proposal, but I wanted to point out that
we have solved a related problem in the space of property names in the
meantime, so perhaps now would be the time to consider whether our issues
with character names are severe enough to warrant working out such a solution.
A./
PS: the property value aliases are found in
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
and the property aliases are found in
http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
Note that each property has a separate name space for its values,
so that both Script and Block can have a value of "Cyrillic".
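
A minimal sketch of what separate name spaces mean in practice: the value
name is only meaningful together with its property. The dictionary layout
and the function name resolve_value are assumptions for illustration; the
block ranges and the "Cyrl" script code are the real ones.

```python
# One name space per property (hypothetical in-memory layout).
value_namespaces = {
    "Script": {
        "Cyrillic": ("script", "Cyrl"),
    },
    "Block": {
        "Cyrillic": ("block", 0x0400, 0x04FF),
        # Both spellings are aliases for the same block.
        "Cyrillic_Supplement": ("block", 0x0500, 0x052F),
        "Cyrillic_Supplementary": ("block", 0x0500, 0x052F),
    },
}


def resolve_value(prop, value):
    """Look up a value alias inside the name space of one property."""
    return value_namespaces[prop][value]
```

Here "Cyrillic" resolves to different things under Script and Block without
any conflict, while uniqueness still holds inside each property.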