From: Richard T. Gillam (rgillam@las-inc.com)
Date: Thu Apr 21 2005 - 08:22:30 CST
[Meant this to go to the list. Sorry, Peter...]
-----Original Message-----
From: Richard T. Gillam
Sent: Wednesday, April 20, 2005 6:29 PM
To: 'Peter Kirk'
Subject: RE: String name and Character Name
>A list of character names is useful only if it is entirely reliable, or
>at least is moving towards being so. If this list contains only one
>error (and there are a lot more) which is not going to be corrected,
>then the list is worthy of nothing but to be thrown out and replaced -
>if only by another almost identical list, which can be corrected.
Is it just me, or is this topic getting kind of out of hand, and maybe a
bit unnecessarily heated?
If the character names are simply intended to be alternate internal
identifiers for the characters-- alternatives that are a little more
mnemonic than the hex code point values-- they seem to be serving their
purpose perfectly well. In fact, almost anything would work. You could
say the name for U+0041 is "SDFLKJSDLFJSLK" and it'd work fine. (Okay,
that's not too mnemonic. Maybe "POINTY THING WITH CROSSBAR".) In fact,
if they're official internal identifiers, having them be consistent is
way more important than having them be mnemonic.
But because they were originally intended to be mnemonic, they wind up
taking on a resonance beyond just being programmatic identifiers. They
appear to describe the thing they identify. In most cases, they do. In
some cases ("<control>", "CJK UNIFIED IDEOGRAPH-XXXX"), they really
don't. And in a few cases, they mislead.
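To make the three cases concrete, here is a small sketch using Python's
standard unicodedata module (my choice of tool for illustration, nothing
the standard prescribes). The helper name describe() is made up. Note
that this module treats control characters as having no name at all
rather than reporting "<control>", so the sketch supplies a default.

```python
import unicodedata

def describe(ch):
    # unicodedata.name() raises ValueError for characters that have no
    # name (control characters, for instance), so pass a default instead.
    return unicodedata.name(ch, "<no name>")

# A descriptive name, a merely systematic one, a misleading one, and a
# control character that has no name in this module.
for ch in ["A", "\u4e00", "\u2118", "\x00"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```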
The problem seems to be in expecting these identifiers to do more than
they were intended to do. I would argue that software that exposes them
in a user interface is pushing them beyond their boundaries (at least if
anyone other than the Unicoderati is to use them). Even in cases where
the names _are_ descriptive, I'm not sure they should be used (at least
exclusively) in user interfaces-- if I can find U+002E only by searching
for "FULL STOP" and not by searching for "PERIOD", I lose, and if my
native language isn't English, I lose no matter what they say. But is
this the fault of the Unicode standard or the fault of the application?
Maybe what the character names do and do not represent could be better
documented (I didn't look exhaustively, but a quick check in a few of
the obvious places didn't turn up an explanation of the "name"
property). And maybe it'd be worth it to add another character
property, "Alternate Names", where corrections and alternatives to the
formal name could be placed (maybe with some indication of when the
formal name is misleading). Applications that operate on character
names would then have a machine-readable list of alternatives they
should recognize in addition to the formal names.
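As a rough sketch of how an application might use such a list (in
Python, purely for illustration): the ALTERNATE_NAMES table and the
find_char() helper below are hypothetical, and the entries are made-up
examples, not data from any existing Unicode property. The formal name
is tried first, and the alternatives are consulted only if that fails.

```python
import unicodedata

# Hypothetical "Alternate Names" data; entries are illustrative only.
ALTERNATE_NAMES = {
    "PERIOD": "\u002E",
    "DOT": "\u002E",
}

def find_char(name):
    """Return the character for a formal or alternate name, else None."""
    # Normalize case and whitespace so "full  stop" matches "FULL STOP".
    key = " ".join(name.upper().split())
    try:
        # Formal names (and any aliases the module itself knows) win.
        return unicodedata.lookup(key)
    except KeyError:
        # Fall back to the application's own list of alternatives.
        return ALTERNATE_NAMES.get(key)

print(find_char("FULL STOP") == find_char("period"))
```

Trying the formal name first keeps the official identifiers
authoritative; the alternatives only widen what a search will accept.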
Failing this, it seems to me that things like Andrew West's "Unicode
Bloopers" list or the "decode Unicode" project can help a lot here,
although part of me feels like the names that are just flat wrong really
ought to be called out somewhere in the standard, or at least on the
Unicode Web site.
By the way, one famous "blooper" missing from the "Unicode Bloopers"
page is U+2118 SCRIPT CAPITAL P, which is neither script nor capital
(although it is, at least, a P).
--Rich Gillam
Language Analysis Systems, Inc.