From: Hans Aberg (haberg@math.su.se)
Date: Fri Apr 22 2005 - 05:26:21 CST
At 08:38 +0100 2005/04/22, Arcane Jill wrote:
>>I don't know why there is a need for a
>>second "unique and immutable identifier" in addition to the U+xxxx code
>>point identifier. But given that there is such a list, its highly
>>restricted intended purpose should be made more clear. This must be done
>>to reduce the problem of people, even major software companies which are
>>Unicode consortium members, using the list in unintended ways as
>>meaningful text.
>Like some others here, I simply don't see the point of a
>human-readable machine-readable list. One or the other, yes, but not
>both at the same time. There is absolutely no need for an immutable
>machine-readable list to be human-readable /at the same time/.
>U-[xx]xxxx works perfectly well as a unique machine-readable
>identifier, /and/ would work perfectly well as a localization key.
>(In fact, a database table which uses a numeric primary key is
>likely to be more efficient than a database table that uses a string
>primary key).
Giving each abstract character a unique, human-readable name is in
the first place useful to humans who want to use it to identify the
characters. If one wants to, say, define a new character set that
eventually might get its own character numbering, then structurally
it would be better to use those names.
Whether one then uses those character names or the U-X..X character
numbers is just a question of what is useful implementation-wise. If
there is a list by which one can always translate back and forth
between character names and character numbers, then an implementation
can always use, say, the character numbers internally, and translate
into character names whenever a human needs to interpret them. But in
a computer implementation, one should not assume that an efficient
logical representation leads to an efficient computer implementation.
For example, when implementing a functional language, there is an
efficient de Bruijn representation that does not need traditional
lambda variable names, and that has some other pleasing logical
properties. It is nevertheless rarely used in actual functional
language implementations, because debugging, which is carried out by
humans, becomes very difficult.
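That back-and-forth translation already exists in practice; as a
minimal sketch, Python's standard unicodedata module exposes the
name/number list in both directions:

```python
import unicodedata

# Code point -> character name, and back again, using the
# name list shipped with the Python standard library.
cp = 0x03BB
name = unicodedata.name(chr(cp))       # -> "GREEK SMALL LETTER LAMDA"
back = ord(unicodedata.lookup(name))   # name -> code point

print(f"U+{cp:04X} <-> {name}")
assert back == cp
```

An implementation can thus store the compact numbers internally and
render the names only at the moment a human asks for them.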
One can also note that the U-X..X numbers are there only because they
are thought to be efficient with our current computer technologies.
First, if one compares two strings in a computer, that in effect
amounts to comparing multiprecision numbers, which one may want to
avoid in a time-critical application. Second, if one were to
represent a text using the character names explicitly, and then apply
a common compression technique, that compression, if properly done,
will create a character table more efficient than the U-X..X
representation. So with systematic use of suitable compression
techniques, one might do away with both the character numbers U-X..X
and the various character encodings UTF-8/16/32.
In the end, this discussion leads to a familiar one about computer
languages, which usually are all Turing equivalent. If all these
computer languages are Turing equivalent, and thus can express
exactly the same algorithms, why not simply select one computer
language and do away with all the others? In reality, though,
different computer languages differ immensely in how efficiently
different logical structures can be expressed, both for humans and in
the computer. So one ends up with tradeoffs that depend on the
humans, the implementations, and the computer the software will run
on. The same applies to choices such as that between the U-X..X
numbers and the character-name identification.
-- Hans Aberg
This archive was generated by hypermail 2.1.5 : Fri Apr 22 2005 - 05:28:17 CST