From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 26 2005 - 19:11:38 CST
Jukka said:
> >I don't see how the addition of new characters could _invalidate_
> >existing data.
I wouldn't go so far as Michael has in responding to this.
Addition of new characters does *not* invalidate existing
data that used the previously encoded repertoire. The
standard is careful to guarantee that. The UTC even goes out
of its way to ensure that additions of new characters don't
invalidate the *normalized* status of existing normalized data,
which is an even stronger constraint.
The problem, in this particular case, is precisely the kind
of practical problem that Jukka has surmised. We have
an existing A-F that has been used for decades
for hexadecimal numeric representation (this
practice long predates the Unicode Standard, and was inherited
into Unicode from ASCII itself). If you then add *another*
A-F, using characters that look just like the existing A-F,
but which are posited to be only hexadecimal digits (and *not*
letters, even though they look just like the letters they
are cloned from), all hell breaks loose in *future* processing
of hexadecimal numeric expressions.
The problem isn't that existing software would break, but rather
that it would then be gradually forced (inconsistently and
asynchronously) to deal with the addition of these 6 digits,
which behave differently from the way all those processes currently
handle hexadecimal expressions. Most software simply wouldn't
change, but you would have opened the dike to the drip, drip,
drip of people wanting to use the new digits because they
"fix" hexadecimal numbers, and filing bugs and badgering
customer support because your software doesn't "support"
Unicode correctly.
Furthermore, the whole concept just isn't thought through.
A-F have casepairs: a-f.
It doesn't make any sense for hexadecimal digits, if they are
really *numbers*, not letters, to have case pairs.
So let's presume that the 6 new digits are @#$%^& for
10, 11, 12, 13, 14, 15, respectively. [I'm just picking
6 random symbols here to indicate these are distinct
from the existing U+0041..U+0046.]
Currently, hexadecimal representation assumes case folding,
because it involves A-F *and* a-f as alternates. So
0xAB4C can also be represented as 0xab4c, depending on my
style guidelines. In ASCII (or Unicode), that is simply
two strings, separately encoded, and the equivalence between
them is implemented, in numerical parsers and formatters,
via case folding.
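
As a rough illustration of that equivalence, here is a minimal Python
sketch of a parser that folds case before converting. The function
name parse_hex is made up for this example, and real parsers
differ in how they treat prefixes and invalid input.

```python
def parse_hex(text: str) -> int:
    """Parse a hexadecimal string, treating A-F and a-f as equivalent."""
    # Fold case first so "0xAB4C" and "0xab4c" become the same string.
    digits = text.casefold()
    # Strip the conventional "0x" markup if it is present.
    if digits.startswith("0x"):
        digits = digits[2:]
    return int(digits, 16)


# Two separately encoded strings, one numeric value.
print(parse_hex("0xAB4C") == parse_hex("0xab4c"))
```
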
Now let's say I want to represent the number 43,852 using the
new characters. That would end up being "@#4$", and wouldn't
require the "markup" of "0x" mentioned by the OP, because it
contains only digits, and no letters. (Actually, not even
that is correct, because in principle it could also be a
radix 13, 14, or 15 number, as well as a radix 16 number,
but that aside. ... ) The issue now is that I have a
formatting and display problem that I didn't have before, because
I need to be able to display "@#4$" as either "AB4C" or
"ab4c", depending on style. Either I artificially introduce
*another* casing distinction into my brand spanking new
hexadecimal digit characters, or I have introduced a *new*
style markup problem into my hexadecimal digit display that
I didn't have before.
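
To make the radix and display problems concrete, here is a small
Python sketch. It is purely hypothetical: the symbols @#$%^& are the
placeholders used in this message, not real characters with these
properties, and the names NEW_DIGITS, digit_value, parse and display
are invented for the example.

```python
# Placeholder symbols standing in for the proposed digits 10..15.
# This mapping is an assumption for illustration only.
NEW_DIGITS = "@#$%^&"


def digit_value(ch: str) -> int:
    """Return the numeric value of one digit in the hypothetical scheme."""
    if ch in NEW_DIGITS:
        return 10 + NEW_DIGITS.index(ch)
    if ch.isascii() and ch.isdigit():
        return int(ch)
    raise ValueError(f"not a digit: {ch!r}")


def parse(text: str, radix: int = 16) -> int:
    """Interpret text in the given radix; the string does not say which."""
    value = 0
    for ch in text:
        d = digit_value(ch)
        if d >= radix:
            raise ValueError(f"digit {ch!r} is out of range for radix {radix}")
        value = value * radix + d
    return value


def display(text: str, uppercase: bool = True) -> str:
    """Render the new digits in the familiar style; case is now a style choice."""
    alphabet = "0123456789ABCDEF" if uppercase else "0123456789abcdef"
    return "".join(alphabet[digit_value(ch)] for ch in text)


print(parse("@#4$"))                     # read as radix 16
print(parse("@#4$", 13))                 # the same string is also valid in radix 13
print(display("@#4$"))                   # upper-case style
print(display("@#4$", uppercase=False))  # lower-case style
```

The casing choice that used to live in the encoded text itself now
has to be carried as a separate parameter, which is the new style
markup problem described above.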
And on and on... I haven't even started on the apoplectic
fits that would be thrown by security people were Unicode
to introduce identical-looking clones for 6 ASCII letters,
claiming that they were *only* hexadecimal digits.
What we had here was essentially a case of well-intentioned
but ill-advised systematizing by a rather eccentric proposal
writer, without a clue as to what the actual impact would
be on existing systems were anybody to actually attempt to
support it in any way. Furthermore, it was completely
unmotivated, because it failed to demonstrate that anybody
is actually suffering in the handling of hexadecimal numeric
expressions encoded as they currently are -- and have been
for decades.
By the way, it isn't the role of the UTC *or* of WG2 to
publish explanations that will be convincing to any proposal
writer, no matter how eccentric, that their proposal was
wrong and that WG2 was justified in rejecting it. Only the
most reasonable (and generally plugged-in) participants
tend to react that way. Everybody else who has gotten that
far in the process tends to *know* they are right, and
will reject whatever justification is presented by WG2, no
matter how thorough and logical the argumentation provided
for them. (We've seen similar kinds of behavior happening
here right on this thread.)
The role of the UTC and of WG2 is to maintain the Unicode
Standard and ISO/IEC 10646 and to make decisions regarding
character additions. They have open processes for that,
and people can get involved and influence those decisions,
but ultimately decisions are taken, and the committees move
on to the next decisions. It is a fundamental misunderstanding
of those processes to insist that WG2 then behave like a
panel of academics and write up logical explanations that
will convince the world of the irrefutable correctness of
every decision they have taken, item by item.
--Ken