Re: Improper grounds for rejection of proposal N2677

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 26 2005 - 19:11:38 CST


    Jukka said:

    > >I don't see how the addition of new characters could _invalidate_
    > >existing data.

    I wouldn't go so far as Michael has in responding to this.

    Addition of new characters does *not* invalidate existing
    data that used the previously encoded repertoire. The
    standard is careful to guarantee that. The UTC even goes out
    of its way to ensure that additions of new characters don't
    invalidate the *normalized* status of existing normalized data,
    which is an even stronger constraint.

    The problem, in this particular case, is precisely the kind
    of practical problem that Jukka has surmised. We already have
    an existing A-F, which have been used for decades for
    hexadecimal numeric representation -- a practice that long
    predates the Unicode Standard, and was inherited into Unicode
    from ASCII itself. If you then propose to add *another* A-F,
    using characters that look just like the existing A-F, but
    which are posited to be only hexadecimal digits (and *not*
    letters -- even though they look just like the letters they
    are cloned from), then all hell breaks loose in *future*
    processing of hexadecimal numeric expressions.

    The problem isn't that existing software would break, but rather
    that it would then be gradually forced (and inconsistently and
    asynchronously at that) to deal with the addition of these 6
    digits, which behave differently from the way all those processes
    currently handle hexadecimal expressions. Most software simply wouldn't
    change, but you would have opened the dike to the drip, drip,
    drip of people wanting to use the new digits because they
    "fix" hexadecimal numbers, and filing bugs and badgering
    customer support because your software doesn't "support"
    Unicode correctly.

    Furthermore, the whole concept just isn't thought through.

    A-F have case pairs: a-f.

    It doesn't make any sense for hexadecimal digits, if they are
    really *numbers*, not letters, to have case pairs.
    So let's presume that the 6 new digits are @#$%^& for
    10, 11, 12, 13, 14, 15, respectively. [I'm just picking
    6 random symbols here to indicate these are distinct
    from the existing U+0041..U+0046.]

    Currently, hexadecimal representation assumes case folding,
    because it involves A-F *and* a-f as alternates. So
    0xAB4C can also be represented as 0xab4c, depending on my
    style guidelines. In ASCII (or Unicode), that is simply
    two strings, separately encoded, and the equivalence between
    them is implemented, in numerical parsers and formatters,
    via case folding.
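    To make that concrete -- a minimal sketch in Python, not taken
    from the proposal or from any particular implementation -- the
    equivalence is purely a matter of folding the letters before
    the digit lookup:

        # Case folding is what makes "AB4C" and "ab4c" parse to the
        # same value; the two strings themselves remain distinct,
        # separately encoded character sequences.
        def parse_hex(s):
            digits = "0123456789abcdef"
            value = 0
            for ch in s.casefold():      # fold A-F down to a-f
                value = value * 16 + digits.index(ch)
            return value

        assert parse_hex("AB4C") == parse_hex("ab4c") == 0xAB4C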

    Now let's say I want to represent the number 43,852 using the
    new characters. That would end up being "@#4$", and wouldn't
    require the "markup" of "0x" mentioned by the OP, because it
    contains only digits, and no letters. (Actually, not even
    that is correct, because in principle it could also be a
    radix 13, 14, or 15 number, as well as a radix 16 number --
    but let's set that aside.) The issue now is that I have a
    formatting and display problem that I didn't have before, because
    I need to be able to display "@#4$" as either "AB4C" or
    "ab4c", depending on style. Either I artificially introduce
    *another* casing distinction into my brand spanking new
    hexadecimal digit characters, or I have introduced a *new*
    style markup problem into my hexadecimal digit display that
    I didn't have before.
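    To show where that new style problem comes from -- again just a
    hypothetical sketch, using the made-up @#$%^& digits from above,
    which are not real encoded characters -- a display routine for
    the new digits would need an out-of-band style switch that the
    existing letters provide simply by being cased characters:

        # Hypothetical only: "@#$%^&" stand in for the proposed
        # letterlike hex digits for 10..15.
        NEW_DIGITS = "0123456789@#$%^&"

        def display_hex(s, style="upper"):
            # The new digits carry no case of their own, so choosing
            # "AB4C" vs "ab4c" now requires a separate style setting.
            letters = "ABCDEF" if style == "upper" else "abcdef"
            table = {NEW_DIGITS[10 + i]: letters[i] for i in range(6)}
            return "".join(table.get(ch, ch) for ch in s)

        assert display_hex("@#4$", "upper") == "AB4C"
        assert display_hex("@#4$", "lower") == "ab4c"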

    And on and on... I haven't even started on the apoplectic
    fits that would be thrown by security people were Unicode
    to introduce identical-looking clones for 6 ASCII letters,
    claiming that they were *only* hexadecimal digits.

    What we had here was essentially a case of well-intentioned
    but ill-advised systematizing by a rather eccentric proposal
    writer, without a clue as to what the actual impact would
    be on existing systems were anybody to actually attempt to
    support it in any way. Furthermore, it was completely
    unmotivated, because it failed to demonstrate that anybody
    is actually suffering in the handling of hexadecimal numeric
    expressions encoded as they currently are -- and have been
    for decades.

    By the way, it isn't the role of the UTC *or* of WG2 to
    publish explanations that will be convincing to any proposal
    writer, no matter how eccentric, that their proposal was
    wrong and that WG2 was justified in rejecting it. Only the
    most reasonable (and generally plugged-in) participants
    tend to react that way. Everybody else who has gotten that
    far in the process tends to *know* they are right, and
    will reject whatever justification is presented by WG2, no
    matter how thorough and logical the argumentation provided
    for them. (We've seen similar kinds of behavior happening
    here right on this thread.)

    The role of the UTC and of WG2 is to maintain the Unicode
    Standard and ISO/IEC 10646 and to make decisions regarding
    character additions. They have open processes for that,
    and people can get involved and influence those decisions,
    but ultimately decisions are taken, and the committees move
    on to the next decisions. It is a fundamental misunderstanding
    of those processes to insist that WG2 then behave like a
    panel of academics and write up logical explanations that
    will convince the world of the irrefutable correctness of
    every decision they have taken, item by item.

    --Ken


