Re: String name and Character Name

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 12 2005 - 16:02:12 CST

    Ed Trager responded to John Hudson:

    > In as much as no one wants the standard to include incorrect
    > and meaningless things,

    And you will find general consensus on that point.

    > then
    > I think it is perfectly reasonable to contemplate changing
    > incorrect and meaningless names.

    It is perfectly reasonable to contemplate this, which is why
    reasonable people on this list *are* contemplating it.

    > But perhaps we can conclude that it is not a high priority
    > on the agendas of the relevant parties.

    But on that point, you may be misreading the consensus among
    the standards participants on this list.

    It is not the case that:

       It is not a high priority of the relevant parties to
       change incorrect and meaningless names.
       
    It is the case that:

       It *is* a high priority of the relevant parties *not*
       to change *any* character name, once published.
       
    > For anyone to say, "it cannot be changed and won't be changed"
    > without a very good explanation
    > of *why* it cannot be changed just sounds like some sort of
    > hubris in this mailing list, probably
    > not intended, but that's what it sounds like.

    Hubris, no.

    Frustration and aggravation at having to explain established
    policies over and over to people who apparently refuse to
    listen, yes.

    This policy dates from a famous ruckus a decade ago over
    the name of æ and Æ.

    1993-07-08:

       Denmark is issuing this defect report to ISO 10646-1:1993
       based on the naming of Danish, Faroese and Greenlandic letter
       "Æ" in upper and lower case and with acute accent. The
       character "Æ" is also used as letter in the Norwegian
       and Icelandic languages. Please find enclosed an official
       statement from the Danish Standards Association concerning
       the Danish letter "Æ". During the process of writing the
       ISO 10646-1:1993, the naming was correct - for example
       "LATIN CAPITAL LETTER AE" - in the second DIS. It was
       changed to "LATIN CAPITAL LIGATURE AE" in the final version
       of the ISO 10646-1 (1993). ...
       
    This defect report took over two years to resolve, with
    Francophones and Scandinavians at loggerheads every step of
    the way, until DCOR No. 1 to 10646-1:1993 was published in
    1996.

    The Unicode Standard, being synchronized with 10646, was dragged
    along in this process.

    Unicode 1.0

      U+00E6 LATIN SMALL LETTER A E
        = ISO LATIN SMALL LETTER AE <-- the name in ISO 8859-1
        
    Unicode 1.1

      U+00E6 LATIN SMALL LIGATURE AE <-- synchronized with 10646-1:1993
        = LATIN SMALL LETTER A E
      
    Unicode 2.0

      U+00E6 LATIN SMALL LETTER AE <-- applied DCOR No. 1 to 10646-1:1993
        = LATIN SMALL LIGATURE AE
        
    The fact that this entire fight, and the attendant confusion it
    left in *all* of the standards documents from the 1993 - 1996
    period, had not one single beneficial consequence for
    implementations of æ and Æ, and that it left bitter feelings all
    around, led both committees to decide that past a certain point
    such defect reports would be noted but not acted upon, insofar
    as they were requests for changes in names of published characters
    in the standards.

    The *stability* of published character names is far more important
    to the network of interdependent standards that refer to
    character encoding standards than is the correctness of the name.

    But wait! Reasonable people will say, "It's a standard. Of course
    the name should be correct. And if it isn't correct, it should
    be corrected, so the standard is correct."

    I trust that is a fair summary of the position that E. Trager,
    P. Kirk, S. Srivas, and others have been maintaining recently
    on this topic.

    To which I can only reiterate, from experience, that the *stability*
    of published character names is far more important than is the
    correctness of the names.

    People who are using the Unicode Standard need to wrap their
    heads around the reality that it is a *character encoding
    standard*. It is *not* the Universal Encyclopedia of Writing
    Systems and Character Identity.

    Unicode character names are normative for the purposes of the
    character encoding standard and those other IT standards that
    reference it. They are also *immutable*, by action of both
    SC2 and the UTC, because change of character names is almost
    as disruptive of the standards as changing code points for
    characters would be.
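
    One way to see the dependence on names is a small sketch, assuming
    Python and its standard unicodedata module (neither is mentioned in
    this thread; they are used here only as an illustration). Code of
    this kind refers to characters by their published names rather than
    by code point:

```python
import unicodedata

# Source code that refers to characters by their published names.
AE = "\N{LATIN SMALL LETTER AE}"       # name resolved when the code is compiled
SLASH = unicodedata.lookup("SOLIDUS")  # name resolved at run time

def code_point(ch):
    """Format a single character as a U+XXXX code point label."""
    return f"U+{ord(ch):04X}"

print(code_point(AE), unicodedata.name(AE))
print(code_point(SLASH), unicodedata.name(SLASH))
```

    If a published name were changed, references like these would
    presumably stop resolving in any implementation that tracked the
    change, which is the kind of disruption the stability policy is
    meant to prevent.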

    This does *NOT* mean that the Unicode Standard is dictating to
    anyone what the name of some letter in their writing system
    should properly be, whether in English or in any other language.

    That this is the case should be obvious from ASCII characters,
    which, after all, have a long history of this kind of concern,
    well predating Unicode's involvement in character encoding.
    Take U+002F SOLIDUS. Not one American English speaker in
    100,000 would call '/' a "solidus". Its name is "slash" or,
    for older speakers, perhaps "slanted bar", and so forth.
    Use the term "solidus" and everyone will look blankly at you,
    except Classics professors wondering what Roman money has to
    do with it or programming geeks and character encoding mavens,
    who know the term because they read ASCII code charts.
       
    > But even if the mis-named and mis-spelled characters in
    > the Unicode Standard are not changed, there really is
    > nothing stopping me (or you) from displaying what I believe
    > are more correct names for these characters in some
    > website, software, or document that I might write.

    Correct. Note that there is exactly one Unicode names policeman --
    Michael Everson -- and he does not arrest people who display
    alternative names for Unicode characters.

    Nobody is going to object to people reading:

    www.foo.com/index.html

    as "dubdubdub dot foo dot com slash index dot aitch tee em el"

    instead of:

    "LATIN SMALL LETTER W LATIN SMALL LETTER W LATIN SMALL LETTER W
     FULL STOP LATIN SMALL LETTER F LATIN SMALL LETTER O LATIN
     SMALL LETTER O FULL STOP LATIN SMALL LETTER C LATIN SMALL
     LETTER O LATIN SMALL LETTER M SOLIDUS LATIN SMALL LETTER I
     LATIN SMALL LETTER N LATIN SMALL LETTER D LATIN SMALL LETTER E
     LATIN SMALL LETTER X FULL STOP LATIN SMALL LETTER H
     LATIN SMALL LETTER T LATIN SMALL LETTER M LATIN SMALL LETTER L"
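
    The spelled-out reading above can be produced mechanically. A
    minimal sketch, assuming Python's standard unicodedata module; the
    helper name spell_out is made up for this illustration:

```python
import unicodedata

def spell_out(text):
    """Join the normative Unicode name of every character in text."""
    return " ".join(unicodedata.name(ch) for ch in text)

print(spell_out("www.foo.com/index.html"))
```

    Nothing in the output is meant for reading aloud. The names are
    identifiers for the characters, not pronunciations.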

    One of the reasons *why* the Unicode Standard publishes many
    aliases in the Unicode names list is that there often are
    much better, more communicative names for particular characters,
    *EVEN IN ENGLISH*, than the normative names in the data file.

    > In this case, common best practices can make up
    > for imperfections in the standard itself.

    Yes. As long as they are not mis-represented as corrections
    *to* the standard, but instead as alternative, more useful
    names for characters *in* the standard.
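
    In software, that separation might look like the following sketch.
    It assumes Python's standard unicodedata module, and the
    DISPLAY_NAMES table and both function names are hypothetical,
    application-local choices rather than anything defined by the
    standard:

```python
import unicodedata

# Hypothetical, application-local friendly names. These are alternatives
# shown to users, not corrections to the standard.
DISPLAY_NAMES = {
    "/": "slash",
    ".": "dot",
}

def normative_name(ch):
    """Return the published Unicode character name, unchanged."""
    return unicodedata.name(ch)

def display_name(ch):
    """Prefer a local friendly name; otherwise fall back to the normative one."""
    return DISPLAY_NAMES.get(ch) or normative_name(ch)

for ch in "/.\u00e6":
    print(f"U+{ord(ch):04X}", normative_name(ch), "|", display_name(ch))
```

    The normative name stays available for anything that needs to
    interoperate with other standards, while the friendly name is used
    only for presentation.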

    --Ken


