Compatibility Character (was: Re: Emoji: emoticons vs. literacy)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 14 2009 - 17:13:30 CST

    > Asmus Freytag wrote:
    >
    > >> Seems to me that "compatibility characters" means whatever you want
    > >> it to mean at a given moment.
    > > I simply follow the definition. See, for example the glossary:
    > >
    > > "/Compatibility Character. /
    > > A character that would not have been encoded except for compatibility
    > > and round-trip convertibility with other standards"

    Jukka Korpela responded:

    > It's a pseudo-definition.

    Which is nonsense, I'm afraid. What Asmus cited is a descriptive
    definition of the term, as used by the folks in the UTC
    (past and current) who have developed and maintain the standard.

    > It does not make it possible to say, for any given
    > character, whether it is a compatibility character or not.

    Nor is it intended to. It is intended to capture the meaning of
    the term by those who use it in the technical committee.

    > The
    > pseudo-definition refers to assumed intents and motives, not to the standard
    > or accompanying documents. How would you decide whether a particular
    > character is a compatibility character? I mean "you" generically, including
    > people who just read the standard and were not personally involved in the
    > standards work and don't even know anyone who was.

    By asking a knowledgeable participant if they think the
    character in question would not have been encoded except
    for compatibility and round-trip convertibility with
    other standards.

    And yes, that does imply that there is no logically
    air-tight, digitally testable answer to the question,
    "Is U+XXXX a compatibility character?", any more than
    there is to the question, "Was encoding U+XXXX a
    bad idea for the standard?" Such questions involve
    value judgements.

    Examples:

    There would be little controversy amongst the character
    encoding community in determining that U+2502 BOX DRAWINGS LIGHT
    VERTICAL is a "compatibility character". There is consensus
    that there was no reason to encode obsolete box-drawing
    character cell graphics, *except* that they were needed
    for round-trip compatibility with existing code pages and
    other standards. In fact, it is easy to locate information
    about such mappings, e.g. for Code Page 437:

    0xb3 0x2502 #BOX DRAWINGS LIGHT VERTICAL
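
    As a rough illustration of what "round-trip compatibility" means
    for this mapping, here is a minimal Python sketch. It assumes
    Python's built-in "cp437" codec as the mapping table; the helper
    name round_trips is made up for this example.

```python
# Round-trip check for the Code Page 437 mapping quoted above.
# Assumption: Python's built-in "cp437" codec stands in for the mapping table.
legacy_byte = bytes([0xB3])

# Legacy byte -> Unicode
char = legacy_byte.decode("cp437")
assert char == "\u2502"  # BOX DRAWINGS LIGHT VERTICAL

# Unicode -> legacy byte (the return trip)
round_tripped = char.encode("cp437")
assert round_tripped == legacy_byte


def round_trips(codec: str) -> bool:
    """Return True if every byte 0x00-0xFF survives decode then encode."""
    return all(
        bytes([b]).decode(codec).encode(codec) == bytes([b])
        for b in range(256)
    )
```

    Without U+2502 in the standard, the first step would have no
    target and the byte 0xB3 could not make the trip back.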

    On the other hand, there would be much more controversy
    over any determination as to the compatibility character
    status of something like U+00E0 LATIN SMALL LETTER A WITH GRAVE.
    One's position on that depends in part on deep philosophical
    differences endemic to the architectural decisions taken
    early on for Unicode. Those who felt strongly that the
    Latin script should be encoded entirely as decomposed with
    combining marks roundly denounced the precomposed Latin
    characters as "mere" compatibility characters, while
    others insisted that all precomposed Latin characters
    in 8859 8-bit standards had to be encoded as characters
    in Unicode, "for compatibility with and 1-to-1 mapping
    to" those important existing standards. Very few
    current members of the UTC would consider U+00E0 a
    "compatibility character", but by my personal reckoning
    of the history of the standard, that is precisely what
    it is.

    And despite years of attempts to clarify different usage
    in the standard, the different senses of "compatibility
    character" are still routinely confused by lots of people
    talking about them.

    In particular, "compatibility character" as defined above
    is routinely confused with "compatibility decomposable
    character", in part because over the years people have
    also routinely abbreviated "compatibility decomposable
    character" to just "compatibility character".

    "Compatibility decomposable character" itself *is*
    a formal definition, by the way, for which it is easy
    to determine, by algorithm, the exact set of such
    characters, for any version of Unicode.
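
    A minimal sketch of such an algorithm, assuming Python's
    unicodedata module as the source of the decomposition data (so
    the result reflects whatever Unicode version that Python build
    ships, reported by unicodedata.unidata_version). The function
    names are hypothetical.

```python
import sys
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables


def decomposition_kind(ch: str) -> str:
    """Classify one character as "canonical", "compatibility", or "none"."""
    # Hangul syllables decompose canonically by algorithm, so their
    # decomposition field in the data file is empty.
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return "canonical"
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <wide>, <super>, <noBreak>.
    return "compatibility" if mapping.startswith("<") else "canonical"


def compatibility_decomposables() -> list[int]:
    """Return every code point that is a compatibility decomposable."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if decomposition_kind(chr(cp)) == "compatibility"
    ]


if __name__ == "__main__":
    found = compatibility_decomposables()
    print(f"Unicode {unicodedata.unidata_version}: {len(found)} characters")
```

    Nothing comparable can be written for "compatibility character"
    in the glossary sense, since no data file records why a
    character was encoded.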

    Here, for the record, is a cheat sheet for the
    terminology, with examples:

    =====================================================

    U+0061 LATIN SMALL LETTER A

       is *not* a canonical decomposable character
       is *not* a compatibility decomposable character
       is *not* a compatibility character (clearly)
       
    U+2502 BOX DRAWINGS LIGHT VERTICAL

       is *not* a canonical decomposable character
       is *not* a compatibility decomposable character
       *is* a compatibility character (clearly)
       
    U+00E0 LATIN SMALL LETTER A WITH GRAVE

       *is* a canonical decomposable character
       is *not* a compatibility decomposable character
       *is* a compatibility character (arguably)
       
    U+F900 CJK COMPATIBILITY IDEOGRAPH-F900

       *is* a canonical decomposable character
       is *not* a compatibility decomposable character
       *is* a compatibility character (clearly)
          
    U+17C4 KHMER VOWEL SIGN OO

       *is* a canonical decomposable character
       is *not* a compatibility decomposable character
       is *not* a compatibility character (clearly)

    U+FF41 FULLWIDTH LATIN SMALL LETTER A

       is *not* a canonical decomposable character
       *is* a compatibility decomposable character
       *is* a compatibility character (clearly)
       
    U+02B0 MODIFIER LETTER SMALL H

       is *not* a canonical decomposable character
       *is* a compatibility decomposable character
       is *not* a compatibility character (arguably)
       
    U+00A0 NO-BREAK SPACE

       is *not* a canonical decomposable character
       *is* a compatibility decomposable character
       is *not* a compatibility character (clearly)
     
    U+0F77 TIBETAN VOWEL SIGN VOCALIC RR

       is *not* a canonical decomposable character
       *is* a compatibility decomposable character
       is *not* a compatibility character (clearly)
       
       [Note that this Tibetan character *is* deprecated, so
       for other reasons is considered one which should
       not have been encoded, but it was not encoded in
       the first place for compatibility with some other
       standard.]
       
    =======================================================

    Note that while there is never any lack of clarity
    about whether a character is or is not a canonical
    decomposable character or a compatibility decomposable
    character, there is plenty of room for argument
    about the status of being a "compatibility character"
    for the edge cases.
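
    To see the mechanical half of this for yourself, the following
    sketch prints the raw decomposition field for a selection of the
    cheat-sheet entries. It assumes Python's unicodedata module as
    the data source; the SAMPLES list and the describe helper are
    illustrative names only.

```python
import unicodedata

# A selection of the cheat-sheet entries, by code point.
SAMPLES = [0x0061, 0x2502, 0x00E0, 0xF900, 0xFF41, 0x02B0, 0x00A0, 0x0F77]


def describe(cp: int) -> str:
    """Summarize the decomposition status of one code point."""
    field = unicodedata.decomposition(chr(cp))
    if not field:
        kind = "no decomposition"
    elif field.startswith("<"):
        # A leading <tag> marks a compatibility mapping.
        kind = "compatibility decomposable"
    else:
        kind = "canonical decomposable"
    return f"U+{cp:04X}  {kind:<28} {field}"


for cp in SAMPLES:
    print(describe(cp))
```

    The third line of each cheat-sheet entry, whether the character
    is a "compatibility character", has no counterpart in this
    output; that part stays a judgement call.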

    And there is no clear correlation between the status
    of a character as a "compatibility character" and whether
    in hindsight its encoding is considered to be "good"
    or "bad" for the standard.

    --Ken


