"Compatibility character" as defined by TUS (was: Re: Emoji: emoticons vs. literacy)

From: Doug Ewell (doug@ewellic.org)
Date: Tue Jan 13 2009 - 22:59:48 CST

    Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

    >> In http://www.unicode.org/mail-arch/unicode-ml/y2009-m01/0077.html I
    >> asked for a pointer to a full definition of "compatibility character"
    >> in the Unicode 5.x text that would cover "a character that is
    >> *completely unrelated* to any other character in the standard but is
    >> encoded due to 'interoperability needs.'"
    >
    > No need to look very far - just check chapter 2 under compatibility
    > character. (You could have easily found out for yourself, since the
    > text is online).

    No sir. I have the physical book right here in front of me, and here is
    what it says on page 23, under the heading "2.3 Compatibility
    Characters." Gather around, everyone, and follow along.

    "Conceptually, compatibility characters are those that would not have
    been encoded except for compatibility and round-trip convertibility with
    other standards. They are variants of characters that already have
    encodings as normal (that is, non-compatibility) characters in the
    Unicode Standard; as such, they are more properly referred to as
    compatibility variants."

    The remainder of this paragraph, and the next two, refer to Arabic glyph
    forms, CJK compatibility ideographs, and other characters that bear
    visual similarity to the characters of which they could be considered
    "variants."

    Then, under the heading "Compatibility Decomposable Characters":

    "There is a second, narrow sense of the term 'compatibility character'
    in the Unicode Standard, corresponding to the notion of a compatibility
    decomposable introduced in Section 2.2, Unicode Design Principles. This
    sense is strictly defined as any Unicode character whose compatibility
    decomposition is not identical to its canonical decomposition."
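
    The narrow definition quoted above can be checked mechanically. What
    follows is only a sketch, assuming Python's standard-library
    unicodedata module, whose data tracks whatever Unicode version ships
    with the interpreter and not necessarily 5.x. The helper name
    is_compatibility_decomposable is made up for illustration.

```python
import unicodedata


def is_compatibility_decomposable(ch: str) -> bool:
    """Narrow sense: the full compatibility decomposition (NFKD) of the
    character differs from its full canonical decomposition (NFD)."""
    if len(ch) != 1:
        raise ValueError("expected a single code point")
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)


if __name__ == "__main__":
    # A few sample code points; the selection is illustrative only.
    for ch in ("\ufb01", "\u2460", "\u00e9", "\uf900", "A"):
        print(f"U+{ord(ch):04X}", is_compatibility_decomposable(ch))
```

    Under this test U+FB01 (the fi ligature) and U+2460 (circled digit
    one) qualify. U+00E9 does not, and neither does the CJK compatibility
    ideograph U+F900, whose decomposition is canonical, so the narrow
    sense and the broad sense of the term do not pick out the same set.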

    The remainder of this paragraph, and the next two, make further
    reference to characters that are typified by their decomposition
    mappings. There is a passage that *almost* appears, at first glance, to
    admit entire de novo sets of symbols:

    "A large number of compatibility decomposable characters are really
    distinct symbols used in specialized notations, whether phonetic or
    mathematical. They are therefore not compatibility variants in the
    strict sense."

    ... but then goes on to explain that they still must be some sort of
    variant of existing characters:

    "Rather, their compatibility mappings express their historical
    derivation from styled forms of standard letters. In these and similar
    cases, such as fixed-width space characters, the compatibility
    decompositions define possible fallback representations."
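
    The "historical derivation from styled forms" shows up in the
    character database as a tag on the decomposition mapping. A small
    sketch, again assuming Python's unicodedata module; compatibility_tag
    is a hypothetical helper name, not part of any library.

```python
import unicodedata


def compatibility_tag(ch: str):
    """Return the formatting tag of a compatibility mapping (for example
    'font' or 'compat'), or None when the character has a canonical
    mapping or no decomposition mapping at all."""
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        # Mappings look like "<font> 0068": tag first, then code points.
        return mapping[1:mapping.index(">")]
    return None


if __name__ == "__main__":
    for ch in ("\u210e", "\u2003", "\u00b2", "\u00e9"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), compatibility_tag(ch))
```

    U+210E PLANCK CONSTANT carries a font mapping to the letter h, and
    the fixed-width U+2003 EM SPACE carries a compat mapping to U+0020,
    which is the kind of fallback the quoted passage describes.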

    And finally, on page 25, under the heading "Mapping Compatibility
    Characters":

    "Identifying one character as a compatibility variant of another
    character usually implies that the first can be remapped to the second
    without the loss of any textual information other than formatting or
    layout. However, such remapping cannot always take place because many
    of the compatibility characters are included in the standard precisely
    to allow systems to maintain one-to-one mappings to other existing
    character encoding standards and code pages. In such cases, a remapping
    would lose information that is important to maintaining some distinction
    in the original encoding. By definition, a compatibility decomposable
    character decomposes into a compatibly equivalent character or character
    sequence. Even in such cases, an implementation must proceed with due
    caution--replacing one with the other may change not only formatting
    information, but also other technical distinctions on which some other
    process may depend."
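
    The loss that this passage warns about is easy to demonstrate. The
    sketch below assumes Python's unicodedata module and uses NFKC as a
    stand-in for "remapping to the compatibly equivalent character"; the
    function names are invented for the example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply compatibility composition (NFKC) to a string."""
    return unicodedata.normalize("NFKC", text)


def collapses_under_nfkc(a: str, b: str) -> bool:
    """True if two different strings become identical after NFKC, which
    means the distinction between them cannot be recovered afterwards."""
    return a != b and nfkc(a) == nfkc(b)


if __name__ == "__main__":
    print(collapses_under_nfkc("x\u00b2", "x2"))  # superscript two vs. digit two
    print(collapses_under_nfkc("\uff21", "A"))    # fullwidth A vs. ASCII A
```

    Both pairs collapse. A system that needed the fullwidth form to
    round-trip to a legacy code page has lost that distinction once the
    text has been remapped.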

    This is followed by two paragraphs that go into more detail about the
    relationship between a compatibility character and the "standard"
    character or sequence with which it is associated, mostly to say that
    the stylistic differences could affect the meaning of the text or cause
    security problems.

    There is NO TEXT HERE that talks about "compatibility characters" that
    have no relationship whatsoever to existing "standard" characters or
    sequences, but are encoded solely due to "interoperability needs" with
    another standard. Even the passage on page 25 about characters that are
    included for 1-to-1 mapping to other standards -- which is probably what
    I was supposed to notice -- speaks of maintaining "some distinction in
    the original encoding." Do we suppose this means a distinction between
    a front-facing baby chick and a side-facing baby chick?

    If you are seeing words on these three pages that are different from the
    ones I am seeing, please quote the words in your reply.

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

