From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 14 2009 - 17:13:30 CST
> Asmus Freytag wrote:
>
> >> Seems to me that "compatibility characters" means whatever you want
> >> it to mean at a given moment.
> > I simply follow the definition. See, for example the glossary:
> >
> > "/Compatibility Character. /
> > A character that would not have been encoded except for compatibility
> > and round-trip convertibility with other standards"
Jukka Korpela responded:
> It's a pseudo-definition.
Which is nonsense, I'm afraid. What Asmus cited is a descriptive
definition of the term, as used by the folks in the UTC
(past and current) who have developed and maintain the standard.
> It does not make it possible to say, for any given
> character, whether it is a compatibility character or not.
Nor is it intended to. It is intended to capture the meaning of
the term by those who use it in the technical committee.
> The
> pseudo-definition refers to assumed intents and motives, not to the standard
> or accompanying documents. How would you decide whether a particular
> character is a compatibility character? I mean "you" generically, including
> people who just read the standard and were not personally involved in the
> standards work and don't even know anyone who was.
By asking a knowledgeable participant if they think the
character in question would not have been encoded except
for compatibility and round-trip convertibility with
other standards.
And yes, that does imply that there is no logically
air-tight, digitally testable answer to the question,
"Is U+XXXX a compatibility character?", any more than
there is to the question, "Was encoding U+XXXX a
bad idea for the standard?" Such questions involve
value judgements.
Examples:
There would be little controversy amongst the character
encoding community in determining that U+2502 BOX DRAWINGS LIGHT
VERTICAL is a "compatibility character". There is consensus
that there was no reason to encode obsolete box-drawing
character cell graphics, *except* that they were needed
for round-trip compatibility with existing code pages and
other standards. In fact, it is easy to locate information
about such mappings, e.g. for Code Page 437:
0xb3 0x2502 #BOX DRAWINGS LIGHT VERTICAL
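As a rough illustration of that round trip, here is a minimal sketch that assumes Python's built-in "cp437" codec reflects the mapping line quoted above. The variable names are only illustrative.

```python
# Byte 0xB3 in Code Page 437, per the mapping line quoted above.
cp437_byte = bytes([0xB3])

# Decode to Unicode, then encode back to the code page.
decoded = cp437_byte.decode("cp437")
reencoded = decoded.encode("cp437")

print(f"U+{ord(decoded):04X}")   # code point the byte maps to
print(reencoded == cp437_byte)   # True if the round trip is lossless
```

The point is only that the byte survives the trip through Unicode unchanged, which is what round-trip convertibility asks for.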
On the other hand, there would be much more controversy
over any determination as to the compatibility character
status of something like U+00E0 LATIN SMALL LETTER A WITH GRAVE.
One's position on that depends in part on deep philosophical
differences endemic to the architectural decisions taken
early on for Unicode. Those who felt strongly that the
Latin script should be encoded entirely as decomposed with
combining marks roundly denounced the precomposed Latin
characters as "mere" compatibility characters, while
others insisted that all precomposed Latin characters
in 8859 8-bit standards had to be encoded as characters
in Unicode, "for compatibility with and 1-to-1 mapping
to" those important existing standards. Very few
current members of the UTC would consider U+00E0 a
"compatibility character", but by my personal reckoning
of the history of the standard, that is precisely what
it is.
And despite years of attempts to clarify different usage
in the standard, the different senses of "compatibility
character" are still routinely confused by lots of people
talking about them.
In particular, "compatibility character" as defined above
is routinely confused with "compatibility decomposable
character", in part because over the years people have
also routinely abbreviated "compatibility decomposable
character" to just "compatibility character".
"Compatibility decomposable character" itself *is*
a formal definition, by the way, for which it is easy
to determine, by algorithm, the exact set of such
characters, for any version of Unicode.
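One way to compute that set is sketched below. It assumes the definitions are read in terms of full decompositions: a character is canonical decomposable if it differs from its canonical decomposition (NFD), and compatibility decomposable if its compatibility decomposition (NFKD) differs from its canonical one. The function names are hypothetical, and the result reflects whatever Unicode version is bundled with the Python interpreter, not an arbitrary version.

```python
import sys
import unicodedata


def is_canonical_decomposable(ch):
    # Assumption: "not identical to its canonical decomposition".
    return unicodedata.normalize("NFD", ch) != ch


def is_compatibility_decomposable(ch):
    # Assumption: compatibility decomposition differs from the canonical one.
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return nfkd != nfd


def compatibility_decomposables():
    # Scan every code point known to this interpreter.
    return {
        cp
        for cp in range(sys.maxunicode + 1)
        if is_compatibility_decomposable(chr(cp))
    }


found = compatibility_decomposables()
print("Unicode", unicodedata.unidata_version, "->", len(found), "characters")
```

No judgement enters into this: the answer falls out of the data files, which is exactly what distinguishes it from the question of whether something is a "compatibility character".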
Here, for the record, is a cheat sheet for the
terminology, with examples:
=====================================================
U+0061 LATIN SMALL LETTER A
is *not* a canonical decomposable character
is *not* a compatibility decomposable character
is *not* a compatibility character (clearly)
U+2502 BOX DRAWINGS LIGHT VERTICAL
is *not* a canonical decomposable character
is *not* a compatibility decomposable character
*is* a compatibility character (clearly)
U+00E0 LATIN SMALL LETTER A WITH GRAVE
*is* a canonical decomposable character
is *not* a compatibility decomposable character
*is* a compatibility character (arguably)
U+F900 CJK COMPATIBILITY IDEOGRAPH-F900
*is* a canonical decomposable character
is *not* a compatibility decomposable character
*is* a compatibility character (clearly)
U+17C4 KHMER VOWEL SIGN OO
*is* a canonical decomposable character
is *not* a compatibility decomposable character
is *not* a compatibility character (clearly)
U+FF41 FULLWIDTH LATIN SMALL LETTER A
is *not* a canonical decomposable character
*is* a compatibility decomposable character
*is* a compatibility character (clearly)
U+02B0 MODIFIER LETTER SMALL H
is *not* a canonical decomposable character
*is* a compatibility decomposable character
is *not* a compatibility character (arguably)
U+00A0 NO-BREAK SPACE
is *not* a canonical decomposable character
*is* a compatibility decomposable character
is *not* a compatibility character (clearly)
U+0F77 TIBETAN VOWEL SIGN VOCALIC RR
is *not* a canonical decomposable character
*is* a compatibility decomposable character
is *not* a compatibility character (clearly)
[Note that this Tibetan character *is* deprecated, so
for other reasons is considered one which should
not have been encoded, but it was not encoded in
the first place for compatibility with some other
standard.]
=======================================================
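The first two lines of several entries in the cheat sheet can be checked mechanically. The sketch below does that for a selection of the code points listed, assuming the same reading of the definitions in terms of NFD and NFKD and using the Unicode data bundled with Python. ROWS and formal_status are hypothetical names. The third line of each entry, the "compatibility character" status, is deliberately absent, since it is a judgement and cannot be computed.

```python
import unicodedata

# A selection of code points from the cheat sheet.
ROWS = [0x0061, 0x2502, 0x00E0, 0xF900, 0xFF41, 0x02B0, 0x00A0, 0x0F77]


def formal_status(cp):
    """Return (canonical decomposable?, compatibility decomposable?)."""
    ch = chr(cp)
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return (nfd != ch, nfkd != nfd)


for cp in ROWS:
    canonical, compat = formal_status(cp)
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}: canonical={canonical} compatibility={compat}")
```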
Note that while there is never any lack of clarity
about whether a character is or is not a canonical
decomposable character or a compatibility decomposable
character, there is plenty of room for argument
about the status of being a "compatibility character"
for the edge cases.
And there is no clear correlation between the status
of a character as a "compatibility character" and whether
in hindsight its encoding is considered to be "good"
or "bad" for the standard.
--Ken