From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Nov 14 2003 - 20:35:32 EST
John Cowan said:
> Kenneth Whistler scripsit:
>
> > However, there were character encoding standards committees,
> > predating the UTC, which did not understand this principle,
> > and which encoded a character for the Ångstrom sign as a
> > separate symbol. In most cases this would not be a problem,
> > but in at least one East Asian encoding, an Ångstrom sign
> > was encoded separately from {an uppercase Å of the Latin script},
> > resulting in two encodings for what really is the same thing,
> > from a character encoding perspective.
>
> But IIRC they did so in two separate character encoding standards
> which the UTC for reasons of its own decided to treat as one standard.
Yeah, could be.
The issue can be seen in JIS X 0208, which has an Ångstrom symbol
(Row 2, Cell 82), but no accented Latin, and then JIS X 0212,
which has a bunch of accented Latin, including Å (Row 10, Cell 9).
They are separate standards, but JIS X 0212 was designed as
a discontiguous extension of JIS X 0208. You aren't supposed
to unify its characters against the JIS X 0208 characters.
The fact that JIS X 0212 basically failed, and has been replaced
by a rather different JIS X 0213 extension wasn't something that
could be foreseen in detail back in 1989 when these initial
repertoires were being collected.
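
For anyone who wants to see the two code points side by side, here is
a rough sketch in Python. It assumes Python's built-in "euc_jp" codec,
which carries JIS X 0208 as two bytes and JIS X 0212 as a 0x8F prefix
plus two bytes; the helper name euc_jp_bytes is made up for this
illustration. The two positions cited above should come out as
U+212B and U+00C5 respectively.

```python
# Illustrative only: relies on Python's built-in "euc_jp" codec, which
# covers JIS X 0208 (two bytes) and JIS X 0212 (0x8F prefix + two bytes).

def euc_jp_bytes(row: int, cell: int, jis_x_0212: bool = False) -> bytes:
    """Turn a row/cell position into EUC-JP bytes (hypothetical helper)."""
    pair = bytes([0xA0 + row, 0xA0 + cell])
    return (b"\x8f" + pair) if jis_x_0212 else pair

# JIS X 0208, Row 2, Cell 82: the Angstrom symbol.
angstrom_sign = euc_jp_bytes(2, 82).decode("euc_jp")

# JIS X 0212, Row 10, Cell 9: Latin capital A with ring above.
latin_a_ring = euc_jp_bytes(10, 9, jis_x_0212=True).decode("euc_jp")

print(f"U+{ord(angstrom_sign):04X}", f"U+{ord(latin_a_ring):04X}")
```

The two decoded strings compare unequal, which is exactly the
"two encodings for the same thing" situation described above.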
>
> > Note that there are also piles of "compatibility characters" in
> > Unicode which have no decomposition mapping whatsoever,
> > and which thus are completely unimpacted by normalization.
>
> If someone undertook to prepare a draft list of these, would the
> UTC consider blessing it, in corrected form? It is disconcerting
> that the notion "compatibility character" is so fuzzily defined.
Actually, part of the point of my discussion of compatibility
characters is to indicate that "compatibility character" per se
*is* a very fuzzy and contingent concept. It is more a matter
of character encoding history than something that should be
normatively defined so that implementations and other
specifications depend on it in some crucial way. Even longtime
experts on the UTC will have disagreements regarding just which
characters are "compatibility characters" and which are not. My
statement that Å (and by implication most other precomposed
Latin characters) is, in a way, a compatibility character would
itself be somewhat contentious. It depends in part on what
your vision is regarding how Unicode *should* be, as opposed to
just what it currently is defined to be.
What matters for implementations and related specifications
are the normatively defined statuses of certain characters as
having decomposition mappings that designate them either
as compatibility decomposable characters or as canonical
decomposable characters. That status *is* clearly and unambiguously
defined for every Unicode character.
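
As a minimal sketch of reading that status, assuming Python's
standard unicodedata module (which reflects whatever version of the
Unicode Character Database ships with the interpreter): a mapping
that starts with a <tag> is a compatibility decomposition, and an
untagged mapping is a canonical one. The function name
decomposition_status is invented for this example.

```python
import unicodedata

def decomposition_status(ch: str) -> str:
    """Classify one character by its decomposition mapping field."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        # No mapping listed. Caveat: Hangul syllables decompose
        # algorithmically and also show an empty mapping here.
        return "none"
    if mapping.startswith("<"):
        # Tagged mappings such as "<compat> 0066 0069".
        return "compatibility"
    return "canonical"

for ch in ("\u212b", "\u00c5", "\ufb01", "A"):
    print(f"U+{ord(ch):04X}", decomposition_status(ch),
          unicodedata.decomposition(ch) or "-")

# Canonical equivalence in action: the Angstrom sign normalizes
# to the same sequences as Latin capital A with ring above.
nfc = unicodedata.normalize("NFC", "\u212b")
nfd = unicodedata.normalize("NFD", "\u212b")
```

Note that this reports only the normative decomposition status. It
says nothing about which characters someone might informally call
"compatibility characters".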
Rather than trying to figure out what all the compatibility
characters are, I think a much more interesting list would be
the list of all the *useful* characters in Unicode.
In other words, while the IRG and other committees are busy
haggling over what the Basic CJK Subset should be, which would
be useful for small implementations of Han, maybe the rest
of you could come up with what the Basic Non-CJK Subset of
Unicode should be, omitting all the accumulated dreck of
duplications, mistakes, misguided experiments, and modelling
errors inherited from older encodings (or stuffed into Unicode
by the UTC or WG2).
--Ken