O.k., some terminological discussion first.
ISO 10646:
10646 does not make a formal distinction between
base characters and precomposed (or composite)
characters. Characters in 10646 are simply the
elements which are encoded in the standard.
10646 DOES make a formal distinction between
combining characters and non-combining characters.
In 10646 non-combining characters followed by
one or more combining characters (termed a
"composite sequence") are associated with a
resultant "graphic symbol", but that graphic
symbol does not have status as a character.
Unicode:
Unicode formally defines base characters,
decomposable characters, and decomposition.
Characters in Unicode have base or decomposable
characteristics, in addition to being the
elements encoded in the standard.
Unicode makes the same formal distinction between
combining characters and non-combining characters
as 10646.
Unicode formally defines canonical equivalence
between base characters followed by combining
characters (of the subtype non-spacing marks),
and any character whose decomposition has the
same combining character sequence.
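To make the Unicode-side definitions concrete, here is a minimal sketch using Python's standard unicodedata module. It reflects the character database shipped with a present-day Python rather than the drafts under discussion here, and the helper names canonical_decomposition and is_nonspacing_mark are made up for illustration.

```python
import unicodedata

A_ACUTE = "\u00C1"          # LATIN CAPITAL LETTER A WITH ACUTE
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Returns an empty list when the character has no canonical
    decomposition. (Hypothetical helper, not part of any standard.)
    """
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; they are not
    # canonical decompositions, so they are ignored here.
    if not mapping or mapping.startswith("<"):
        return []
    return [int(code_point, 16) for code_point in mapping.split()]


def is_nonspacing_mark(ch):
    """True when ch has general category Mn (non-spacing mark)."""
    return unicodedata.category(ch) == "Mn"
```

Under these assumptions, canonical_decomposition(A_ACUTE) gives the code points for A followed by COMBINING ACUTE ACCENT, while a plain "A" has no decomposition and so counts as a base character in this sense.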
The upshot of this is that for the 10646 formalists:
U+0041 A + U+0301 COMBINING ACUTE ACCENT has a
graphic symbol that looks like U+00C1 LATIN CAPITAL
LETTER A WITH ACUTE, but it isn't the same as U+00C1.
For the Unicode formalists:
U+0041 A + U+0301 COMBINING ACUTE ACCENT is
canonically equivalent to U+00C1 LATIN CAPITAL
LETTER A WITH ACUTE, and there are conformance
implications which prevent interworking processes
from enforcing a distinction in interchange, though
they may observe a distinction in processing.
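The two views can be sketched side by side in Python. The assumption here is that canonical equivalence is tested by comparing canonically decomposed (NFD) forms, using the normalization support in the standard unicodedata module, which was specified after this discussion. The function names are illustrative only.

```python
import unicodedata

composite_sequence = "\u0041\u0301"  # A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"               # LATIN CAPITAL LETTER A WITH ACUTE


def same_encoded_elements(a, b):
    """The 10646-formalist view: compare the encoded elements themselves."""
    return a == b


def canonically_equivalent(a, b):
    """The Unicode-formalist view: compare canonical decompositions.

    Assumes NFD comparison is an acceptable test for canonical equivalence.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

A process may still observe the difference internally, as same_encoded_elements does, while treating the two strings as the same text whenever canonically_equivalent says so.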
I think it would be fair to say that much of the remaining
contention and disagreement swirling around Level 1/2/3 in
10646, and the differing opinions about the desirability of
encoding more Latin characters with various accents, basically
comes down to different assessments of the implications of
these two formal approaches.
It is my contention that the sticking points are primarily
philosophical and language-political at this point. Sharply
held opinions that "my letter" deserves to be encoded as
a character are as likely to drive encoding decisions
in the standards process as any implementation considerations.
However, I hope that the recent comments by engineers who
have working Unicode implementations will help convince people
that there are no truly insurmountable problems in implementing
combining characters that would justify manning the parapets
for a desperate defense of the Level 1 bastion against the
horde of combiners sowing chaos.
And if people could manage to be a little less concerned with
maintaining the fine points of ontos versus phainos for
characters, we could start to focus instead on defining and
believing in the abstractions that the software presents
to users as more significant than the identity of the encoded
data elements.
--Ken Whistler