Re: Letters vs. precomposed characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 30 1996 - 14:19:55 EDT


O.k., some terminological discussion first.

ISO 10646:

        10646 does not make a formal distinction between
        base characters and precomposed (or composite)
        characters. Characters in 10646 are simply the
        elements which are encoded in the standard.

        10646 DOES make a formal distinction between
        combining characters and non-combining characters.

        In 10646 non-combining characters followed by
        one or more combining characters (termed a
        "composite sequence") are associated with a
        resultant "graphic symbol", but that graphic
        symbol does not have status as a character.

Unicode:

        Unicode formally defines base characters,
        decomposable characters, and decomposition.
        Characters in Unicode have base or decomposable
        characteristics, in addition to being the
        elements encoded in the standard.

        Unicode makes the same formal distinction between
        combining characters and non-combining characters
        as 10646.

        Unicode formally defines canonical equivalence
        between base characters followed by combining
        characters (of the subtype non-spacing marks),
        and any character whose decomposition has the
        same combining character sequence.
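As a rough illustration of the Unicode-side properties just listed, here
is a minimal Python sketch. It assumes the standard library's unicodedata
module as a stand-in for the character database; the helper name
describe() is hypothetical, and the module reflects whatever Unicode
version ships with the interpreter, not the 1996 text of either standard.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical decomposition) for one character.

    describe() is a hypothetical helper for illustration only.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),       # "Mn" marks a non-spacing combining mark
        unicodedata.decomposition(ch),  # empty string if the character has none
    )

base = describe("\u0041")         # a base character
mark = describe("\u0301")         # a combining character (non-spacing mark)
precomposed = describe("\u00C1")  # a decomposable character

for entry in (base, mark, precomposed):
    print(entry)
```

The decomposition reported for U+00C1 names exactly the sequence
U+0041 U+0301, which is the property the canonical equivalence
definition rests on.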

The upshot of this is that for the 10646 formalists:

        U+0041 A + U+0301 COMBINING ACUTE ACCENT has a
        graphic symbol that looks like U+00C1 LATIN CAPITAL
        LETTER A WITH ACUTE, but it isn't the same as U+00C1.

For the Unicode formalists:

        U+0041 A + U+0301 COMBINING ACUTE ACCENT is
        canonically equivalent to U+00C1 LATIN CAPITAL
        LETTER A WITH ACUTE, and there are conformance
        implications which prevent interworking processes
        from enforcing a distinction in interchange, though
        they may observe a distinction in processing.
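
The two readings can be sketched in Python as follows. This is only an
illustration under stated assumptions: it uses the normalization forms
exposed by the standard unicodedata module (tooling that is not part of
the discussion above), and canonically_equivalent() is a hypothetical
helper name, not anything defined by either standard.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Compare two strings after full canonical decomposition (NFD).

    Hypothetical helper: two strings are treated as canonically
    equivalent when their canonical decompositions are identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

sequence = "\u0041\u0301"  # A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"     # LATIN CAPITAL LETTER A WITH ACUTE

# A process may observe the distinction: the encoded elements differ.
print(sequence == precomposed)

# Under canonical equivalence the two are treated as the same text.
print(canonically_equivalent(sequence, precomposed))
```

The plain comparison corresponds to the 10646 view, in which the
sequence and the precomposed character are simply different encoded
elements; the normalized comparison corresponds to the Unicode view.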

I think it would be fair to say that much of the remaining
contention and disagreement swirling around Level 1/2/3 in
10646, and the differing opinions about the desirability of
encoding more Latin characters with various accents, basically
comes down to different assessments of the implications of
these two formal approaches.

It is my contention that the sticking points are primarily
philosophical and language-political at this point. Sharply
held opinions that "my letter" deserves to be encoded as
a character are as likely to drive encoding decisions
in the standards process as any implementation considerations.

However, I hope that the recent comments by engineers who
have working Unicode implementations will help convince people
that there are no truly insurmountable problems in implementing
combining characters that would justify manning the parapets
for a desperate defense of the Level 1 bastion against the
horde of combiners sowing chaos.

And if people could manage to be a little less concerned with
maintaining the fine points of ontos versus phainos for
characters, we could start to focus instead on defining and
believing in the abstractions that the software presents
to users as more significant than the identity of the encoded
data elements.

--Ken Whistler


