Mark said:
> In somewhat more detail:
>
> In general, a single abstract character corresponds to a single code point.
> However, due to the requirement of compatibility with legacy code sets, plus
> some inherent fuzziness in what constitutes abstract characters, there are
> cases where this is not true:
And I'll try to help with the visualization, by providing prototypical
instances of each of these cases:
>
> - one abstract character can correspond to two different code points
{a with ring above} ==> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
==> U+212B ANGSTROM SIGN (singleton canonical equivalence
to U+00C5)
This is only the most notorious example. There are hundreds of such
examples to be found among the CJK Compatibility characters.
> - one abstract character can correspond to a sequence of two code points
{a with ring above} ==> <U+0041, U+030A>
These are the obvious instances: precomposed characters and, in particular,
their canonically equivalent combining character sequences.
> - one code point can correspond to two different abstract characters
{Latin baseline ellipsis}
==> U+2026 HORIZONTAL ELLIPSIS
{CJK centerline ellipsis}
{Greek capital alpha}
==> U+0391 GREEK CAPITAL LETTER ALPHA
{Coptic capital alpha}
These are instances of unifications for the encoding. Some we deal with
and get on with our lives. Others provoke arguments for disunification,
as for the Coptic example.
> - one code point can correspond to a sequence of two abstract characters
{f} + {i} ==> U+FB01 LATIN SMALL LIGATURE FI
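For anyone who wants to check these cases mechanically, here is a minimal sketch using Python's standard `unicodedata` module. The choice of Python and the helper name `code_points` are assumptions made for illustration, not anything from the standard. The unification case (one code point for two abstract characters) is not shown, since it is a matter of interpretation and no normalization form can detect it.

```python
import unicodedata

def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

A_RING = "\u00C5"            # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"          # ANGSTROM SIGN
A_PLUS_RING = "\u0041\u030A" # A followed by COMBINING RING ABOVE
FI_LIGATURE = "\uFB01"       # LATIN SMALL LIGATURE FI

# One abstract character, two different code points:
# the singleton canonical equivalence folds U+212B into U+00C5.
print(code_points(unicodedata.normalize("NFC", ANGSTROM)))     # ['U+00C5']

# One abstract character, a sequence of two code points:
# NFC composes the sequence, NFD decomposes the precomposed form.
print(code_points(unicodedata.normalize("NFC", A_PLUS_RING)))  # ['U+00C5']
print(code_points(unicodedata.normalize("NFD", A_RING)))       # ['U+0041', 'U+030A']

# One code point, a sequence of two abstract characters:
# the fi ligature has only a compatibility decomposition,
# so NFC leaves it alone while NFKD splits it into f + i.
print(code_points(unicodedata.normalize("NFC", FI_LIGATURE)))  # ['U+FB01']
print(code_points(unicodedata.normalize("NFKD", FI_LIGATURE))) # ['U+0066', 'U+0069']
```

Note that the first two cases are canonical equivalences, so any of the normalization forms will reconcile them, whereas the ligature only comes apart under the compatibility forms (NFKC or NFKD).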
--Ken