Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Kenneth Whistler ([email protected])
Date: Fri Feb 23 2001 - 14:48:17 EST


Mark said:

> In somewhat more detail:
>
> In general, a single abstract character corresponds to a single code point.
> However, due to the requirement of compatibility with legacy code sets, plus
> some inherent fuzziness in what constitutes abstract characters, there are
> cases where this is not true:

And I'll try to help with the visualization, by providing prototypical
instances of each of these cases:

>
> - one abstract character can correspond to two different code points

{a with ring above} ==> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
                    ==> U+212B ANGSTROM SIGN (singleton canonical
                        equivalence to U+00C5)

This is only the most notorious example. There are hundreds of such
examples to be found among the CJK Compatibility characters.
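
For anyone who wants to poke at this case, here is a minimal sketch using Python's standard unicodedata module. The variable names are just illustrative, and the choice of NFC is an assumption made for the demonstration (NFD would show the same equivalence).

```python
import unicodedata

a_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"   # ANGSTROM SIGN

# As raw code points the two strings differ.
same_before = (a_ring == angstrom)

# Canonical normalization folds them together, because U+212B has a
# singleton canonical decomposition to U+00C5.
nfc_a_ring = unicodedata.normalize("NFC", a_ring)
nfc_angstrom = unicodedata.normalize("NFC", angstrom)
same_after = (nfc_a_ring == nfc_angstrom)

print(same_before, same_after)
print(unicodedata.decomposition(angstrom))
```

Comparing text only after normalization is one common way to keep this kind of duplication from leaking into string equality.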

> - one abstract character can correspond to a sequence of two code points

{a with ring above} ==> <U+0041, U+030A>

These are the obvious instances involving precomposed characters, and in
particular their canonically equivalent composed character sequences.
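
A small sketch of the same point, again assuming Python's standard unicodedata module (the names precomposed and sequence are illustrative): the precomposed code point and the two-code-point sequence convert into each other under canonical normalization.

```python
import unicodedata

precomposed = "\u00C5"   # one code point
sequence = "A\u030A"     # U+0041 followed by U+030A COMBINING RING ABOVE

# NFD expands the precomposed character into the sequence;
# NFC recomposes the sequence into the single code point.
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", sequence)

print([f"U+{ord(c):04X}" for c in decomposed])
print(len(precomposed), len(sequence), recomposed == precomposed)
```

Note that len() here counts code points, not abstract characters, which is exactly the distinction under discussion.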

> - one code point can correspond to two different abstract characters

{Latin baseline ellipsis}
                           ==> U+2026 HORIZONTAL ELLIPSIS
{CJK centerline ellipsis}

{Greek capital alpha}
                           ==> U+0391 GREEK CAPITAL LETTER ALPHA
{Coptic capital alpha}

These are instances of unifications for the encoding. Some we deal with
and get on with our lives. Others provoke arguments for disunification,
as for the Coptic example.

> - one code point can correspond to a sequence of two abstract characters

{f} + {i} ==> U+FB01 LATIN SMALL LIGATURE FI
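
As a hedged illustration of the ligature case (Python's standard unicodedata module; variable names are illustrative): the single code point unfolds into two letters only under compatibility normalization, since its decomposition is a compatibility mapping rather than a canonical one.

```python
import unicodedata

ligature = "\uFB01"   # LATIN SMALL LIGATURE FI

# Compatibility decomposition (NFKD) yields the two letters f and i.
compat = unicodedata.normalize("NFKD", ligature)

# Canonical decomposition (NFD) leaves the ligature untouched.
canon = unicodedata.normalize("NFD", ligature)

print(unicodedata.name(ligature))
print(list(compat), canon == ligature)
```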

--Ken
