Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Peter_Constable@sil.org
Date: Fri Feb 23 2001 - 14:59:20 EST


On 02/23/2001 09:58:55 AM John Cowan wrote:

>Mark Davis wrote:
>
>>> A _code_point_ is an integer value which is assigned to an abstract
>>> character. Each character receives a unique code point.
>>
>>
>> inaccurate. Multiple *abstract characters* can have a single code point;
>> multiple code points can correspond to a single *abstract character*.
>
>TUS 3.0 is vague on this, but I suppose what is meant is that if two
>single characters are canonically equivalent, they constitute only one
>abstract character. Does U+0041 U+0300 represent one abstract
>character (the same as the abstract character represented by U+00C0)
>or two consecutive abstract characters? If the former, does U+0051
>U+0300 also represent an abstract character?

I'm surprised at what Mark wrote. In the sense of abstract character as
defined in the Standard, 0041 and 0300 represent distinct abstract
characters, and the sequence <0041, 0300> does not represent an abstract
character but a sequence of abstract characters that happen to be
canonically equivalent (which is not the same as "is the same as") to the
abstract character 00C0. The sequence <0041, 0300> may represent a single
grapheme in a particular writing system, but that's also another matter.
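
The distinction between "canonically equivalent" and "is the same as" can be
checked mechanically. What follows is only a minimal sketch, assuming Python 3
and its standard unicodedata module (neither is part of the original
discussion); the variable names are made up for illustration.

```python
import unicodedata

# <0041, 0300>: LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT
decomposed = "\u0041\u0300"
# 00C0: LATIN CAPITAL LETTER A WITH GRAVE
precomposed = "\u00C0"

# As sequences of code points the two strings differ (two code points vs. one).
same_code_points = decomposed == precomposed

# They are canonically equivalent: both have the same canonical decomposition.
canonically_equivalent = (
    unicodedata.normalize("NFD", decomposed)
    == unicodedata.normalize("NFD", precomposed)
)

# John's second case: <0051, 0300> has no precomposed counterpart,
# so NFC composition leaves it as two code points.
q_grave = unicodedata.normalize("NFC", "\u0051\u0300")

print(len(decomposed), len(precomposed))
print(same_code_points, canonically_equivalent)
print(len(q_grave))
```

Normalization only shows that the two encoded forms are interchangeable under
canonical equivalence; it does not settle how many abstract characters are
involved, which is a matter of the definitions.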

I think Mark is either temporarily off his game, or else he's obfuscating
terminology. "Abstract character" is defined in definition D3 on p. 40 of
TUS3.0. The relationship between abstract characters and code points is
defined in UTR17: "An abstract character is defined to be in a coded
character set if the coded character set maps from it to an integer. That
integer is said to be the code point for the abstract character." UTR17
doesn't make this clear, but the mapping between abstract characters and
integers is a bijection, i.e. 1:1. Thus, it is impossible for multiple
abstract characters (as here defined) to map to a single code point, or for
a single abstract character to map to multiple code points.
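
To make the 1:1 claim concrete, here is a minimal sketch, assuming a toy coded
character set held in a Python dict. The table toy_ccs and the helper
is_one_to_one are hypothetical names invented for this illustration, not
anything defined by TUS or UTR17.

```python
# Hypothetical toy coded character set: abstract character -> code point.
# The labels are only stand-ins for the abstract characters themselves.
toy_ccs = {
    "LATIN CAPITAL LETTER A": 0x0041,
    "COMBINING GRAVE ACCENT": 0x0300,
    "LATIN CAPITAL LETTER A WITH GRAVE": 0x00C0,
}


def is_one_to_one(mapping):
    """Return True if no two keys map to the same value.

    Dict keys are unique by construction, so each abstract character already
    has exactly one code point; only the reverse direction needs checking.
    """
    return len(set(mapping.values())) == len(mapping)


# The inverse table (code point -> abstract character) is only well defined
# when the mapping is 1:1.
code_point_to_char = {cp: name for name, cp in toy_ccs.items()}

print(is_one_to_one(toy_ccs))
print(code_point_to_char[0x00C0])
```

In this toy table, 00C0 is its own entry, separate from 0041 and 0300, which
matches the reading above that <0041, 0300> is a sequence of two abstract
characters rather than a second encoding of a single one.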

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


