Re: Repertoire, encoding, and representation (Was: Charsets + encoding + codesets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Oct 07 1997 - 19:40:03 EDT


John Cowan wrote:

>
> Kenneth Whistler wrote:
>
> > The Unicode Standard talks about abstract characters. <a-acute> is an
> > example of an abstract character in the Latin script. <d-dental-voiceless>
> > is another example of an abstract character in the Latin script.
>
> Well, this is very clear, and perhaps is the way things *should* be,
> but I don't see that it's what the Standard says, and indeed, it appears
> to me to directly contradict what the Standard says. To paraphrase
> A.P. Herbert on Parliament: if the Unicode Consortium does not mean
> what it says in the Unicode Standard, it must say so.

John has indeed caught me out not being very careful about my
terminology. Mea culpa if this has contributed to the confusion.

The normative definition of "abstract character" on page 3-4
of the Unicode Standard means "character" as used in
ISO standards.

Unicode:

   abstract character: a unit of information used for the organization,
      control, or representation of textual data.

10646:

   character: a member of a set of elements used for the organisation [sic],
      control, or representation of data.

The intent of the Unicode definition is set membership in the
same sense as 10646. If I remember rightly, the use of the qualifier
"abstract" here was to emphasize the character as an unencoded entity,
apart from the coded value assigned to it. Usually in the text,
the "abstract" is dropped, so that both standards use the
term "character" synonymously in this sense.

Ken Whistler's private definition of "abstract character", as used
yesterday and today, does not refer to a member of the set
of things given encoded values in the standard, but instead to
the larger set of potential units of information, regardless of
their standing as a member of the set of characters per se. Since
that difference seems to be leading to these misunderstandings,
how about substituting some entirely different term for it:
perhaps "grapheme", a significant unit of an orthography or
writing system.

>
> [much entirely correct stuff on combining character sequences
> snipped]
>
> > Keld is, of course, correct that the repertoire of abstract characters
> > is open.
>
> Unfortunately, this remark collides with these statements
> on page 3-4, which are presumptively normative:
>
> A Unicode abstract character is represented by a single
> Unicode code value; the only exception [sic] to this are
> surrogate pairs (which are provided for future extension,
> but are not currently used to represent any abstract
> characters).
>
> (Perhaps this paragraph is not normative, but if so I don't
> see how to tell what parts of Chapter 3 are not normative.)
>
> So the term "coded character representation", which is defined as
> "an ordered sequence of one or more code values which is associated
> with an abstract character", can only refer (in Unicode 2.0) to a
> single codepoint or two successive codepoints forming a surrogate
> pair. It *cannot* refer to a combining character sequence,
> because (in general) a combining character sequence is not
> "represented by a single Unicode code value". <d-dental-voiceless>
> is not so represented, and is not an abstract character.

Correct on the close reading of page 3-4. So restate as follows:

<d-dental-voiceless> is a "grapheme" representable by a sequence of
   Unicode (abstract) characters.

Then canonical equivalence is the relationship between alternate
character sequences (which may consist of only a single character
or a combining character sequence) that represent the same
grapheme.
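
To make the restated definition concrete, here is a minimal sketch in
Python using the standard unicodedata module. It is only an illustration:
the module follows whatever Unicode version ships with the interpreter
(not Unicode 2.0), the helper names are made up for this example, and the
code points chosen for <d-dental-voiceless> (d + COMBINING BRIDGE BELOW +
COMBINING RING BELOW) are an assumed spelling, not one given above.

```python
import unicodedata

# Two alternate character sequences for the grapheme <a-acute>.
A_ACUTE_PRECOMPOSED = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE
A_ACUTE_SEQUENCE = "a\u0301"     # a + COMBINING ACUTE ACCENT

# Assumed spelling of <d-dental-voiceless> (illustrative choice only):
# d + COMBINING BRIDGE BELOW + COMBINING RING BELOW.
D_DENTAL_VOICELESS = "d\u032A\u0325"


def canonically_equivalent(s, t):
    """True if s and t share the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


def code_points(s):
    """List the code values of s in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in s]


if __name__ == "__main__":
    # The two spellings differ as sequences but represent one grapheme.
    print(code_points(A_ACUTE_PRECOMPOSED), code_points(A_ACUTE_SEQUENCE))
    print(canonically_equivalent(A_ACUTE_PRECOMPOSED, A_ACUTE_SEQUENCE))
    # This grapheme has no single-character form under the assumed spelling,
    # so even the composed form (NFC) remains a combining character sequence.
    print(code_points(unicodedata.normalize("NFC", D_DENTAL_VOICELESS)))
```

The comparison goes through NFD on both sides so that neither spelling
is treated as the privileged one; comparing NFC forms would serve equally
well for this purpose.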

>
> The term "abstract character" is only useful in Unicode 2.0
> for lumping assigned non-surrogate codepoints and assigned
> surrogate pairs (the latter being currently an empty set)
> as corresponding to abstract characters,
> and all other codepoints, including D800-DFFF,
> as not corresponding to abstract characters.
>
> I conclude, therefore, that there are at present 38,885 abstract
> characters in Unicode, all of which are represented by single
> Unicode code values. I wish it were otherwise, but it is not.

How about, I will agree to stop making use of the confusing
term "abstract character" in these so-called clarifications,
if Keld will agree to stop claiming that the Unicode repertoire
is infinite!

--Ken

> To make it so, the offending paragraph of page 3-4 would
> have to be rewritten somewhat as follows:
>
> A Unicode abstract character can be represented by
> a single Unicode code value, or by a single surrogate pair
> (no surrogate pair currently represents
> any abstract character), or by one of several Unicode
> code values which are canonically equivalent, or by a
> combining character sequence, or a sequence of Hangul
> jamos representing a single syllable, or by other means.
>
> --
> John Cowan http://www.ccil.org/~cowan cowan@ccil.org
> e'osai ko sarji la lojban
>


