Re: surrogate terminology

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 12 2000 - 16:07:26 EDT


Peter noted:

> > We do need to clean up terminology, and we need to do so in a way that
> > incorporates understanding of UTR-17. I think we need:
> >
> > - BMP characters: characters in the BMP; note that d800-dfff are not
> > characters; fffe and ffff are also not characters
> > - "astral"/supplementary/extended-plane/?? characters: everything in planes
> > 1 - 16 (excluding anything ending in fffe and ffff)

This is part of a discussion of terminology regarding surrogates
that has been ongoing among an ad hoc group working on the proposed
UTR on surrogate handling, and a separate but related discussion
among the editorial committee. Now it seems to have migrated out
to the general list.

Misha noted:

>
> I can't stand "astral planes". The term suggests, to me at
> least, that these planes (and, hence, the characters in them)
> are not as "real" as the BMP.
>
> By contrast, "supplementary planes" is a factual description.
>

I'll repeat some of the consensus that seems to have emerged from
the other smaller list discussions.

1. The terminology used by 10646 and by the Unicode Standard should
   be convergent in this area, to minimize the proliferation of
   confusion. The FCD for 10646-2 already uses the term "supplementary
   planes", and this seems perfectly acceptable for the Unicode
   Standard as well.

10646 definition:

plane: A subdivision of a group; of 256 x 256 cells.

Suggested Unicode definitions that could be added to the Unicode
glossary, to cover this convergence:

plane: A subdivision of the encoding space; 64K code points starting
       on an even 64K boundary. (Plane 0: 0x0000..0xFFFF;
       Plane 1: 0x10000..0x1FFFF; etc.)

BMP: Basic Multilingual Plane, a synonym for Plane 0.

SMP: Supplementary Multilingual Plane, a synonym for Plane 1.

The Supplementary Planes: The collective term for Planes
       1 through 16, considered as a group.

The Astral Planes: Jocular synonym for the Supplementary Planes.
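
To make the suggested definitions concrete, here is a minimal sketch in
Python. The function names plane_of and is_supplementary are made up for
illustration and are not taken from any standard or library; the only
assumption is the plane layout given above (64K code points per plane,
Planes 0 through 16).

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of Plane 16


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) that contains code point cp."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a code point: {cp:#x}")
    # Each plane holds 64K (0x10000) code points, so the plane number
    # is simply everything above the low 16 bits.
    return cp >> 16


def is_supplementary(cp: int) -> bool:
    """True if cp lies in the Supplementary Planes (Planes 1 through 16)."""
    return plane_of(cp) >= 1


if __name__ == "__main__":
    for cp in (0x0041, 0xFFFF, 0x10000, 0x1FFFF, 0x20000, 0xE0001):
        print(f"U+{cp:04X}: plane {plane_of(cp)}, "
              f"supplementary={is_supplementary(cp)}")
```

Under these definitions, "BMP" is just plane_of(cp) == 0, and the SMP is
plane_of(cp) == 1.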

2. The plane names in the FCD for 10646-2 should be modified just
   slightly to tie together the terminology better. The best
   suggestion to date is:

>Plane 1: Supplementary Multilingual Plane for scripts and symbols (SMP)
>Plane 2: Supplementary Ideographic Plane (SIP)
>Plane 14: Supplementary Special-purpose Plane (SSP)

   This makes consistent use of "supplementary plane", and ties the
   plane names and acronyms together in a way which can actually be
   remembered without having to look up the TLAs.

3. The term "surrogate character" should be eschewed altogether, because
   of the confusion it causes. "Surrogate code point" can continue to
   be used as it currently is, and the term "surrogate pair" is also
   useful. But the other terminology related to characters should be
   coordinated with establishing "supplementary planes" as the way to
   refer to Planes 1-16. Some text I wrote earlier about this topic,
   in response to a suggestion to use the terms "extended character"
   and "basic character":

I don't like "extended character", because of the cognitive dissonance
regarding whether the character is an ordinary character that extends
the set located elsewhere, or whether the character itself is extended
in some way -- that is bound to cause confusion, since the UTF-16
encoding scheme for these "extended characters" extends the encoding
form to 2 wydes, as well as extending the character set by adding
the character.

Because of that, I think "supplementary character" is a far better choice
for talking about characters on Planes 1-16. There can be no confusion
there with the mechanics of the encoding form, and there is no artificial
discrimination in that term regarding the status of the good characters
we like in the Supplementary Planes versus the bad characters we don't like
in the Supplementary Planes -- just as for characters in the BMP.
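
The distinction between the character and the mechanics of the encoding
form can be illustrated with a short Python sketch. A supplementary
character is a single code point; it is only its UTF-16 representation
that takes two 16-bit units (a surrogate pair). The function name
to_utf16_units is hypothetical, chosen here for illustration only.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (16-bit values) for code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points are not characters and cannot be encoded.
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp < 0x10000:
        # BMP character: one code unit, identical to the code point.
        return [cp]
    # Supplementary character: split the 20-bit offset into two halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


if __name__ == "__main__":
    for cp in (0x0041, 0x10000, 0x1D11E, 0x10FFFF):
        units = " ".join(f"{u:04X}" for u in to_utf16_units(cp))
        print(f"U+{cp:04X} -> {units}")
```

Note that the result for a supplementary character is a "surrogate pair"
of code units, not a pair of "surrogate characters", which is exactly the
usage argued for in point 3.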

And I would prefer not to start talking about characters in the BMP
as "basic characters", since, as we know, there are many thousands of
them that aren't particularly basic (or useful for implementation).

--Ken


