Re: Surrogate space in Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 16 2001 - 13:06:46 EST


Tom Lord asked:

> > It has proven difficult to come up with convenient terms for
> > the Unicode characters encoded at U+10000 and beyond.
> > [....]
> > 2. A 'basic' code point, which may represent a 'basic
> > character', can range from U+0000 through U+FFFF.
> >
> > For what purpose is such a distinction needed?
>

And Doug Ewell answered:

> It is needed because of UTF-16, which requires two 16-bit code points to
> represent a character with a value of U+10000 or higher (a supplementary
> character) but only one 16-bit code point to represent a basic character.

This is correct, except that it is two 16-bit code *units*, not code
points, that are required to represent a supplementary character.

For the UTF-32 encoding form, there is nothing special about supplementary
characters (characters whose Unicode scalar value, i.e. code point, is
between 0x10000 and 0x10FFFF), except that they've only recently started
to be standardized.

For the UTF-8 encoding form, supplementary characters are represented in
4 bytes, while basic characters are represented in 1, 2, or 3 bytes. This
could have an implication for an implementation, although proper UTF-8
implementations should already be handling them correctly. The big issue
is for UTF-8 implementations that *incorrectly* handle supplementary
characters as sequences of two 3-byte representations of surrogate code
points. In order to talk meaningfully about those issues, a terminological
distinction between basic and supplementary characters is useful.
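
As an illustration of the UTF-8 point above, here is a minimal Python sketch.
The helper names (utf8_length, strict_decoder_rejects) and the sample
character U+10400 are assumptions made for this example only. It contrasts
the correct 4-byte form with the incorrect form built from two 3-byte
surrogate sequences.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3   # longest form needed for a basic character
    return 4       # every supplementary character


supplementary = "\U00010400"  # U+10400, chosen only as a sample

# Correct UTF-8: a single 4-byte sequence.
correct = supplementary.encode("utf-8")

# Incorrect form: each UTF-16 surrogate code point (D801, DC00) encoded as
# its own 3-byte sequence. Python only emits this with "surrogatepass".
incorrect = "\ud801\udc00".encode("utf-8", "surrogatepass")


def strict_decoder_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(correct.hex(), len(correct))      # f0909080 4
print(incorrect.hex(), len(incorrect))  # eda081edb080 6
print(strict_decoder_rejects(incorrect))
```

A strict decoder, such as the one Python ships, refuses the 6-byte form,
which is one practical reason to keep the two classes of character apart
when describing implementations.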

For the UTF-16 encoding form, as Doug pointed out, the difference is between
1 code unit and 2 code units for the representation of a code point.
That distinction is rather significant for many Unicode implementations,
and again a terminological distinction is useful.
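
The 1-versus-2 code unit split can be sketched in Python as follows. The
function name utf16_code_units is hypothetical, and the arithmetic is the
usual surrogate-pair computation rather than anything quoted from the
standard's text.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Basic character: the code unit is the code point itself.
        return [code_point]
    # Supplementary character: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x0041)])   # ['0x41']
print([hex(u) for u in utf16_code_units(0x10400)])  # ['0xd801', '0xdc00']
```

Code that indexes or counts 16-bit units therefore gives different answers
from code that counts code points as soon as a supplementary character
appears in the text.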

Finally, for comparison to ISO/IEC 10646, it is also useful to have a
terminological distinction that lines up with the international standard.
10646 has settled on the term "supplementary planes" to refer to Planes
1 through 16, so the use of the term "supplementary character" in Unicode
to refer to characters encoded on the supplementary planes makes it easier
to understand what is intended, no matter which of the two standards you
are coming from.
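
In terms of planes, the distinction reduces to a shift by 16 bits. The
following is a small Python sketch; plane_of and is_supplementary are names
invented for this example, not terms from either standard.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 through 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16   # each plane holds 0x10000 code points


def is_supplementary(code_point: int) -> bool:
    """True for code points on Planes 1 through 16."""
    return plane_of(code_point) >= 1


print(plane_of(0xFFFF), is_supplementary(0xFFFF))      # 0 False
print(plane_of(0x10000), is_supplementary(0x10000))    # 1 True
print(plane_of(0x10FFFF), is_supplementary(0x10FFFF))  # 16 True
```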

>
> Many descriptions on the Web erroneously claim that Unicode contains only the
> first 64K characters of ISO 10646. Even the Unicode Standard Version 3.0
> states, "Plain Unicode text consists of sequences of 16-bit character codes."
> To me this sentence is very misleading and requires that special attention
> be paid to the nature of supplementary characters, those to be assigned in
> Unicode 3.1 and those to be assigned in future versions.

That sentence will be updated eventually.

The critical piece of text in the standard is conformance clause C1 on
page 37, which currently reads:

"C1 A process shall interpret Unicode code values as 16-bit quantities.

* Unicode values can be stored in native 16-bit machine words."

In Unicode 3.1, about to be published in UAX #27, that wording is being
changed to:

"C1 A process shall interpret the Unicode code units in accordance with
the Unicode Transformation Format used.

* The Unicode Standard defines code points (scalar values) that can
be encoded in any of three transformation formats (encoding forms):
UTF-8, UTF-16, or UTF-32."

The PDUTR #27 text currently accessible on the website does not yet
show this change, which was just accepted at the recent UTC meeting,
but expect an updated text for what will eventually become UAX #27
to show up on the site in approximately a week.

--Ken


