Re: U+xxxx, U-xxxxxx, and the basics

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 08 2000 - 18:09:11 EST


Keld responded to Mike Brown's questions:

>
> On Fri, Mar 03, 2000 at 04:57:21PM -0800, Mike Brown wrote:
> >
> > The mapping of abstract characters from a character repertoire to integers
> > in a code space is called a "coded character set". Other names for such
> > mappings are "character encoding", "coded character repertoire", "character
> > set definition", or "code page". Each abstract character in a coded
> > character set is an "encoded character".
>
> "character encoding" alsoincludes a transformation format or coded character
> set shifting techniques, such as ISO 2022. So delete that term here.

Yes, "character encoding" has been misapplied in such ways, but most of
us understand it to be a synonym for "coded character set". UTR #17 tries
to clarify these distinctions, so as to minimize the misapplication of
such terms in the future.

>
> > In Unicode, each abstract character is mapped to a scalar value in the range
> > 0x0..0x10FFFF. This "Unicode scalar value" uniquely identifies the
> > character. Within that 0x0..0x10FFFF range, there are certain sub-ranges
> > that are not assigned to characters by the standard; they are reserved for
> > special functions, future extension mechanisms or private character
> > assignments.
>
> Is it so? Last time I looked, Unicode characters did not go beyond 0xFFFF.
> Then "surrogates" were defined as characters, and two surrogates
> could be joined to form something else (this terminology
> was in conflict with 10646).

Surrogates were never *defined* as characters. See Unicode 2.0, page 3-7:

D25 high-surrogate: A Unicode code value in the range U+D800 through U+DBFF.

D26 low-surrogate: A Unicode code value in the range U+DC00 through U+DFFF.

D27 surrogate pair: a coded character representation for a single
    abstract character which consists of a sequence of two Unicode values,
    where the first value of the pair is a high-surrogate and the second
    is a low-surrogate.

This is *exactly* the same as the definitions now found in Unicode 3.0,
page 45.

Appendix C clarifies the terminological correspondences between Unicode
surrogates and 10646: "In ISO/IEC 10646, high-surrogates are called RC-elements
from the high-half zone and low-surrogates are called RC-elements from
the low-half zone."

Also, because Unicode 2.0 officially adopted UTF-16, user-defined characters
from Planes 15 and 16 have been accessible in Unicode since that time -- so there
was no limitation to 0x0..0xFFFF. (Though many implementations chose not
to make use of scalar values past 0xFFFF.)
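
To make the arithmetic behind D25-D27 concrete, here is a small sketch
(Python, my own illustration rather than text from either standard) of how a
scalar value beyond 0xFFFF is split into, and recovered from, a surrogate pair:

    def to_surrogate_pair(scalar):
        """Split a scalar value above 0xFFFF into a UTF-16 surrogate pair."""
        assert 0x10000 <= scalar <= 0x10FFFF
        offset = scalar - 0x10000          # 20-bit value
        high = 0xD800 + (offset >> 10)     # top 10 bits -> high-surrogate (D800..DBFF)
        low = 0xDC00 + (offset & 0x3FF)    # low 10 bits -> low-surrogate (DC00..DFFF)
        return high, low

    def from_surrogate_pair(high, low):
        """Recombine a high/low surrogate pair into a single scalar value."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # Example: a private-use scalar value from Plane 15
    print([hex(v) for v in to_surrogate_pair(0xF0000)])   # ['0xdb80', '0xdc00']
    print(hex(from_surrogate_pair(0xDB80, 0xDC00)))       # 0xf0000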

Please do not misinterpret the Unicode Standard, and then imply that Unicode
is wrong, simply because your misinterpreted version of it is wrong.

> >
> > 3. Code values, or "code units", are numbers that computers use to represent
> > abstract objects, such as Unicode characters. Code values are typically
> > 8-bit, 16-bit, or 32-bit wide non-negative integers. An encoded character,
> > or rather, the integer representing an abstract character in a coded
> > character set, can be mapped to a sequence of one or more code values. This
> > mapping is called an "encoding form".
>
> What is "code values" ? Bytes/octets? Normally you would only map
> the integer to one value.

Please read the standard. (p. 41 in Unicode 3.0)

D5 Code value: the minimal bit combination that can represent a unit of
   encoded text for processing or interchange.

The term "code value" is used synonymously with "code unit". For that, see
UTR #17.

The Unicode 2.0 text mistakenly claimed that "code value" could also be
equated to "code point" (= "code position" in 10646 terminology). That
has been cleaned up in the text, in the light of the clarifications of UTR #17.

And no, it is not always the case that you map the integer to one "code value".
DBCS encodings and UTF-16 are both examples where the integer is mapped to
more than one code value.
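
A concrete illustration (a Python sketch of my own, not text from the
standard): the same encoded character maps to different numbers of code values
depending on the encoding form.

    for scalar in (0x0041, 0x00E9, 0x10300):
        ch = chr(scalar)
        utf8_units = list(ch.encode('utf-8'))                 # 8-bit code values
        raw = ch.encode('utf-16-be')
        utf16_units = [int.from_bytes(raw[i:i+2], 'big')
                       for i in range(0, len(raw), 2)]        # 16-bit code values
        print(hex(scalar),
              [hex(u) for u in utf8_units],
              [hex(u) for u in utf16_units])

    # U+0041  -> one 8-bit code value,    one 16-bit code value
    # U+00E9  -> two 8-bit code values,   one 16-bit code value
    # U+10300 -> four 8-bit code values,  two 16-bit code values (a surrogate pair)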

> >
> > In UTF-16, each 16-bit code value in the 0x0..0xD7FF range and the
> > 0xE000..0xFFFF range directly corresponds to the same scalar value, while a
> > "surrogate" pair of 16-bit code values algorithmically represents a single
> > scalar value in the range 0x010000..0x10FFFF. The first half of the pair is
> > always in the 0xD800..0xDBFF range, and the second half of the pair is in
> > the 0xDC00..0xDFFF range. Unicode 3.0 and ISO/IEC 10646-1;2000 have adopted the
> > UTF-16 mechanism as the only official usage of the 0xD800..0xDFFF scalar
> > range.
>
> It is ISO/IEC 10646-1:2000 (note colon instead of semicolon).
> Previous versions of 10646 also had UTF-16 in there (since AMD 4 I think).

UTF-16 has been officially in Unicode since the Unicode Standard, Version 2.0,
published in 1996.

UTF-16 became officially a part of 10646 as a result of Amendment 1, also published
in 1996.

Amendments to 10646 do not create "versions" of 10646, except in a trivial
sense.

ISO/IEC 10646-1:1993 was the first *edition* of 10646-1.

ISO/IEC 10646-1:2000 (incorporating Amendments 1-31, Technical Corrigenda 1-2,
   and all editorial corrigenda to date) is the second *edition* of 10646-1.

ISO/IEC 10646-1:2000, the second edition of 10646-1, is the first edition
that incorporates the text of Amendment 1 into the published text of the
entire standard.

> >
> > 4. Each abstract character has one or two "Unicode values", which is the
> > code value or pair of code values that represent that character's scalar
> > value in the UTF-16 encoding form. Unicode uses a "U+xxxx" notation to
> > designate Unicode values. Since Unicode values are UTF-16 code values,
> > encoded characters with scalar values in the 0x0..0xFFFF range are
> > represented with one U+xxxx designation, and encoded characters with scalar
> > values in the 0x010000..0x10FFFF range are represented with a pair of U+xxxx
> > designations.
>
> This is different from 10646, which only has one canonical value
> for an abstract character. Maybe you can say that too.

Except for the differences in the way the concepts are described and the
exact terms used in the text, this is *exactly* what is intended in both
the Unicode Standard and 10646.

Again, read the standards.

Annex C of 10646-1:2000 (UTF-16)

 "UTF-16 provides a coded representation of over a million graphic characters
  of UCS-4 in a form that is compatible with the two-octet BMP form of UCS-2."

 Translated into Unicode speak, that is: "UTF-16 (the default encoding form
 for Unicode) makes use of surrogates to represent over a million abstract
 characters in a form that is compatible with use of 16-bit code values."

Annex D of 10646-1:2000 (UTF-8)

 "UTF-8 is an alternative coded representation form for all of the characters
  of the UCS."

So 10646-1:2000 explicitly has two alternative coded representation forms
(not just the one canonical value of UCS-4). And when you make use of the
UTF-16 alternative coded representation form, characters are represented
exactly as described above.
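
To see that the two coded representation forms carry the same abstract
character, here is a small sketch (Python, illustrative only, using its
built-in codecs rather than anything normative):

    ch = chr(0x10300)   # an abstract character outside the BMP (OLD ITALIC LETTER A)

    # UCS-4 / UTF-32: a single canonical 32-bit value for the character
    ucs4 = int.from_bytes(ch.encode('utf-32-be'), 'big')

    # UTF-16: the same character as two 16-bit values (a surrogate pair)
    raw = ch.encode('utf-16-be')
    utf16 = [int.from_bytes(raw[i:i+2], 'big') for i in range(0, len(raw), 2)]

    print(hex(ucs4))                   # 0x10300
    print([hex(v) for v in utf16])     # ['0xd800', '0xdf00'], i.e. U+D800 U+DF00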

Please don't continue to sow seeds of doubt among implementers about
differences between the Unicode Standard and 10646 that don't exist.
WG2 and the UTC have worked very hard to keep these things in synch.

There *are* differences -- most notably in the text of the standards and
the terminology used. But in almost all instances, the *intent* of both
standards is identical, by design.

> >
> > 5. An algorithm for converting code values to a sequence of 8-bit values
> > (bytes, octets) for cross-platform data exchange is a "character encoding
> > scheme". Encoding forms that produce 7-bit or 8-bit code value sequences
> > don't need additional processing, so UTF-8, for example, can be considered
> > to be both a character encoding form and a character encoding scheme. Other
>
> not so. UTF-8 always implies both the algorithm and the specific codes.
> So it is not a character encoding scheme.

Yes so. The term "UTF-8" is used in both ways. It is used to mean the
character encoding form (as above for 10646). But it is also used when
people are referring to the way UTF-8 data is serialized as bytes, in which
case it is contrasted with UTF-16BE and UTF-16LE.

>
> > encoding forms, however, need to have a consistent mechanism applied to
> > convert their 16-bit or 32-bit code value sequences to 8-bit sequences.
> > Unicode 3.0 has the character encoding schemes UTF-16BE and UTF-16LE for
> > this purpose. These work like UTF-16 but break up each code value into a
> > sequence of pairs of bytes, with each byte pair being either in Big Endian
> > order for UTF-16BE (the byte with the most significant bits comes first) or
> > Little Endian order for UTF-16LE.
>
> UTF-16LE and UTF-16BE are not encoding schemes either.

Of course they are. See UTR #17 for the explicit claim. Your use of
"encoding scheme" apparently does not match that defined in UTR #17.

>
> > 6. A "character map" correlates an abstract character in a character
> > repertoire with a specific sequence of bytes. Other words for a character
> > map are a "character set", "charset" (as in the IANA registry), "charmap",
> > or sometimes "code page".
>
> Please do not promote the term "character set" in this relation.

I agree that this use of "character set" is unclear and should not be
encouraged. But I would argue that most implementers seeing "charset"
equate it with the term "character set", so this is a de facto usage,
however infelicitous.

--Ken


