U+xxxx, U-xxxxxx, and the basics

From: Mike Brown (mbrown@corp.webb.net)
Date: Fri Mar 03 2000 - 20:06:34 EST


Now that I have the Unicode 3.0 book at my disposal, I'm trying to rewrite
an XML tutorial that I started last month, making it more accurate with
regard to confusing terminology like "code position" and "code value". The
text below is an attempt to pare down Unicode Technical Report #17 and
sections 0.2 and 3.3 of the Unicode Standard 3.0. In it I tried to
incorporate some examples of U- and U+ notation, which are not very well
documented. Can someone please look over it and tell me if I got it right?
If so, I will attempt to make some diagrams.

Thanks.

P.S., Assume I've already defined what an abstract character is.

   - Mike
___________________________________________________________
Mike J. Brown, software engineer, Webb Interactive Services
XML/XSL stuff: http://www.skew.org/ http://www.webb.net/

1. In Unicode, each abstract character has a descriptive name, in English,
like "LATIN CAPITAL LETTER A", and may have additional names that are
translations of the English name into other languages.

2. In general, a set of abstract characters is a "character repertoire". A
"code space" is a set of "code points" (or "code positions"), which are
scalar values: non-negative integers that are not necessarily contiguous.

The mapping of abstract characters from a character repertoire to integers
in a code space is called a "coded character set". Other names for such
mappings are "character encoding", "coded character repertoire", "character
set definition", or "code page". Each abstract character in a coded
character set is an "encoded character".

In Unicode, each abstract character is mapped to a scalar value in the range
0x0..0x10FFFF. This "Unicode scalar value" uniquely identifies the
character. Within that 0x0..0x10FFFF range, there are certain sub-ranges
that are not assigned to characters by the standard; they are reserved for
special functions, future extension mechanisms or private character
assignments.

Aside from the Universal Character Set shared by the Unicode Standard and
ISO 10646-1, other popular coded character sets include US-ASCII (128
abstract characters mapped to scalar values in the range 0x0..0x7F) and
ISO-8859-1 (US-ASCII plus another 96 abstract characters mapped to scalar
values in the range 0xA0..0xFF).

Here are 3 ways of representing the Unicode scalar value of the abstract
character named "ANGSTROM SIGN":
   * in hexadecimal notation: 0x212B
   * in decimal notation: 8491
   * in the EBNF notation used by the XML spec: #x212B

Here is a way of representing the abstract character itself, using its
scalar value:
   * in Unicode "U-" notation (eight hexadecimal digits): U-0000212B
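
Just to make the relationships concrete, here is how the pieces line up in
an interactive Python 3 session (Python is only an illustration here, not
part of any of the standards; any language with Unicode support would do):

    import unicodedata

    ch = '\u212b'          # the encoded character, written via its scalar value
    unicodedata.name(ch)   # -> 'ANGSTROM SIGN'  (the descriptive name)
    ord(ch)                # -> 8491             (scalar value, decimal)
    hex(ord(ch))           # -> '0x212b'         (scalar value, hexadecimal)
    chr(0x212B) == ch      # -> True             (scalar value back to character)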

3. Code values, or "code units", are numbers that computers use to represent
abstract objects, such as Unicode characters. Code values are typically
non-negative integers 8, 16, or 32 bits wide. An encoded character, or
rather, the integer representing an abstract character in a coded character
set, can be mapped to a sequence of one or more code values. This mapping is
called a "character encoding form" (or just "encoding form").

ISO/IEC 10646-1 defines a 32-bit encoding form called UCS-4, in which each
encoded character is represented by a single 32-bit code value in the code
space 0x0..0x7FFFFFFF (the most significant bit is never used). This encoding
form can represent every scalar value in the Unicode range 0x0..0x10FFFF,
and then some. There is also a newer encoding form called UTF-32: a subset
of UCS-4 that restricts its 32-bit code values to the 0x0..0x10FFFF code
space. UTF-32 is not yet part of the standard.
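
For example, here is the 32-bit code value for ANGSTROM SIGN, serialized
most-significant-byte first (Python 3's "utf-32-be" codec is just a
convenient stand-in for the UCS-4/UTF-32 form described above):

    '\u212b'.encode('utf-32-be')   # four bytes 00 00 21 2B: the code value 0x0000212B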

The ISO standard also defines a 16-bit encoding form called UCS-2, in which
a 16-bit code value in the code space 0x0..0xFFFF directly corresponds to an
identical scalar value, but this form is, of course, inherently limited to
representing only the first 65,536 scalar values.

The Unicode Standard and ISO/IEC 10646-1 define two more important encoding
forms: UTF-8 and UTF-16. UTF-8 algorithmically maps each Unicode scalar
value to a unique sequence of one to four 8-bit code values (the ISO/IEC
10646-1 definition of UTF-8 goes up to six code values, in order to cover
all of UCS-4). UTF-16 is a variation on UCS-2 that maps each Unicode scalar
value to a unique sequence of one or two 16-bit code values.
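
A quick illustration of the UTF-8 sequence lengths (again using Python 3's
codecs purely as a calculator; the sample scalar values are arbitrary picks
from each range):

    # number of 8-bit code values UTF-8 needs, by scalar value range
    [len(chr(cp).encode('utf-8')) for cp in (0x41, 0x3B1, 0x212B, 0x10335)]
    # -> [1, 2, 3, 4]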

In UTF-16, each 16-bit code value in the 0x0..0xD7FF range and the
0xE000..0xFFFF range directly corresponds to the identical scalar value,
while a "surrogate" pair of 16-bit code values algorithmically represents a
single scalar value in the range 0x010000..0x10FFFF. The first (high
surrogate) half of the pair is always in the 0xD800..0xDBFF range, and the
second (low surrogate) half is always in the 0xDC00..0xDFFF range. Unicode
3.0 and ISO/IEC 10646-1:2000 have adopted the UTF-16 mechanism as the only
official usage of the 0xD800..0xDFFF range.
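
Here is the surrogate-pair arithmetic spelled out as a small Python sketch
(the constants 0x10000, 0xD800, 0xDC00 and 0x3FF come straight from the
UTF-16 definition; the function name is mine):

    def utf16_code_values(scalar):
        """Map a Unicode scalar value to its UTF-16 code value sequence."""
        if scalar <= 0xFFFF:
            return [scalar]              # one code value, identical to the scalar value
        v = scalar - 0x10000             # 20 bits left over
        high = 0xD800 + (v >> 10)        # first half:  0xD800..0xDBFF
        low = 0xDC00 + (v & 0x3FF)       # second half: 0xDC00..0xDFFF
        return [high, low]

    [hex(cv) for cv in utf16_code_values(0x10335)]   # -> ['0xd800', '0xdf35']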

4. Each abstract character has one or two "Unicode values": the code value
or pair of code values that represent that character's scalar value in the
UTF-16 encoding form. Unicode uses a "U+xxxx" notation (a "U+" prefix
followed by four hexadecimal digits) to designate Unicode values. Since
Unicode values are UTF-16 code values, encoded characters with scalar values
in the 0x0..0xFFFF range are designated with a single U+xxxx, and encoded
characters with scalar values in the 0x010000..0x10FFFF range are designated
with a pair of U+xxxx values.
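
Sketched in Python, one way to derive the U+xxxx designations is to take the
UTF-16 code values from the built-in "utf-16-be" codec and print each as
four hex digits (the helper name u_plus is mine):

    import struct

    def u_plus(scalar):
        data = chr(scalar).encode('utf-16-be')                # UTF-16 code values, big-endian bytes
        units = struct.unpack('>%dH' % (len(data) // 2), data)
        return ' '.join('U+%04X' % u for u in units)

    u_plus(0x212B)    # -> 'U+212B'
    u_plus(0x10335)   # -> 'U+D800 U+DF35'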

Here are various ways of representing the proposed abstract character named
"GOTHIC LETTER Q" (which will probably be assigned to the Unicode scalar
value 0x10335):
   * in Unicode notation, by its Unicode scalar value: U-00010335
   * as a UCS-4 code value sequence, in hex notation: 0x00010335
   * as a UCS-2 code value sequence: illegal; out of range
   * as a UTF-16 code value sequence, in hex notation: 0xD800 0xDF35
   * in Unicode notation, by its Unicode value pair: U+D800 U+DF35
   * in the EBNF notation used by the XML spec: #x10335
   * as a UTF-8 code value sequence, in hex notation: 0xF0 0x90 0x8C 0xB5
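
Assuming the character really does land on scalar value 0x10335, the code
value sequences above can be double-checked mechanically, e.g. in a Python 3
session (again, Python is only serving as a calculator here):

    q = chr(0x10335)
    q.encode('utf-32-be').hex()   # -> '00010335'  (UCS-4/UTF-32 code value)
    q.encode('utf-16-be').hex()   # -> 'd800df35'  (UTF-16 code values 0xD800 0xDF35)
    q.encode('utf-8').hex()       # -> 'f0908cb5'  (UTF-8 code values 0xF0 0x90 0x8C 0xB5)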

5. An algorithm for converting code values to a sequence of 8-bit values
(bytes, octets) for cross-platform data exchange is a "character encoding
scheme". Encoding forms that produce 7-bit or 8-bit code value sequences
don't need additional processing, so UTF-8, for example, can be considered
to be both a character encoding form and a character encoding scheme. Other
encoding forms, however, need to have a consistent mechanism applied to
convert their 16-bit or 32-bit code value sequences to 8-bit sequences.
Unicode 3.0 has the character encoding schemes UTF-16BE and UTF-16LE for
this purpose. These work like UTF-16 but serialize each 16-bit code value as
a pair of bytes: in big-endian order for UTF-16BE (the byte carrying the
most significant bits comes first) or in little-endian order for UTF-16LE
(the byte carrying the least significant bits comes first).
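
For example, the two byte orders for a single 16-bit code value (the struct
calls are just an easy way to show the serialization; the codec calls give
the same result):

    import struct

    cv = 0x212B                      # one 16-bit UTF-16 code value
    struct.pack('>H', cv)            # big-endian:    bytes 21 2B  (UTF-16BE)
    struct.pack('<H', cv)            # little-endian: bytes 2B 21  (UTF-16LE)

    '\u212b'.encode('utf-16-be')     # bytes 21 2B
    '\u212b'.encode('utf-16-le')     # bytes 2B 21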

6. A "character map" correlates an abstract character in a character
repertoire with a specific sequence of bytes. Other words for a character
map are a "character set", "charset" (as in the IANA registry), "charmap",
or sometimes "code page".
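
For instance, two different character maps applied to the same abstract
character, LATIN CAPITAL LETTER A WITH RING ABOVE, using IANA charset names
as Python codec names (an illustration only):

    '\u00c5'.encode('iso-8859-1')    # one byte:  C5
    '\u00c5'.encode('utf-8')         # two bytes: C3 85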

References:

The Unicode Standard, Version 3.0: ISBN 0-201-61633-5
Unicode Technical Report #17:
http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Scheme%20(CES)


