Re: U+xxxx, U-xxxxxx, and the basics

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Wed Mar 08 2000 - 08:12:07 EST


On Fri, Mar 03, 2000 at 04:57:21PM -0800, Mike Brown wrote:
>
> The mapping of abstract characters from a character repertoire to integers
> in a code space is called a "coded character set". Other names for such
> mappings are "character encoding", "coded character repertoire", "character
> set definition", or "code page". Each abstract character in a coded
> character set is an "encoded character".

"Character encoding" also covers transformation formats and coded character
set shifting techniques, such as ISO 2022, so delete that term here.

> In Unicode, each abstract character is mapped to a scalar value in the range
> 0x0..0x10FFFF. This "Unicode scalar value" uniquely identifies the
> character. Within that 0x0..0x10FFFF range, there are certain sub-ranges
> that are not assigned to characters by the standard; they are reserved for
> special functions, future extension mechanisms or private character
> assignments.

Is that so? The last time I looked, Unicode characters did not go beyond
0xFFFF. Then "surrogates" were defined as characters, and two surrogates
could be joined to form something else (this terminology
was in conflict with 10646).

> Aside from the Universal Character Set shared by the Unicode Standard and
> ISO 10646-1, other popular coded character sets include US-ASCII (128
> abstract characters mapped to scalar values in the range 0x0..0x7F) and
> ISO-8859-1 (US-ASCII plus another 96 abstract characters mapped to scalar
> values in the range 0xA0..0xFF).

ISO/IEC 8859-1 does not include all of US-ASCII, as the controls 0-31 and 127
are not in this standard. The IETF charset iso-8859-1 includes
both the US-ASCII controls and the C1 control characters of ISO/IEC 6429.
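
A small sketch of that difference in coverage, assuming Python 3 and its
built-in "latin-1" codec, which behaves like the IETF charset. The helper
name in_iso_iec_8859_1 is made up for illustration and is not part of any
library.

```python
# Sketch only: Python's "latin-1" codec follows the IETF charset iso-8859-1,
# so it decodes every byte value, including the C0 and C1 control positions.
all_bytes = bytes(range(256))
code_points = [ord(ch) for ch in all_bytes.decode("latin-1")]

def in_iso_iec_8859_1(byte_value):
    """Hypothetical helper: True for the graphic characters that
    ISO/IEC 8859-1 itself defines (0x20..0x7E and 0xA0..0xFF),
    False for the control positions 0x00..0x1F and 0x7F..0x9F."""
    return 0x20 <= byte_value <= 0x7E or 0xA0 <= byte_value <= 0xFF

# Count the positions covered by the ISO/IEC standard proper.
graphic_count = sum(in_iso_iec_8859_1(b) for b in range(256))
print(len(code_points), graphic_count)
```

The codec accepts all 256 byte values, while the standard proper covers only
the graphic positions.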
>
> * in decimal notation: 8491
> * in EBNF notation: \v00212B

>
> Here is a way of representing the abstract character itself, using its
> scalar value:
> * in Unicode notation: U-00212B

Hmm. Normally only 4 hex digits: U-212B or U212B.
>
> 3. Code values, or "code units", are numbers that computers use to represent
> abstract objects, such as Unicode characters. Code values are typically
> 8-bit, 16-bit, or 32-bit wide non-negative integers. An encoded character,
> or rather, the integer representing an abstract character in a coded
> character set, can be mapped to a sequence of one or more code values. This
> mapping is called an "encoding form".

What are "code values"? Bytes/octets? Normally you would only map
the integer to one value.
>
>
> The Unicode Standard and ISO/IEC 10646-1 define two more important encoding
> forms: UTF-8 and UTF-16. UTF-8 algorithmically maps each Unicode scalar
> value to a unique sequence of one to six 8-bit code values. UTF-16 is a
> variation on UCS-2 that maps each Unicode scalar value to a unique sequence
> of up to two 16-bit code values.

It is actually 10646 canonical code points that are mapped this way, at
least for your description of UTF-8. All Unicode scalar values can be mapped
to a sequence shorter than 6 octets.
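
To make the length point concrete, here is a minimal sketch, assuming the
original ISO/IEC 10646 definition of UTF-8 that reaches 31-bit values. The
function name utf8_length is an assumption for illustration only.

```python
def utf8_length(code_point):
    """Number of octets UTF-8 needs for a code point, using the original
    ISO/IEC 10646 definition that extends to 31-bit values (up to 6 octets)."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 code space")
    # Largest value representable with 1, 2, 3, 4, 5 and 6 octets.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for octets, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return octets

# The largest Unicode scalar value needs only 4 octets;
# 5 and 6 octets are reached only by larger UCS-4 values.
print(utf8_length(0x10FFFF), utf8_length(0x7FFFFFFF))
```

For values up to 0x10FFFF the result agrees with the length of Python's own
UTF-8 encoding of the same character.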
>
> In UTF-16, each 16-bit code value in the 0x0..0xD7FF range and the
> 0xE000..0xFFFF range directly corresponds to the same scalar value, while a
> "surrogate" pair of 16-bit code values algorithmically represents a single
> scalar value in the range 0x010000..0x10FFFF. The first half of the pair is
> always in the 0xD800..0xDBFF range, and the second half of the pair is in
> the 0xDC00..0xDFFF range. Unicode 3.0 and ISO/IEC 10646-1;2000 have adopted the
> UTF-16 mechanism as the only official usage of the 0xD800..0xDFFF scalar
> range.

It is ISO/IEC 10646-1:2000 (note colon instead of semicolon).
Previous versions of 10646 also had UTF-16 in there (since AMD 4 I think).
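
The surrogate arithmetic can be sketched as follows. This is an illustration
under the usual UTF-16 rules (high surrogates 0xD800..0xDBFF, low surrogates
0xDC00..0xDFFF), not text from either standard; the function name
to_utf16_code_units is hypothetical.

```python
def to_utf16_code_units(scalar):
    """Map a scalar value to a list of one or two 16-bit code units."""
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("outside the 0x0..0x10FFFF range")
    if 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("surrogate values are not scalar values")
    if scalar <= 0xFFFF:
        # Values in the 16-bit range are represented directly.
        return [scalar]
    # Split the 20-bit offset into two 10-bit halves.
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

print([hex(unit) for unit in to_utf16_code_units(0x10335)])
```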
>
> 4. Each abstract character has one or two "Unicode values", which is the
> code value or pair of code values that represent that character's scalar
> value in the UTF-16 encoding form. Unicode uses a "U+xxxx" notation to
> designate Unicode values. Since Unicode values are UTF-16 code values,
> encoded characters with scalar values in the 0x0..0xFFFF range are
> represented with one U+xxxx designation, and encoded characters with scalar
> values in the 0x010000..0x10FFFF range are represented with a pair of U+xxxx
> designations.

This is different from 10646, which only has one canonical value
for an abstract character. Maybe you can say that too.

> Here are various ways of representing the proposed abstract character named
> "GOTHIC LETTER Q" (which will probably be assigned to the Unicode scalar
> value 0x10335):
> * in Unicode notation, by its Unicode scalar value: U-010335

Always 4 or 8 hex in a "U" name.

> * as a UCS-4 code value sequence, in hex notation: 0x00010335
> * as a UCS-2 code value sequence: illegal; out of range
> * as a UTF-16 code value sequence, in hex notation: 0xD800 0xDF35
> * in Unicode notation, by its Unicode value pair: U+D800 U+DF35
> * in EBNF notation: \uD800 \uDF35
> * as a UTF-8 code value sequence, in hex notation: 0xF0 0x90 0x8C 0xB5
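
The code value sequences for scalar value 0x10335 can be checked with a few
lines of Python 3. This is only a sketch using the built-in codecs; the
big-endian codec variants are chosen here as an assumption, so that no byte
order mark is added to the output.

```python
ch = chr(0x10335)

# UCS-4 (UTF-32), big-endian, without a byte order mark.
ucs4 = ch.encode("utf-32-be")
# UTF-16, big-endian: the two 16-bit code values of the surrogate pair.
utf16 = ch.encode("utf-16-be")
# UTF-8: four 8-bit code values.
utf8 = ch.encode("utf-8")

print(ucs4.hex(), utf16.hex(), utf8.hex())
# prints: 00010335 d800df35 f0908cb5
```
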
>
> 5. An algorithm for converting code values to a sequence of 8-bit values
> (bytes, octets) for cross-platform data exchange is a "character encoding
> scheme". Encoding forms that produce 7-bit or 8-bit code value sequences
> don't need additional processing, so UTF-8, for example, can be considered
> to be both a character encoding form and a character encoding scheme. Other

Not so. UTF-8 always implies both the algorithm and the specific codes.
So it is not a character encoding scheme.

> encoding forms, however, need to have a consistent mechanism applied to
> convert their 16-bit or 32-bit code value sequences to 8-bit sequences.
> Unicode 3.0 has the character encoding schemes UTF-16BE and UTF-16LE for
> this purpose. These work like UTF-16 but break up each code value into a
> sequence of pairs of bytes, with each byte pair being either in Big Endian
> order for UTF-16BE (the byte with the most significant bits comes first) or
> Little Endian order for UTF-16LE.

UTF-16LE and UTF-16BE are not encoding schemes either.
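
Whatever the two are called, the byte-order step described in the quoted
text can be sketched like this. The function name serialize_utf16 and its
big_endian parameter are assumptions for illustration.

```python
def serialize_utf16(code_units, big_endian=True):
    """Turn a sequence of 16-bit code values into bytes, two per value."""
    order = "big" if big_endian else "little"
    return b"".join(unit.to_bytes(2, order) for unit in code_units)

# U+212B as a single 16-bit code value, in both byte orders.
print(serialize_utf16([0x212B], big_endian=True).hex())
print(serialize_utf16([0x212B], big_endian=False).hex())
```

The results match what Python's own "utf-16-be" and "utf-16-le" codecs
produce for the same character.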

> 6. A "character map" correlates an abstract character in a character
> repertoire with a specific sequence of bytes. Other words for a character
> map are a "character set", "charset" (as in the IANA registry), "charmap",
> or sometimes "code page".

Please do not promote the term "character set" in this context.

charmap is a POSIX term.

> References:
>
> The Unicode Standard, Version 3.0: ISBN 0-201-61633-5

Maybe also:
ISO/IEC 10646-1:2000 Universal Character Set (UCS) -Part 1 bla bla.

> Unicode Technical Report #17:
> http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Scheme%20(CES)

Kind regards
Keld


