RE: U+xxxx, U-xxxxxx, and the basics

From: Mike Brown (mbrown@corp.webb.net)
Date: Wed Mar 08 2000 - 15:12:52 EST


I wrote:
> > The mapping of abstract characters from a character
> > repertoire to integers in a code space is called a
> > "coded character set". Other names for such
> > mappings are "character encoding" [...]

Keld wrote:
> "character encoding" also includes a transformation
> format or coded character set shifting techniques,
> such as ISO 2022. So delete that term here.

My intent is to clarify issues relating to the creation of XML documents, so
I am mainly concerned with what certain terms mean in the realm of "ISO/IEC
10646-1993 ... (plus amendments AM 1 through AM 7)" as referenced by the XML
1.0 Recommendation.

Most of my statements were simply an interpretation of Unicode Technical
Report #17 and certain sections of the Unicode 3.0 book. Even though Unicode
2.x is probably more in line with the ISO spec referenced by XML, the
character encoding model described in UTR #17 and Unicode 3.0 does not seem
to introduce any concepts that are in conflict.

However, if, outside of the Unicode realm, "character encoding" means
something other than what I stated here, please post examples of that term
being used (maybe quote from ISO 2022?) so these can be considered. Both
UTR #17 and the Unicode 3.0 book attempt to mention the alternative and
conflicting terms for these same concepts that exist in the rest of the
information industry.

My interpretation of the current Unicode model is at
http://www.skew.org/xml/tutorial/ and involves the following concepts that
are IMO relevant to the authorship, storage and transmission of XML
documents:

1. character encodings: assignment of unique numbers in a code space to
abstract characters

2. encoding forms: conversion of numbers from that code space to 8-bit,
16-bit, or 32-bit code value *sequences* -- note that some numbers in the
code space may not be assigned to characters, but they can still be
converted to code value sequences.

3. encoding schemes: conversion of code value sequences to 8-bit value
(byte) sequences

4. charsets/character maps: direct mapping of abstract characters to byte
sequences
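
To make the four layers above concrete, here is a minimal Python sketch that
walks one character through them. The sample character, the helper name
utf16_code_values, and the choice of UTF-16 with big-endian byte order are
assumptions made for illustration; they are not taken from UTR #17 or the
tutorial.

```python
# Illustrative sketch only; names and the sample character are assumptions.
ch = "\U00010348"  # GOTHIC LETTER HWAIR, chosen because it lies above 0xFFFF

# 1. character encoding: abstract character -> number in the code space
code_point = ord(ch)


# 2. encoding form: number -> sequence of 16-bit code values (UTF-16 here)
def utf16_code_values(cp):
    """Convert a number in 0x0..0x10FFFF to a list of 16-bit code values."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # high surrogate carries the top 10 bits, low surrogate the bottom 10
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


code_values = utf16_code_values(code_point)

# 3. encoding scheme: 16-bit code values -> bytes (big-endian order here)
byte_seq = b"".join(v.to_bytes(2, "big") for v in code_values)

# 4. charset / character map: character -> bytes in one step
assert byte_seq == ch.encode("utf-16-be")
```

Steps 1 through 3 composed give the same bytes as the one-step mapping in
step 4, which is roughly the sense in which a charset is a "direct mapping".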

I wrote:
> > Within that 0x0..0x10FFFF range, there are certain
> > sub-ranges that are not assigned to characters by
> > the standard; [...]

Keld wrote:
> Is it so? Last time I looked Unicode characters did
> not go beyond 0xFFFF
> Then "surrogates" were defined as characters, and two
> surrogates could be joined to form something else (this
> terminology was in conflict with 10646).

In Unicode 2.0, that was more or less true, because using the UTF-16
surrogate mechanism was not mandated by the spec. However, it would be
misleading to imply that each Unicode value correlates to a "Unicode
character", whatever that may be. I revised my interpretation of the current
model slightly to read:

"The Unicode Standard calls each of the code points in the 0x0..0x10FFFF
code space a Unicode scalar value. Each Unicode scalar value uniquely
identifies the character assigned to that code point [if such an assignment
has been made]. There are certain ranges of Unicode scalar values that are
not assigned to characters by the standard; they are reserved for special
functions, future extension mechanisms or private character assignments."
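
As a rough illustration of the reserved ranges mentioned in that paragraph,
the sketch below labels a number in the code space using Python's unicodedata
module. The function name describe and its labels are hypothetical, and
Python's character database reflects a later Unicode version than the one
discussed here, so the set of assigned characters will differ.

```python
import unicodedata


def describe(cp):
    """Return a rough label for a number in the 0x0..0x10FFFF code space."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the code space")
    c = chr(cp)
    cat = unicodedata.category(c)
    if cat == "Cs":
        return "surrogate (reserved for UTF-16)"
    if cat == "Co":
        return "private use"
    if cat == "Cn":
        return "unassigned or noncharacter"
    # Control characters have no name, hence the fallback string.
    return "assigned: " + unicodedata.name(c, "<unnamed>")


for cp in (0x41, 0xD800, 0xE000, 0xFFFF):
    print("0x%04X" % cp, describe(cp))
```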

> ISO/IEC 8859-1 does not include all of us-ascii, as the
> controls 0-31 and 127 are not in this standard. The IETF
> charset iso-8859-1 includes both the us-ascii controls
> and the C1 control characters of ISO/IEC 6429.

Wow, thanks for that clarification. I didn't realize IETF's charset and the
ISO standard were different. I'll note that change for the next revision of
the materials.
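
The difference can be seen in a small Python sketch. It assumes that Python's
"latin-1" codec maps all 256 byte values one-to-one, which resembles the
charset Keld describes rather than the ISO standard's graphic characters
alone; the variable names are illustrative.

```python
import unicodedata

all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")  # every byte value decodes to one character

# General category "Cc" marks the control positions: 0-31, 127, and 128-159.
controls = [ord(c) for c in text if unicodedata.category(c) == "Cc"]
graphic = [ord(c) for c in text if unicodedata.category(c) != "Cc"]

print(len(controls), "control positions,", len(graphic), "graphic characters")
```

The positions collected in controls are the ones that the charset covers in
addition to the graphic characters.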

> It is ISO/IEC 10646-1:2000 (note colon instead of semicolon).

I have seen it as semicolon (not sure where), colon, and hyphen (in the XML
1.0 Recommendation). You're right, though, colon seems to be appropriate.
Change noted.

> > UTF-8, for example, can be considered to be both a
> > character encoding form and a character encoding scheme.
>
> not so. UTF-8 always implies both the algorithm and the
> specific codes.

I don't see how that invalidates my statement.

> UTF-16LE and UTF-16BE are not encoding schemes either.

UTR #17 says otherwise.
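
A short Python sketch of the distinction as I read it: the same 16-bit code
values serialize to different byte sequences depending on byte order, while
UTF-8 code values are already bytes. The sample characters and variable names
are assumptions for illustration.

```python
import struct

text = "A\u20ac"               # LATIN CAPITAL LETTER A, EURO SIGN
code_values = [0x0041, 0x20AC]  # their UTF-16 code values (encoding form)

# Encoding scheme: choose a byte order when serializing the 16-bit values.
be = struct.pack(">%dH" % len(code_values), *code_values)
le = struct.pack("<%dH" % len(code_values), *code_values)

assert be == text.encode("utf-16-be")
assert le == text.encode("utf-16-le")
assert be != le

# UTF-8 code values are 8 bits wide, so no byte-order step is needed;
# the code value sequence and the byte sequence coincide.
utf8_code_values = list(text.encode("utf-8"))
assert bytes(utf8_code_values) == text.encode("utf-8")
```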

> > Other words for a character map are a "character set",
> > "charset" (as in the IANA registry), charmap [...]
>
> Please do not promote the term "character set" in this relation.
> charmap is a POSIX term.

I am quoting almost verbatim from UTR #17.

> > References:
> Maybe also:
> ISO/IEC 10646-1:2000 Universal Character Set (UCS)

I didn't refer to it because I didn't, and still don't, have access to it
(another recent topic of discussion on the list) :-)

   - Mike
___________________________________________________________
Mike J. Brown, software engineer, Webb Interactive Services
XML/XSL stuff: http://www.skew.org/ http://www.webb.net/


