Re: CCS and CEF definitions in UTR #17

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 24 2000 - 18:24:41 EDT


Mike Brown asked:

>
> Keld Jørn Simonsen wrote:
> > The specific codes for UTF-16 extension into plane 1-16
> > is not allowed in UCS-4 (or in UTF-8 for that matter).

This simply means that 0xD800..0xDFFF are not available for encoding
characters (by themselves), because they are used in the UTF-16
extension mechanism that defines characters by pairs of these code
values.
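In code, that pair mechanism is just arithmetic. Here is a minimal
sketch in Python (the function names are mine, not the standard's):

# Minimal sketch of the UTF-16 surrogate pair arithmetic.

def to_surrogate_pair(scalar):
    """Encode a scalar value in 0x10000..0x10FFFF as a surrogate pair."""
    assert 0x10000 <= scalar <= 0x10FFFF
    v = scalar - 0x10000               # 20 bits
    high = 0xD800 + (v >> 10)          # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)         # low 10 bits -> 0xDC00..0xDFFF
    return high, low

def from_surrogate_pair(high, low):
    """Decode a surrogate pair back to the scalar value."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

assert to_surrogate_pair(0x10330) == (0xD800, 0xDF30)  # GOTHIC LETTER AHSA
assert from_surrogate_pair(0xD800, 0xDF30) == 0x10330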

This is effectively no different than pointing out that for an
IBM DBCS host code page, 0x0E and 0x0F are not available for encoding
characters (by themselves), because they are used in the shift
mechanism that announces the single-byte or double-byte coding forms.
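Purely as an illustration (a hypothetical sketch, not any particular
IBM code page), that shift mechanism amounts to:

# 0x0E (shift-out) and 0x0F (shift-in) are not characters themselves;
# they switch the stream between single-byte and double-byte coding.

def split_code_values(data):
    """Split a mixed SBCS/DBCS byte stream into code values."""
    values, double, i = [], False, 0
    while i < len(data):
        b = data[i]
        if b == 0x0E:                  # shift-out: enter double-byte mode
            double, i = True, i + 1
        elif b == 0x0F:                # shift-in: back to single-byte mode
            double, i = False, i + 1
        elif double:
            values.append((data[i], data[i + 1]))
            i += 2
        else:
            values.append(b)
            i += 1
    return values

assert split_code_values(bytes([0x40, 0x0E, 0x42, 0x61, 0x0F, 0x40])) \
       == [0x40, (0x42, 0x61), 0x40]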

>
> I'm trying to sort out a table of Unicode scalar values and their
> corresponding UTF-16, UCS-4, and UCS-2 code value sequences. After
> re-reading Keld's statement and finding some supportive evidence in the
> UTF-16 amendment, I'm getting more confused, especially after consulting UTR
> #17.

So I guess I am nominated to try to explain again.

> Here is what I want someone to tell me:
>
> 1. The set of integers in a coded character set can include integers that
> are not assigned to abstract characters.

The Unicode code space is the set of integers 0 .. 0x10FFFF. This is also
the range of Unicode scalar values.

In that code space, for a variety of reasons, there are restricted values
that *cannot* be used to encode abstract characters. In particular:

0xD800..0xDFFF are restricted to the expression of surrogate pairs in
UTF-16.

0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, ... 0x10FFFE, 0x10FFFF are also
prohibited from use for encoding characters (the first two inherited
from Unicode 1.0, and the others added by 10646-1:1993).

By strict interpretation, although not by common practice, 0x00..0x1F
and 0x7F..0x9F are also restricted, and cannot be used to encode
(graphic) characters, since they are used for control functions.

If you subtract away the restricted values from the code space, you get
the set of integers which are available in the Unicode Standard to represent
encoded characters. That is the set of assignable Unicode scalar values.
To wit:

(0x00..0x1F) 0x20..0x7E (0x7F..0x9F) 0xA0..0xD7FF 0xE000..0xFFFD
0x10000..0x1FFFD ... 0x100000..0x10FFFD

(The parenthesized ranges are the control codes, which are available
only if you ignore the strict interpretation above.)
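Expressed as a predicate, just covering the restricted values listed
above (a sketch in Python):

def is_assignable_scalar_value(n):
    """True if the integer n may be assigned to an abstract character."""
    if not 0 <= n <= 0x10FFFF:
        return False                   # outside the code space entirely
    if 0xD800 <= n <= 0xDFFF:
        return False                   # restricted to surrogate pairs
    if (n & 0xFFFE) == 0xFFFE:
        return False                   # 0xFFFE/0xFFFF on every plane
    # By strict interpretation you would also exclude the controls:
    # if n <= 0x1F or 0x7F <= n <= 0x9F:
    #     return False
    return True

assert not is_assignable_scalar_value(0xD800)
assert not is_assignable_scalar_value(0x10FFFF)
assert is_assignable_scalar_value(0x10FFFD)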

The Unicode Standard, Version 3.0, maps a particular repertoire of abstract
characters to a subset of those available integers, namely:

...
0020 SPACE
0021 EXCLAMATION MARK
...
FFFD REPLACEMENT CHARACTER

The Unicode Standard, Version 3.0, also *assigns* private use to the
following ranges: 0xE000..0xF8FF, 0xF0000..0xFFFFD, and
0x100000..0x10FFFD (it doesn't advertise the latter two areas, Planes
15 and 16 respectively, but depends on their definition in
10646-1:2000). This means that those ranges are assigned characters,
but their interpretation is left open to those who choose to use them.
If you want to be strict about it, the integer 0xE000 is assigned to
the abstract character "the first private use character" to create
U+E000, and so on to the end of the available private use space.

The Unicode Standard, Version 3.1 will presumably map a larger repertoire of
abstract characters to a subset of those available integers, namely all of
those included in Unicode 3.0, plus also:

10300 ETRUSCAN LETTER A (~ OLD ITALIC LETTER A, whichever name it ends up with)
...
10330 GOTHIC LETTER AHSA
...
20000 CJK UNIFIED IDEOGRAPH-20000
...
E007F CANCEL TAG

Now how about the encoding forms (UTF-16 and UTF-8)? These map the
Unicode scalar values of the encoded character set to sequences of code units.

But here is the apparent paradox that Mike is worried about. If you are
mapping the encoded character set to sequences of code units, but
the encoded character set only contains that set of integers that has
actually been encoded, then sequences of code units that don't map *back*
to an encoded character must be illegal. But that presupposes a certain
view of the mapping.

In fact, the "mapping" is not done by making a list of encoded characters
and then providing their corresponding sequence of code units. Instead,
the encoding forms are always defined by "transformations" of the numbers
that can be expressed in terms of simple arithmetic operations. (UTF-EBCDIC
is probably the one exception that also requires a table lookup in the
transform, because of its peculiar constraints.)
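For instance, the UTF-8 transform can be written out as exactly that
kind of arithmetic. A sketch in Python, for the code space as described
above (see the standard, or RFC 2279, for the normative definition):

def utf8_code_units(scalar):
    """Map a Unicode scalar value to its sequence of UTF-8 code units."""
    assert 0 <= scalar <= 0x10FFFF
    if scalar < 0x80:
        return [scalar]                          # one byte: ASCII range
    if scalar < 0x800:
        return [0xC0 | (scalar >> 6),            # two bytes
                0x80 | (scalar & 0x3F)]
    if scalar < 0x10000:
        return [0xE0 | (scalar >> 12),           # three bytes
                0x80 | ((scalar >> 6) & 0x3F),
                0x80 | (scalar & 0x3F)]
    return [0xF0 | (scalar >> 18),               # four bytes
            0x80 | ((scalar >> 12) & 0x3F),
            0x80 | ((scalar >> 6) & 0x3F),
            0x80 | (scalar & 0x3F)]

assert utf8_code_units(0x20) == [0x20]
assert utf8_code_units(0xFFFD) == [0xEF, 0xBF, 0xBD]
assert utf8_code_units(0x10330) == [0xF0, 0x90, 0x8C, 0xB0]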

So think of UTF-16 and UTF-8 as providing sequences of code units that
map to *all* of the integers in the code space. See D29 in
the Unicode Standard. Then for any particular version of the Unicode Standard,
you can make the exact list of sequences of code units (in UTF-16 or UTF-8)
that have assigned encoded characters as of that version.

The reason for this is pretty obvious. The arithmetic transforms can
be defined once and be implemented generally. They don't have to be
uprooted and redefined every time new encoded characters are added to
the standard. And since the Unicode Standard has an open repertoire,
new encoded characters will continue to be added to it into the
indefinite future.
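If you did want the exact per-version list mentioned above, you could
run the transform over the whole code space and filter by assignment.
A sketch in Python, using its unicodedata module as a stand-in for a
particular version's list of assignments (unicodedata.name() only
knows *named* characters, so this approximates the assigned
repertoire):

import unicodedata

def assigned_utf16_sequences():
    """Yield (scalar value, UTF-16 code unit sequence) for named characters."""
    for n in range(0x10FFFF + 1):
        if 0xD800 <= n <= 0xDFFF:
            continue                   # no scalar values here
        try:
            unicodedata.name(chr(n))   # named in this version's data?
        except ValueError:
            continue
        if n < 0x10000:
            yield n, (n,)              # a single UTF-16 code unit
        else:
            v = n - 0x10000
            yield n, (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))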

TR17 defines CEF as follows:

"A character encoding form is a mapping from the set of integers used in a
CCS to the set of sequences of code units."

The potential ambiguity here is what "used in" means, I suppose. The
intention was not to claim that only those integers that had been *assigned*
to encoded characters were mapped, but rather the entire set of code points
available for use in assignment to encoded characters.

This can be understood if you look away from Unicode, which has unique
CEFs (UTF-16 and UTF-8) that apply only to it, and consider instead
some of the other CEFs mentioned in TR17. For example, "8-bit" is a
trivial CEF that maps integers in the range 0..0xFF to 8-bit bytes,
with each character represented by a single byte. Obviously such a
definition of a CEF cannot depend on which particular abstract
characters are assigned or unassigned in each of the hundreds of CCSs
for which the "8-bit" CEF applies.

>
> 2. Code unit sequences defined by a character encoding form can map to
> integers that are part of a coded character set but that have not been
> assigned to abstract characters.

2. Code unit sequences defined by a character encoding form can map to
integers that are part of THE AVAILABLE CODE SPACE but that have not been
assigned to abstract characters (and thus by definition are not encoded
characters).

--Ken

>
> - Mike
> ___________________________________________________________
> Mike J. Brown, software engineer, Webb Interactive Services
> XML/XSL stuff: http://www.skew.org/ http://www.webb.net/


