From: Peter Constable (petercon@microsoft.com)
Date: Fri Nov 28 2003 - 12:30:43 EST
> -----Original Message-----
> From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org] On Behalf Of Rick McGowan
> The following public review issues are new:
>
> 25 Proposed Update UTR #17 Character Encoding Model 2004.01.27
I have submitted the following comments, copied here in case anyone
wishes to discuss them:
The draft text for TR17, section 5 says, "A simple character encoding
scheme is a mapping of each code unit of a CCS into a unique serialized
byte sequence." It goes on to define a compound CES. While not stated
explicitly, the Unicode CESs do not fit the definition of a compound CES,
and so the definition for simple CES must apply.
The problem is that this definition cannot accommodate all seven Unicode
CESs. Since it defines a CES as a mapping from each code unit, there are
only two possible byte-order-dependent mappings for 16- and 32-bit code
units. In other words, the distinction between UTF-16BE and UTF-16 data
that is big-endian cannot be a CES distinction because individual code
units are mapped in exactly the same way in both cases.
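As a rough illustration of this point, the Python sketch below serializes a short string one code unit at a time. The helper names (utf16_code_units, map_unit_be, map_unit_le) are hypothetical and do not come from TR17; the sketch assumes only Python's standard struct module and its built-in UTF-16 codecs.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text in the UTF-16 encoding form."""
    raw = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    return [unit for (unit,) in struct.iter_unpack(">H", raw)]

def map_unit_be(unit):
    """Per-code-unit mapping, most significant byte first."""
    return struct.pack(">H", unit)

def map_unit_le(unit):
    """Per-code-unit mapping, least significant byte first."""
    return struct.pack("<H", unit)

text = "A\U00010400"  # one BMP character and one supplementary character
units = utf16_code_units(text)

# Serialize unit by unit, then compare with what the codecs produce.
bytes_be = b"".join(map_unit_be(u) for u in units)
bytes_le = b"".join(map_unit_le(u) for u in units)

same_as_utf16be = bytes_be == text.encode("utf-16-be")
same_as_utf16le = bytes_le == text.encode("utf-16-le")

# Big-endian data labelled "UTF-16" differs only by a leading BOM (FE FF),
# which is a property of the stream, not of any individual code unit mapping.
utf16_big_endian_with_bom = b"\xfe\xff" + bytes_be
```

The two per-unit functions are the only byte-order-dependent mappings available for 16-bit code units, and the big-endian one produces the same bytes for every code unit whether the data is labelled UTF-16BE or UTF-16.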
A definition for simple CES must, at a minimum, refer to a mapping of
*streams* of code units if it is to include details about a byte-order
mark that may or may not occur at the beginning of a stream.
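A minimal sketch of what a stream-level mapping might look like follows. The function serialize_stream and its parameters are my own illustration, not anything defined in TR17; the point is only that the BOM decision depends on position in the stream, so it needs the whole stream as input.

```python
import struct

BOM = 0xFEFF  # U+FEFF, used as the byte-order mark

def serialize_stream(units, big_endian, with_bom):
    """Map a whole stream of 16-bit code units to bytes (hypothetical helper)."""
    fmt = ">H" if big_endian else "<H"
    # The BOM is emitted once, at the start of the stream only.
    prefix = struct.pack(fmt, BOM) if with_bom else b""
    return prefix + b"".join(struct.pack(fmt, u) for u in units)

units = [0x0048, 0x0069]  # the code units for "Hi"

utf16be_bytes = serialize_stream(units, big_endian=True, with_bom=False)
utf16_be_bom = serialize_stream(units, big_endian=True, with_bom=True)
utf16_le_bom = serialize_stream(units, big_endian=False, with_bom=True)
```

Here utf16be_bytes and utf16_be_bom agree on every code unit of the text and differ only in the two bytes that open the stream.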
I would suggest that, in order to accommodate the UTF-16 and UTF-32
CESs, an appropriate definition should actually be a level of
abstraction away from "a mapping": a CES is a specification for
mappings. Any mapping is necessarily deterministic, giving a specific
output for each input. A mapping itself cannot serialize "in either
big-endian or little-endian format"; it must be one or the other,
unambiguously. On the other hand, a specification for how to map into
byte sequences can be ambiguous in this regard. Thus, the UTF-16 CES can
be considered a specification for mapping into byte sequences that
allows a little-endian mapping or a big-endian mapping.
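To make the "specification for mappings" idea concrete, here is a sketch that models each CES as the set of deterministic mappings it permits. The names make_mapping, CES_SPECS and conforms are hypothetical, and the contents of the UTF-16 entry reflect my reading of the Unicode rules (BOM optional, big-endian when no BOM is present), not wording from the TR17 draft.

```python
import struct

def make_mapping(big_endian, with_bom):
    """Build one deterministic mapping from a code unit stream to bytes."""
    fmt = ">H" if big_endian else "<H"

    def mapping(units):
        prefix = struct.pack(fmt, 0xFEFF) if with_bom else b""
        return prefix + b"".join(struct.pack(fmt, u) for u in units)

    return mapping

# Assumed model: a CES is a specification that admits one or more mappings.
CES_SPECS = {
    "UTF-16BE": [make_mapping(True, False)],
    "UTF-16LE": [make_mapping(False, False)],
    "UTF-16": [
        make_mapping(True, True),    # big-endian with BOM
        make_mapping(False, True),   # little-endian with BOM
        make_mapping(True, False),   # no BOM, read as big-endian
    ],
}

def conforms(ces_name, units, data):
    """True if data is the output of some mapping the named CES permits."""
    return any(mapping(units) == data for mapping in CES_SPECS[ces_name])

units = [0x0041]  # the single code unit for "A"
utf16_outputs = [mapping(units) for mapping in CES_SPECS["UTF-16"]]
```

Each entry in a list is an unambiguous mapping; only the specification that groups them leaves the byte order open.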
Peter
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division