Re: Default endianness of Unicode, or not

From: Mark Davis (mark@macchiato.com)
Date: Sat Apr 13 2002 - 18:19:42 EDT


Part of the problem is that the term "UTF-16" means two different
things. Let me see if I can make it clearer.

Let "UTF-16M" refer to the in-memory form, which is a sequence of 16-bit
code units. The byte ordering is logically immaterial, since it is not
a sequence of bytes. Such a sequence does not use a BOM. The code
point sequence <U+1234 U+0061 U+10000> is represented as the UTF-16M
sequence <0x1234 0x0061 0xD800 0xDC00>.
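In code terms, the mapping from scalar values to UTF-16M code units is just the surrogate-pair arithmetic (a sketch; the function name is mine):

```python
def to_utf16m(code_points):
    """Map Unicode scalar values to UTF-16 code units (the in-memory form)."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)                     # BMP character: one code unit
        else:
            v = cp - 0x10000                     # 20-bit value, split in two
            units.append(0xD800 + (v >> 10))     # high surrogate
            units.append(0xDC00 + (v & 0x3FF))   # low surrogate
    return units

# <U+1234 U+0061 U+10000> -> <0x1234 0x0061 0xD800 0xDC00>
to_utf16m([0x1234, 0x0061, 0x10000])
```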

Let "UTF-16", on the other hand, refer only to the byte-serialized form.
The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
as one of:
<0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
<0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
<0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB
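A sketch of that serialization in Python (the function name and the bom parameter are mine, not anything from the standard):

```python
def serialize_utf16(units, bom=None):
    """Serialize UTF-16 code units under the UTF-16 encoding scheme.
    bom=None  -> BOM-less, big-endian by default
    bom='be'  -> big-endian with BOM
    bom='le'  -> little-endian with BOM (the "MOB" case above)"""
    order = 'little' if bom == 'le' else 'big'
    out = bytearray()
    if bom is not None:
        out += (0xFEFF).to_bytes(2, order)   # byte order mark, in chosen order
    for u in units:
        out += u.to_bytes(2, order)
    return bytes(out)

units = [0x1234, 0x0061, 0xD800, 0xDC00]
serialize_utf16(units)         # BOM-less: 12 34 00 61 D8 00 DC 00
serialize_utf16(units, 'le')   # FF FE 34 12 61 00 00 D8 00 DC
```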

UTF-16BE is a serialization of UTF-16M into bytes.
The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
as:
<0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00>

UTF-16LE is a serialization of UTF-16M into bytes.
The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
as:
<0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>
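Python's built-in codecs implement exactly these two fixed-order schemes, so you can check the byte sequences directly (the string literal is just the example code point sequence):

```python
# U+1234, U+0061, U+10000 -- Python encodes the supplementary character
# to the surrogate pair 0xD800 0xDC00 itself.
s = '\u1234\u0061\U00010000'

s.encode('utf-16-be').hex()  # '12340061d800dc00' -- never a BOM
s.encode('utf-16-le').hex()  # '3412610000d800dc' -- never a BOM
```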

Note: if you have a code point sequence starting with U+FEFF (e.g.
<U+FEFF ...>), it is represented as:
UTF-16M: <0xFEFF ...>
UTF-16BE: <0xFE 0xFF ...>
UTF-16LE: <0xFF 0xFE ...>
UTF-16: <0xFF 0xFE 0xFF 0xFE ...> OR <0xFE 0xFF 0xFE 0xFF ...>
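Python's 'utf-16' codec happens to illustrate that last case: on encode it always prepends its own BOM (in the platform's byte order, so I give no fixed hex here), and on decode it strips only that first U+FEFF, leaving the one that belongs to the text:

```python
s = '\ufeffA'                 # text that genuinely begins with U+FEFF

data = s.encode('utf-16')     # codec adds a BOM in front: FEFF then the text
data[:2]                      # b'\xff\xfe' or b'\xfe\xff', platform-dependent

data.decode('utf-16')         # '\ufeffA' -- the leading BOM is consumed,
                              # the text's own U+FEFF survives the round trip
```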

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <yves@realnames.com>
Sent: Saturday, April 13, 2002 11:42
Subject: Re: Default endianness of Unicode, or not

> On Wednesday 2002-04-10, Kenneth Whistler <kenw@sybase.com> wrote:
>
> > There, feel better?
>
> Not really. I'm getting the sense on one hand that UTF-16, sans BOM,
> can be big-endian or little-endian depending on the platform, on the
> other hand that little-endian UTF-16 isn't "legal" unless it has a BOM,
> and on the third hand (!) that all this still hasn't been fully thought
> out.
>
> (In the following text, I will deliberately spell out "big-endian" and
> "little-endian" instead of using the handy abbreviations "BE" and "LE,"
> because those refer to the specifically defined encoding schemes
> UTF-16BE and UTF-16LE and I don't always mean to do that.)
>
> > * In UTF-16, <004D 0061 0072 006B> is serialized as
> > <FF FE 4D 00 61 72 00 6B 00>, <FE FF 00 4D 00 61 00 72 00 6B>, or
> > <00 4D 00 61 00 72 00 6B>.
> >
> > The third instance cited above is the *unmarked* case -- what
> > you get if you have no explicit marking of byte order with the BOM
> > signature. The contrasting byte sequence <4D 00 61 72 00 6B 00>
> > would be illegal in the UTF-16 encoding scheme.
>
> You mean because of the missing 00 byte? (Rim shot.)
>
> > [It is, of course,
> > perfectly legal UTF-16LE.]
>
> I don't know, looks to me like a perfectly good sequence of four CJK
> ideographs. (Rim shot.)
>
> No, but seriously, folks. Can we interpret the UTF-16 encoding
> *scheme* -- we're not talking about *form* here, since that has nothing
> to do with byte order -- as being platform-endian, or does it absolutely
> have to be big-endian? Because if it has to be big-endian, even on a
> little-endian platform, then there's an awful lot of non-conformant
> "UTF-16" lurking around in Windows NT (e.g. NTFS filenames).
>
> > The intent of all this is if you run into serialized UTF-16 data,
> > in the absence of any other information, you should assume and
> > interpret it as big-endian order. The "other information" (or
> > "higher-level protocol") could consist of text labelling (as
> > in MIME labels) or other out-of-band information. It could even
> > consist of just knowing what the CPU endianness of the platform
> > you are running on is (e.g., knowing whether you are compiled
> > with BYTESWAP on or off :-) ). And, of course, it is always
> > possible for the interpreting process to perform a data heuristic
> > on the byte stream, and use *that* as the other information to
> > determine that the byte stream is little-endian UTF-16 (i.e.
> > UTF-16LE), rather than big-endian.
>
> That's quite different from Yves' original statement that "UTF-16 is
> big-endian unless a BOM is present."
>
> > And a lot of the text in the standard about being neutral between
> > byte orders is the result of the political intent of the standard,
> > way back when, to deliberately not favor either big-endian or
> > little-endian CPU architectures, and to allow use of native
> > integer formats to store characters on either platform type.
>
> This is a bit troubling. It seems to imply that the decision "way back
> when" to be neutral about byte order was merely a political gesture to
> get the little-endian guys on board, and that the rules are changing
> somewhat to favor the big-endian guys.
>
> > Again, as for many of these kinds of issues being discovered by
> > the corps of Unicode exegetes out there, part of the problem is
> > the distortion that has set in for the normative definitions in
> > the standard as Unicode has evolved from a 16-bit encoding to
> > a 21-bit encoding with 3 encoding forms and 7 encoding schemes.
>
> No argument there. There are still plenty of common-man
> interpretations, and plenty of text in TUS 3.0, that treat UTF-16 as the
> "one true" encoding form of Unicode. I know this is being cleaned up
> for 4.0; I just hope public perceptions will follow.
>
> > For the UTF-16 character encoding *form*:
> >
> > "D32 <ital>UTF-16 character encoding form:</ital> the Unicode
> > CEF which assigns each Unicode scalar value in the ranges U+0000..
> > U+D7FF and U+E000..U+FFFF to a single 16-bit code unit with the
> > same numeric value as the Unicode scalar value, and which assigns
> > each Unicode scalar value in the ranges U+10000..U+10FFFF to a
> > surrogate pair, according to Table 3-X.
> >
> > * In UTF-16, <004D, 0430, 4E8C, 10302> is represented as
> > <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds
> > to U+10302."
>
> Fine. I don't think there are any questions concerning UTF-16 as a CEF.
>
> > For the UTF-16 character encoding *scheme*:
> >
> > "D43 <ital>UTF-16 character encoding scheme:</ital> the Unicode
> > CES that serializes a UTF-16 code unit sequence as a byte sequence
> > in either big-endian or little-endian format.
> >
> > * In UTF-16 (the CES), the UTF-16 code unit sequence
> > <004D 0430 4E8C D800 DF02> is serialized as
> > <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or
> > <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or
> > <00 4D 04 30 4E 8C D8 00 DF 02>."
>
> Here the draft text is saying in the description that UTF-16 can be
> either big-endian or little-endian, and can include a BOM or omit it.
> Four possibilities. Good. But then the examples leave out the non-BOM
> little-endian serialization, which implies it is not conformant like the
> other three. Not so good, because (a) the description and examples
> don't really match and (b) the examples rule out the possibility of
> UTF-16 text that we might know darn well to be little-endian, not
> because of a BOM but perhaps because of the other indicators Ken
> mentioned: MIME labeling, knowledge of the originating platform,
> heuristics, etc.
>
> The exegesis continues....
>
> -Doug Ewell
> Fullerton, California
>
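(An aside on the "data heuristic" Ken mentions: for mostly-Latin text, a
census of where the zero bytes fall is usually enough. A sketch, not a
spec; the function name and the tie-breaking rule are mine:)

```python
def guess_utf16_order(data):
    """Guess the byte order of UTF-16 bytes that are mostly Latin script.
    An ASCII-range code unit puts its 0x00 byte first in big-endian
    output and second in little-endian output."""
    if data[:2] == b'\xfe\xff':
        return 'be'                              # explicit BOM, big-endian
    if data[:2] == b'\xff\xfe':
        return 'le'                              # explicit BOM, little-endian
    zeros_even = sum(1 for i in range(0, len(data) - 1, 2) if data[i] == 0)
    zeros_odd  = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
    return 'be' if zeros_even >= zeros_odd else 'le'

guess_utf16_order('Mark'.encode('utf-16-le'))  # 'le'
guess_utf16_order('Mark'.encode('utf-16-be'))  # 'be'
```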



This archive was generated by hypermail 2.1.2 : Sat Apr 13 2002 - 16:40:29 EDT