RE: Default endianness of Unicode, or not

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Apr 10 2002 - 15:49:31 EDT


Yves wrote, in response to Doug:

> > > The last time I read the Unicode standard UTF-16 was big endian
> > > unless a BOM was present, and that's what I expected from a UTF-16
> > > converter.
> >
> > Conformance requirement C2 (TUS 3.0, p. 37) says:
> >
> > "The Unicode Standard does not specify any order of bytes inside a
> > Unicode value."
>
> (I posted the previous email hastily it seems.)
>
> But wait. Same page, 3 lines below, conformance requirement C3 says:
>
> "A process shall interpret a Unicode value that has been serialized into a
> sequence of bytes by most significant byte first, in the absence of
> higher-level protocols."
>
> I read this as saying that by default the byte ordering is big endian. Don't
> you?

There is a problem here in that "by default" can be interpreted in
different ways, leading to potential confusion.

The key point is in D35, p. 47 of TUS 3.0:

* In UTF-16, <004D 0061 0072 006B> is serialized as
<FF FE 4D 00 61 00 72 00 6B 00>, <FE FF 00 4D 00 61 00 72 00 6B>, or
<00 4D 00 61 00 72 00 6B>.

The third instance cited above is the *unmarked* case -- what
you get if you have no explicit marking of byte order with the BOM
signature. The contrasting byte sequence <4D 00 61 00 72 00 6B 00>
would be illegal in the UTF-16 encoding scheme. [It is, of course,
perfectly legal UTF-16LE.]
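
For anyone who wants to check those byte sequences by hand, here is a
minimal Python sketch. It sticks to the explicit "utf-16-be" and
"utf-16-le" codecs and prepends the BOM itself, so nothing depends on
how a particular library treats unmarked data. The helper name hexdump
is purely illustrative and not part of any standard or library.

```python
import codecs

text = "Mark"  # the code unit sequence <004D 0061 0072 006B>

# Explicit byte orders; the BOM is prepended by hand where wanted.
marked_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
marked_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
unmarked_be = text.encode("utf-16-be")
unmarked_le = text.encode("utf-16-le")


def hexdump(data: bytes) -> str:
    """Format bytes in the <XX XX ...> notation used in this thread."""
    return "<" + " ".join(f"{b:02X}" for b in data) + ">"


for label, data in [
    ("BOM + little-endian", marked_le),
    ("BOM + big-endian", marked_be),
    ("unmarked big-endian", unmarked_be),
    ("unmarked little-endian", unmarked_le),
]:
    print(f"{label:24} {hexdump(data)}")

# Reading the unmarked little-endian bytes under the big-endian default
# does not fail; it silently yields four unrelated characters.
misread = unmarked_le.decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in misread])
```

The last line is the practical reason the unmarked little-endian form
cannot count as the UTF-16 encoding scheme: a receiver applying the
big-endian default gets different text rather than an error.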

The intent of all this is that if you run into serialized UTF-16 data,
in the absence of any other information, you should assume and
interpret it as big-endian order. The "other information" (or
"higher-level protocol") could consist of text labelling (as
in MIME labels) or other out-of-band information. It could even
consist of just knowing what the CPU endianness of the platform
you are running on is (e.g., knowing whether you are compiled
with BYTESWAP on or off :-) ). And, of course, it is always
possible for the interpreting process to perform a data heuristic
on the byte stream, and use *that* as the other information to
determine that the byte stream is little-endian UTF-16 (i.e.
UTF-16LE), rather than big-endian.
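
The decision order described above might be sketched as follows. This
is only an illustration under stated assumptions: the function name
decode_utf16_scheme and its declared_order parameter are hypothetical,
declared_order stands in for whatever higher-level protocol is
available (a MIME label, platform knowledge, a heuristic), and checking
the BOM before the declared order is a design choice of this sketch,
not something the standard's text dictates.

```python
import codecs
from typing import Optional


def decode_utf16_scheme(data: bytes,
                        declared_order: Optional[str] = None) -> str:
    """Decode bytes labelled as the UTF-16 encoding scheme.

    declared_order is a hypothetical stand-in for out-of-band
    information: "big", "little", or None when nothing is known.
    """
    if declared_order not in (None, "big", "little"):
        raise ValueError("declared_order must be 'big', 'little' or None")

    # 1. An explicit BOM signature settles the question (assumption:
    #    in this sketch it takes precedence over declared_order).
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")

    # 2. Otherwise use the higher-level protocol, if there is one.
    if declared_order == "little":
        return data.decode("utf-16-le")

    # 3. With no other information, interpret as big-endian.
    return data.decode("utf-16-be")
```

For example, decode_utf16_scheme(bytes.fromhex("004D00610072006B"))
returns "Mark" with no hint at all, while the byte-swapped unmarked
sequence needs declared_order="little" to come out right.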

And a lot of the text in the standard about being neutral between
byte orders is the result of the political intent of the standard,
way back when, to deliberately not favor either big-endian or
little-endian CPU architectures, and to allow use of native
integer formats to store characters on either platform type.

Again, as with many of the issues of this kind being discovered by
the corps of Unicode exegetes out there, part of the problem is
the distortion that has crept into the normative definitions in
the standard as Unicode has evolved from a 16-bit encoding to
a 21-bit encoding with 3 encoding forms and 7 encoding schemes.

To lift the veil again a little on the Unicode 4.0 editorial
work -- here, for example, is some suggested text that the editorial
committee is working on to clarify the UTF-16 encoding form and
the UTF-16 encoding scheme. [This text is suggested draft only,
so don't go running off claiming conformance to it yet!]

For the UTF-16 character encoding *form*:

"D32 <ital>UTF-16 character encoding form:</ital> the Unicode
CEF which assigns each Unicode scalar value in the ranges U+0000..
U+D7FF and U+E000..U+FFFF to a single 16-bit code unit with the
same numeric value as the Unicode scalar value, and which assigns
each Unicode scalar value in the range U+10000..U+10FFFF to a
surrogate pair, according to Table 3-X.

  * In UTF-16, <004D, 0430, 4E8C, 10302> is represented as
    <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds
    to U+10302."
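
As a rough illustration of that mapping (not the draft's Table 3-X
itself, just the usual surrogate arithmetic), here is a small Python
sketch. The function name utf16_code_units is made up for this
example.

```python
from typing import Iterable, List


def utf16_code_units(scalars: Iterable[int]) -> List[int]:
    """Map Unicode scalar values to UTF-16 code units (the encoding form)."""
    units: List[int] = []
    for cp in scalars:
        # Surrogate code points are not scalar values; neither is
        # anything outside U+0000..U+10FFFF.
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp <= 0xFFFF:
            # One 16-bit code unit with the same numeric value.
            units.append(cp)
        else:
            # Supplementary range: split the 20-bit offset into a
            # high (lead) and a low (trail) surrogate.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
    return units


units = utf16_code_units([0x004D, 0x0430, 0x4E8C, 0x10302])
print(" ".join(f"{u:04X}" for u in units))
```

Note that nothing here involves bytes yet; byte order only enters the
picture at the encoding scheme level.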

For the UTF-16 character encoding *scheme*:

"D43 <ital>UTF-16 character encoding scheme:</ital> the Unicode
CES that serializes a UTF-16 code unit sequence as a byte sequence
in either big-endian or little-endian format.

  * In UTF-16 (the CES), the UTF-16 code unit sequence
    <004D 0430 4E8C D800 DF02> is serialized as
    <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or
    <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or
    <00 4D 04 30 4E 8C D8 00 DF 02>."

   etc., etc.
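
The three serializations in the draft D43 example can be reproduced
with a short Python sketch. The function name serialize_utf16_scheme
and its parameters are hypothetical; rejecting unmarked little-endian
output is this sketch's reading of the point made earlier that such a
byte sequence belongs to UTF-16LE, not to the UTF-16 scheme.

```python
import struct
from typing import Sequence


def serialize_utf16_scheme(units: Sequence[int],
                           byte_order: str = "big",
                           with_bom: bool = False) -> bytes:
    """Serialize UTF-16 code units as bytes in the UTF-16 encoding scheme."""
    if byte_order not in ("big", "little"):
        raise ValueError("byte_order must be 'big' or 'little'")
    if byte_order == "little" and not with_bom:
        # Unmarked little-endian bytes are UTF-16LE, not UTF-16.
        raise ValueError("little-endian UTF-16 (the CES) requires a BOM")

    out = list(units)
    if with_bom:
        # U+FEFF serialized in the chosen order is the BOM signature.
        out.insert(0, 0xFEFF)

    prefix = ">" if byte_order == "big" else "<"
    # "H" is an unsigned 16-bit integer, one per code unit.
    return struct.pack(f"{prefix}{len(out)}H", *out)


units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
for order, bom in [("big", True), ("little", True), ("big", False)]:
    data = serialize_utf16_scheme(units, byte_order=order, with_bom=bom)
    print("<" + " ".join(f"{b:02X}" for b in data) + ">")
```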

There, feel better?

--Ken


