Re: Default endianness of Unicode, or not

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 15 2002 - 15:53:47 EDT


Doug responded to Mark's clarification:

> > The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
> > as one of:
> > <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> > <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> > <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB
>
> *This* is where I'm having a problem. Mark states here, again, that
> BOM-less UTF-16 (the CES) must be big-endian. That is:
>
> <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless
>
> is not an instance of any valid CES. That, to me, is a change from what
> Unicode has stated before, and from what Ken just said about using
> "other information" (which could include external tagging, knowledge of
> the originating platform, or heuristics) to determine the intended byte
> order.

So let me attempt the clarification from another angle (and
without using Mark's introduction of yet more terminology to
get in the way ;-) ).

Suppose you have UTF-16 (CEF) data <1234 0061 D800 DF00>.
(I adjusted the SMP character to make it actually use an assigned
character, i.e., the character sequence: <U+1234, U+0061, U+10300>.)

You want to serialize that data. There are four valid options:

1. <12 34 00 61 D8 00 DF 00>
2. <34 12 61 00 00 D8 00 DF>
3. <FE FF 12 34 00 61 D8 00 DF 00>
4. <FF FE 34 12 61 00 00 D8 00 DF>

A. If you emit (1), you can legally label it UTF-16BE or UTF-16.
B. If you emit (2), you can legally label it UTF-16LE.
C. If you emit (3) or (4), you can legally label it UTF-16.

If you depart from the recommendations of (A), (B), and (C), then
you have mislabeled your serialized data, and are not in compliance
with the standard.
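
To make the four options concrete, here is a minimal Python sketch
that produces each serialization. The names `text`, `serializations`
and `LEGAL_LABELS` are made up for illustration; the label table just
restates (A), (B), and (C) above. It uses the explicit "utf-16-be" and
"utf-16-le" codecs and prepends the BOM by hand, so nothing depends on
the platform's native byte order.

```python
import codecs

# The character sequence <U+1234, U+0061, U+10300> from the example.
text = "\u1234\u0061\U00010300"

be = text.encode("utf-16-be")   # big-endian, no BOM
le = text.encode("utf-16-le")   # little-endian, no BOM

# Options 1-4, numbered as in the list above.
serializations = {
    1: be,
    2: le,
    3: codecs.BOM_UTF16_BE + be,   # FE FF prefix
    4: codecs.BOM_UTF16_LE + le,   # FF FE prefix
}

# Labels each option may legally carry, per (A), (B), and (C).
LEGAL_LABELS = {
    1: {"UTF-16BE", "UTF-16"},
    2: {"UTF-16LE"},
    3: {"UTF-16"},
    4: {"UTF-16"},
}

for n, data in serializations.items():
    print(n, data.hex(" ").upper(), sorted(LEGAL_LABELS[n]))
```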

Now let's turn things around. You received serialized Unicode data
in the absence of a higher-level protocol (i.e., you don't have a
valid label or other context to depend on for byte order).

A. If you receive (1), it is illegal as UTF-8 or UTF-32, and could
   only be interpreted as the UTF-16 code unit sequence:
   <1234 0061 D800 DF00>. You *assume* big-endian.

B. If you receive (2), it is illegal as UTF-8 or UTF-32, and could
   only be interpreted as the UTF-16 code unit sequence:
   <3412 6100 00D8 00DF>. You *assume* big-endian.

C. If you receive (3), it is illegal as UTF-8 or UTF-32, and could
   only be interpreted as the UTF-16 code unit sequence:
   <1234 0061 D800 DF00>. You *deduce* big-endian from the BOM.

D. If you receive (4), it is illegal as UTF-8 or UTF-32, and could
   only be interpreted as the UTF-16 code unit sequence:
   <1234 0061 D800 DF00>. You *deduce* little-endian from the BOM.

Case (B), of course, is not what we expected, but that is in fact
what the standard *requires* you to do, in the absence of a higher-level
protocol. (Note that all four code units that result do in fact
correspond to valid, encoded characters -- two Han characters,
followed by Ø and ß.)
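
The receiving rules for the four cases can be sketched in a few lines
of Python. The function name `decode_unlabeled_utf16` is hypothetical,
and the sketch assumes the input is already known to be some UTF-16
serialization with no label attached. It checks for a BOM and
otherwise assumes big-endian; it deliberately avoids Python's generic
"utf-16" codec, which falls back to the platform's native order when
there is no BOM.

```python
import codecs

def decode_unlabeled_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes that arrive with no higher-level protocol."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF: deduce big-endian
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE: deduce little-endian
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")             # no BOM: assume big-endian

case_1 = bytes.fromhex("12 34 00 61 D8 00 DF 00")
case_2 = bytes.fromhex("34 12 61 00 00 D8 00 DF")

# Case (1) round-trips to <U+1234, U+0061, U+10300>.
print([f"U+{ord(c):04X}" for c in decode_unlabeled_utf16(case_1)])
# Case (2) comes out as <U+3412, U+6100, U+00D8, U+00DF>.
print([f"U+{ord(c):04X}" for c in decode_unlabeled_utf16(case_2)])
```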

If, on the other hand, you *did* have a higher-level protocol, e.g.,
you knew that you had received (2) *and* it had a label of UTF-16LE
(or you were running as a dedicated little-endian API on a Windows
system, or you received the output of a heuristic analysis of the
last piece of text received, and so on), then you could take that
outside information that specified (2) to be little-endian and
interpret the byte sequence as the serialization of <1234 0061
D800 DF00> instead.
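
As a rough illustration of the labeled case, the sketch below lets the
label decide the byte order. The function name `decode_labeled_utf16`
is an invention for this example, and the label string stands in for
whatever outside information the higher-level protocol supplies. For
"UTF-16BE" and "UTF-16LE" the label alone fixes the order and no BOM
is stripped; for "UTF-16" the BOM is honored if present and big-endian
is assumed otherwise.

```python
import codecs

def decode_labeled_utf16(data: bytes, label: str) -> str:
    """Decode UTF-16 bytes using an externally supplied label."""
    label = label.upper()
    if label == "UTF-16BE":
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        return data.decode("utf-16-le")
    if label == "UTF-16":
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        return data.decode("utf-16-be")     # BOM-less "UTF-16" is big-endian
    raise ValueError(f"not a UTF-16 encoding scheme label: {label!r}")

case_2 = bytes.fromhex("34 12 61 00 00 D8 00 DF")

# With the UTF-16LE label, (2) is read as <U+1234, U+0061, U+10300>.
print([f"U+{ord(c):04X}" for c in decode_labeled_utf16(case_2, "UTF-16LE")])
```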

Summary:

There are 3 encoding schemes for UTF-16 data: "UTF-16", "UTF-16BE",
and "UTF-16LE".

There are two allowable orders for serialization, either of which
can be preceded by a BOM (i.e., 1, 2, 3, 4 above).

In the absence of a higher-level protocol, and in the absence of a
BOM, big-endian order is assumed for a serialization. That is the
source of the asymmetry noted above, and is the nature of the
'preference' for big-endian byte order.

--Ken


