Re: Default endianness of Unicode, or not

From: Mark Davis (mark@macchiato.com)
Date: Sun Apr 14 2002 - 19:13:48 EDT


If serialized UTF-16 without a BOM could be in either byte order, then
the interpretation would be indeterminate. If you want to output <0x34
0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>, then tag it as UTF-16LE, not just
UTF-16.
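
To make the ambiguity concrete, here is a small Python sketch (an
illustration only, not part of any Unicode specification). It decodes
the BOM-less byte sequence above under both byte orders. The helper
name code_points is made up for this example. Both readings decode
without error, which is why an untagged sequence cannot be
interpreted reliably from the bytes alone.

```python
# The serialized bytes from the message above, with no BOM.
data = bytes([0x34, 0x12, 0x61, 0x00, 0x00, 0xD8, 0x00, 0xDC])

def code_points(text):
    """Format each code point of a string as U+XXXX (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Read the same bytes under each explicit byte order.
le_points = code_points(data.decode("utf-16-le"))
be_points = code_points(data.decode("utf-16-be"))

print(le_points)  # the intended text: U+1234, U+0061, U+10000
print(be_points)  # four unrelated BMP characters, also well-formed
```

Only the little-endian reading recovers <U+1234 U+0061 U+10000>; the
big-endian reading yields different but still well-formed text.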

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Mark Davis" <mark@macchiato.com>; <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <yves@realnames.com>
Sent: Sunday, April 14, 2002 15:28
Subject: Re: Default endianness of Unicode, or not

> Mark Davis <mark@macchiato.com> wrote:
>
> > Part of the problem is that the term "UTF-16" means two different
> > things. Let me see if I can make it clearer.
> >
> > Let "UTF-16M" refer to the in-memory form, which is a sequence of
> > 16-bit code units. The byte ordering is logically immaterial, since
> > it is not a sequence of bytes. Such a sequence does not use a BOM.
> > The code point sequence <U+1234 U+0061 U+10000> is represented as
> > the UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00>.
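
The mapping from code points to 16-bit code units described in the
quoted paragraph can be sketched in Python as follows. This is a
minimal illustration under the standard surrogate-pair arithmetic; the
function name to_utf16_code_units is hypothetical and does not come
from the thread.

```python
def to_utf16_code_units(code_points):
    """Convert Unicode code points to a list of 16-bit UTF-16 code units."""
    units = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
            # Surrogate code points and out-of-range values are not encodable.
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp < 0x10000:
            # BMP code points map to a single code unit.
            units.append(cp)
        else:
            # Supplementary code points map to a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))    # high (lead) surrogate
            units.append(0xDC00 | (offset & 0x3FF))  # low (trail) surrogate
    return units

units = to_utf16_code_units([0x1234, 0x0061, 0x10000])
print([f"0x{u:04X}" for u in units])
```

No byte order appears anywhere in this step; it only matters once the
code units are serialized to bytes.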
> >
> > Let "UTF-16", on the other hand, refer to only the byte-serialized
> > form.
>
> I think I understand the difference between the CEF called "UTF-16"
> and the CES called "UTF-16." That isn't where I'm having a problem.
>
> > The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is
> > represented as one of:
> > <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> > <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> > <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB
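
The three serializations listed in the quoted text can be reproduced
with Python's standard codecs. This is a sketch for illustration; the
variable names are made up here. It deliberately avoids Python's plain
"utf-16" encoder, which writes a BOM in the platform's native byte
order and so would give platform-dependent output.

```python
import codecs

text = "\u1234\u0061\U00010000"

# Big-endian code units with no byte order mark.
bomless = text.encode("utf-16-be")

# Big-endian code units preceded by the BOM (FE FF).
with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Little-endian code units preceded by the byte-swapped BOM (FF FE).
with_mob = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

for label, data in (("BOMless", bomless), ("BOM", with_bom), ("MOB", with_mob)):
    print(label, data.hex())
```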
>
> *This* is where I'm having a problem. Mark states here, again, that
> BOM-less UTF-16 (the CES) must be big-endian. That is:
>
> <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless
>
> is not an instance of any valid CES. That, to me, is a change from
> what Unicode has stated before, and from what Ken just said about
> using "other information" (which could include external tagging,
> knowledge of the originating platform, or heuristics) to determine
> the intended byte order.
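
One way a receiver might combine a BOM check with such "other
information" is sketched below in Python. The function decode_utf16 and
its fallback parameter are hypothetical names for this illustration.
The big-endian default reflects the position Mark describes in this
thread, and a caller with an external tag or platform knowledge could
pass a different fallback.

```python
import codecs

def decode_utf16(data, fallback="utf-16-be"):
    """Decode serialized UTF-16, honoring a leading BOM if one is present.

    `fallback` stands in for out-of-band information (an external tag,
    platform knowledge, or a heuristic result); it is an assumption of
    this sketch, not something defined by the thread.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: the bytes alone cannot settle the byte order.
    return data.decode(fallback)

tagged = decode_utf16(bytes.fromhex("fffe3412"))
untagged_default = decode_utf16(bytes.fromhex("3412"))
untagged_le = decode_utf16(bytes.fromhex("3412"), fallback="utf-16-le")
```

With a BOM the result is unambiguous; without one, the same two bytes
give U+3412 under the default and U+1234 when the caller supplies
little-endian.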
>
> Remember, I like the BOM. I happen to think it's a useful indicator
> of both file type and byte order (not really two different topics).
> But I do think the official deprecation, or omission from mention, of
> BOM-less little-endian UTF-16 is a change from past definitions that
> renders nonconformant a potentially large amount of existing UTF-16
> data.
>
> -Doug Ewell
> Fullerton, California
