Default endianness of Unicode, or not

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Apr 09 2002 - 23:13:09 EDT


Yves Arrouye <yves@realnames.com> wrote:

> The last time I read the Unicode standard UTF-16 was big endian
> unless a BOM was present, and that's what I expected from a UTF-16
> converter.

Conformance requirement C2 (TUS 3.0, p. 37) says:

"The Unicode Standard does not specify any order of bytes inside a
Unicode value."

In Section 2.7, the passage on page 28 titled "Byte Order Mark (BOM)"
says:

"... Ideally, all implementations of the Unicode Standard would follow
only one set of byte order rules, but this scheme would force one class
of processors to swap the byte order on reading and writing plain text
files, even when the file never leaves the system on which it was
created."

Section 13.6, "Specials: U+FEFF, U+FFF0-U+FFFF," again acknowledges the
potential ambiguity of byte order without indicating a preference:

"... Some machine architectures use the so-called big-endian byte order,
while others use the little-endian byte order. When Unicode text is
serialized into bytes, the bytes can go in either order, depending on
the architecture."

And Unicode Standard Annex #19, "UTF-32," Section 2, distinguishes
between UTF-32BE, UTF-32LE, and UTF-32, specifically stating that the
latter may be serialized "in either big-endian or little-endian format."
Presumably UTF-16 would be consistent with this.
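
For what it's worth, that three-way split can be implemented with
nothing more than a look at the first four bytes. A sketch (the enum
and function name are mine):

    #include <stddef.h>

    /* Classify the start of a stream declared to be UTF-32, following
       the three-way split in UAX #19. */
    typedef enum { U32_BIG, U32_LITTLE, U32_UNMARKED } u32_order;

    u32_order sniff_utf32_bom(const unsigned char *p, size_t n)
    {
        if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
                      p[2] == 0xFE && p[3] == 0xFF)
            return U32_BIG;       /* U+FEFF serialized big-endian    */
        if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
                      p[2] == 0x00 && p[3] == 0x00)
            return U32_LITTLE;    /* U+FEFF serialized little-endian */
        return U32_UNMARKED;      /* no BOM: byte order must come
                                     from some higher-level protocol */
    }

Note that FF FE 00 00 is also a perfectly valid start for little-endian
UTF-16 text whose first character after the BOM is U+0000, so a sniffer
that doesn't already know the encoding form can be fooled.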

I do remember reading once, somewhere, that big-endian was a preferred
default in the absence of *any* other information (including platform of
origin). But I can't find anything in the Unicode Standard to back this
up, so I'll assume for now that both byte orders are considered
equally legitimate.
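
In practice, if I were writing a UTF-16 reader today, I would sniff the
BOM and fall back to big-endian when there isn't one; but to be clear,
that fallback is exactly the preference I couldn't substantiate above,
a policy choice rather than anything the standard requires. A sketch
(function names mine; surrogate pairs left unpaired for brevity):

    #include <stddef.h>
    #include <stdint.h>

    /* BOM-sniffing UTF-16 reader.  The big-endian fallback is a policy
       choice, not a rule of the standard. */
    static uint16_t read_u16(const uint8_t *p, int big)
    {
        return big ? (uint16_t)((p[0] << 8) | p[1])
                   : (uint16_t)((p[1] << 8) | p[0]);
    }

    size_t decode_utf16(const uint8_t *in, size_t n, uint16_t *out)
    {
        int big = 1;    /* fallback: assume big-endian without a BOM */
        size_t i = 0, o = 0;

        if (n >= 2) {
            uint16_t first = read_u16(in, 1);  /* read as big-endian */
            if (first == 0xFEFF)      { i = 2; }          /* BE BOM */
            else if (first == 0xFFFE) { big = 0; i = 2; } /* LE BOM */
        }
        for (; i + 1 < n; i += 2)
            out[o++] = read_u16(in + i, big);
        return o;    /* number of 16-bit code units produced */
    }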

-Doug Ewell
 Fullerton, California
 "Little-endian" user


