Default endianness of Unicode, or not

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Apr 09 2002 - 23:13:09 EDT


Yves Arrouye <yves@realnames.com> wrote:

> The last time I read the Unicode standard UTF-16 was big endian
> unless a BOM was present, and that's what I expected from a UTF-16
> converter.

Conformance requirement C2 (TUS 3.0, p. 37) says:

"The Unicode Standard does not specify any order of bytes inside a
Unicode value."

In Section 2.7, the passage on page 28 titled "Byte Order Mark (BOM)"
says:

"... Ideally, all implementations of the Unicode Standard would follow
only one set of byte order rules, but this scheme would force one class
of processors to swap the byte order on reading and writing plain text
files, even when the file never leaves the system on which it was
created."

Section 13.6, "Specials: U+FEFF, U+FFF0-U+FFFF," again acknowledges the
potential ambiguity of byte order without indicating a preference:

"... Some machine architectures use the so-called big-endian byte order,
while others use the little-endian byte order. When Unicode text is
serialized into bytes, the bytes can go in either order, depending on
the architecture."

And Unicode Standard Annex #19, "UTF-32," Section 2, distinguishes
between UTF-32BE, UTF-32LE, and UTF-32, specifically stating that the
latter may be serialized "in either big-endian or little-endian format."
Presumably UTF-16 would be consistent with this.
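
For what it's worth, that three-way split can be implemented with
nothing more than a look at the first four bytes. A sketch (the enum
and function name are mine):

    #include <stddef.h>

    /* Classify the start of a stream declared to be UTF-32, following
       the three-way split in UAX #19. */
    typedef enum { U32_BIG, U32_LITTLE, U32_UNMARKED } u32_order;

    u32_order sniff_utf32_bom(const unsigned char *p, size_t n)
    {
        if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
                      p[2] == 0xFE && p[3] == 0xFF)
            return U32_BIG;       /* U+FEFF serialized big-endian    */
        if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
                      p[2] == 0x00 && p[3] == 0x00)
            return U32_LITTLE;    /* U+FEFF serialized little-endian */
        return U32_UNMARKED;      /* no BOM: byte order must come
                                     from some higher-level protocol */
    }

Note that FF FE 00 00 is also a perfectly valid start for little-endian
UTF-16 text whose first character after the BOM is U+0000, so a sniffer
that doesn't already know the encoding form can be fooled.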

I do remember reading once, somewhere, that big-endian was a preferred
default in the absence of *any* other information (including platform of
origin). But I can't find anything in the Unicode Standard to back this
up, so I'll assume for now that both byte orders are considered
equally legitimate.
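
In practice, if I were writing a UTF-16 reader today, I would sniff the
BOM and fall back to big-endian when there isn't one; but to be clear,
that fallback is exactly the preference I couldn't substantiate above,
a policy choice rather than anything the standard requires. A sketch
(function names mine; surrogate pairs left unpaired for brevity):

    #include <stddef.h>
    #include <stdint.h>

    /* BOM-sniffing UTF-16 reader.  The big-endian fallback is a policy
       choice, not a rule of the standard. */
    static uint16_t read_u16(const uint8_t *p, int big)
    {
        return big ? (uint16_t)((p[0] << 8) | p[1])
                   : (uint16_t)((p[1] << 8) | p[0]);
    }

    size_t decode_utf16(const uint8_t *in, size_t n, uint16_t *out)
    {
        int big = 1;    /* fallback: assume big-endian without a BOM */
        size_t i = 0, o = 0;

        if (n >= 2) {
            uint16_t first = read_u16(in, 1);  /* read as big-endian */
            if (first == 0xFEFF)      { i = 2; }          /* BE BOM */
            else if (first == 0xFFFE) { big = 0; i = 2; } /* LE BOM */
        }
        for (; i + 1 < n; i += 2)
            out[o++] = read_u16(in + i, big);
        return o;    /* number of 16-bit code units produced */
    }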

-Doug Ewell
 Fullerton, California
 "Little-endian" user


