Re: Why is "endianness" relevant when storing data on disks but not when in memory? from Doug Ewell on 2013-01-05 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Sat, 5 Jan 2013 18:11:59 -0700

"Martin J. Dürst" wrote:

>> When Unicode data is in a file we would say, for example, "The file
>> contains UTF-32BE data."
>>
>> When Unicode data is in memory we would say, "There is UTF-32 data in
>> memory."
>>
>> When Unicode data is sent across the Internet we would say, "The
>> UTF-32 data was sent across the Internet."
>
> The first is correct. The second is correct. The third is wrong. The
> Internet deals with data as a series of bytes, and by its nature has
> to pass data between big-endian and little-endian machines. Therefore,
> endianness is very important on the Internet. So you would say:
>
> "The UTF-32BE data was sent across the Internet."

The larger problem here is that most civilians don't understand what is
truly meant by "UTF-32BE" and "UTF-32LE".

In general, people think these terms simply mean "big-endian UTF-32" and
"little-endian UTF-32" respectively, without the additional connotation
(from definitions D99 and D100 in the Unicode Standard) that U+FEFF at
the beginning of a stream defined as "UTF-32BE" or "UTF-32LE" is supposed
to be interpreted, against all logic, as a zero-width no-break space.
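
To make the distinction concrete, here is a small sketch in Python. It
assumes Python's standard "utf-32-be" and "utf-32" codecs as stand-ins
for the two labels; the byte string and variable names are made up for
illustration.

```python
# Eight bytes: U+FEFF in big-endian order, followed by "A" (U+0041).
data = b"\x00\x00\xfe\xff\x00\x00\x00\x41"

# Labeled "UTF-32BE": the initial U+FEFF is kept as an ordinary
# character (a zero-width no-break space), not treated as a signature.
as_utf32be = data.decode("utf-32-be")

# Labeled plain "UTF-32": the initial U+FEFF is consumed as a byte
# order mark and does not appear in the decoded text.
as_utf32 = data.decode("utf-32")

print(repr(as_utf32be))  # two characters: U+FEFF, then "A"
print(repr(as_utf32))    # one character: "A"
```

Same bytes, different label, different text.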

Because of this, it's not automatically the case that "the file contains
UTF-32BE data." That statement implies that there is no initial U+FEFF,
or if there is one, that it is meant to be a ZWNBSP. You could just as
easily have a "UTF-32" file, which might have an initial U+FEFF (which
then defines the endianness of the data) or might not (which means the
data is big-endian unless a "higher-level protocol" dictates otherwise).
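
The rule for a plain "UTF-32" label can be sketched as follows. This is
only an illustration of the logic described above, not anyone's
reference implementation; the helper name decode_utf32_scheme is
hypothetical, and the big-endian fallback is spelled out by hand rather
than left to a library default. No "higher-level protocol" is modeled.

```python
BOM_BE = b"\x00\x00\xfe\xff"  # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe\x00\x00"  # U+FEFF serialized little-endian


def decode_utf32_scheme(data: bytes) -> str:
    """Hypothetical helper: decode bytes labeled as plain "UTF-32"."""
    if data.startswith(BOM_BE):
        # Initial U+FEFF says big-endian; strip it and decode the rest.
        return data[4:].decode("utf-32-be")
    if data.startswith(BOM_LE):
        # Initial U+FEFF says little-endian; strip it and decode the rest.
        return data[4:].decode("utf-32-le")
    # No initial U+FEFF: big-endian, absent a higher-level protocol.
    return data.decode("utf-32-be")


print(decode_utf32_scheme(BOM_BE + b"\x00\x00\x00A"))  # big-endian with BOM
print(decode_utf32_scheme(BOM_LE + b"A\x00\x00\x00"))  # little-endian with BOM
print(decode_utf32_scheme(b"\x00\x00\x00A"))           # no BOM
```

All three inputs decode to the same one-character string, which is the
point: under the "UTF-32" label the initial U+FEFF is metadata, whereas
under "UTF-32BE" or "UTF-32LE" it would be part of the text.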

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell

Received on Sat Jan 05 2013 - 19:15:56 CST
