Re: Why is "endianness" relevant when storing data on disks but not when in memory?

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Sun, 06 Jan 2013 02:57:16 +0100

Doug Ewell, Sat, 5 Jan 2013 18:11:59 -0700:
> "Martin J. Dürst" wrote:

>>> When Unicode data is sent across the Internet we would say, "The
>>> UTF-32 data was sent across the Internet."
>>
>> The first is correct. The second is correct. The third is wrong.
>> [ snip ] you would say:
>>
>> "The UTF-32BE data was sent across the Internet."
>
> The larger problem here is that most civilians don't understand what
> is truly meant by "UTF-32BE" and "UTF-32LE".
>
> In general, people think these terms simply mean "big-endian UTF-32"
> and "little-endian UTF-32" respectively, without the additional
> connotation (defined in D99 and D100) that U+FEFF at the beginning of
> a stream defined as "UTF-32BE" or "UTF-32LE" is supposed to be
> interpreted, against all logic, as a zero-width no-break space.

(I agree that it is against logic.)

> Because of this, it's not automatically the case that "the file
> contains UTF-32BE data." That statement implies that there is no
> initial U+FEFF, or if there is one, that it is meant to be a ZWNBSP.
> You could just as easily have a "UTF-32" file, which might have an
> initial U+FEFF (which then defines the endianness of the data) or
> might not (which means the data is big-endian unless a "higher-level
> protocol" dictates otherwise).

I believe that even the U+FEFF *itself* is serialized as either
UTF-32LE or UTF-32BE. Thus Martin’s statement does not, per se, imply
the absence of a byte-order mark. Assuming that the label "UTF-32" is
defined the same way as the label "UTF-16", it is an umbrella label or
a "macro label" (compare: macro language) which covers the two *real*
encoding schemes, UTF-32LE and UTF-32BE.
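
A minimal sketch of that, again with Python's codec names standing in
for the labels:

    '\ufeff'.encode('utf-32-be')  # b'\x00\x00\xfe\xff'
    '\ufeff'.encode('utf-32-le')  # b'\xff\xfe\x00\x00'

On the wire, the signature only ever exists as one of those two byte
sequences, so it is itself "UTF-32BE data" or "UTF-32LE data".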

Just my 5 øre.

-- 
leif halvard silli
Received on Sat Jan 05 2013 - 20:00:51 CST
