RE: MS/Unix BOM FAQ again (small fix)

From: George W Gerrity (ggerrity@dragnet.com.au)
Date: Sat Apr 13 2002 - 07:17:09 EDT


At 23:27 -0700 2002-04-11, Doug Ewell wrote:
>George W Gerrity <ggerrity@dragnet.com.au> wrote:
>
>> To expand on this, imagine there is a text file in some encoding on
>> some medium created by a little-endian machine (say a DEC Vax or a
>> Macintosh 68000), and it is to be accessed on a big-endian machine
>> (any Intel 8080 -- Pentium architecture).
>
>This doesn't answer your main question, but: You've got your
>terminology backward. Architectures that store the most significant
>byte first, like the Vax and Macintosh, are called "big-endian," while
>those that store the least significant byte first, like the Intel
>series, are called "little-endian."

At 09:39 -0700 2002-04-12, Andy Heninger wrote:
>Just to set the historical record straight,
>
>Little Endian Machines include
> DEC VAX (and PDP-11 before it)
> Intel x86 (and 8080 before it)
>
>Big Endian Machines include
> Macintosh, both 68000 and PowerPC
>
> -- Andy Heninger
> heninger@us.ibm.com

At 10:10 +0300 2002-04-12, Paul Keinanen wrote:
>On Fri, 12 Apr 2002 11:38:55 +1000, George W Gerrity
><ggerrity@dragnet.com.au> wrote:
>
>
>>To expand on this, imagine there is a text file in some encoding on
>>some medium created by a little-endian machine (say a DEC Vax or a
>>Macintosh 68000), and it is to be accessed on a big-endian machine
>>(any Intel 8080 -- Pentium architecture).
>
>You are mixing up things, DEC PDP/VAX and Intel x86 are both little
>endian, while Motorola 68000 is big-endian.

#1. Quite right. I haven't taught Computer Architecture for a number
of years, and when I retired I gave away all my texts on the
subject, so I couldn't check my (obviously failing) memory. However,
your (Doug Ewell's) definition is confirmed in Comer, "Computer
Networks and Internets", 2nd ed., p. 479. In any case, I was trying to
illustrate with an example in which the source and destination
architectures had differing endian-ness.

>>Unless the two CPUs are
>>sharing the same RAM in order to share the file data in that RAM, the
>>data will have to be accessed by reading some storage medium, such as
>>mag tape, floppy disc, hard disc, CD-ROM, etc, or by some file
>>transfer method on a network. _All_ of these accessing methods are
>>either bit-serial or byte-serial,
>
>I have used 16 bit wide disks and disk controllers, although not very
>recently :-).

#2. Yes, but it is a _byte_ stream that is delivered to the
controllers from the head decoders, and the hardware interface fixes
the endian-ness for the target architecture. This _always_ gives the
correct RAM image, whatever the length in bytes of the underlying data.

>>transmitting the most significant
>>bit of the most significant byte first, and the little/big-endian
>>storage in the RAM receiving buffers is done correctly by the target
>>machine. True, the low-level programming in a portable OS such as
>>*NIX, say, has to take cognizance of endian-ness, but even that is
>>pretty sparse.
>
>There is exactly the same problem as with the end of line (eoln)
>convention, you have to know when to swap (i.e. text files) and when
>not to swap (all other files).

#3. I beg to differ. It has nothing to do with endian-ness. The point
here is that if FTP is informed that the file is ASCII text, then the
receiving side will search the incoming character stream for LF
and CR and will replace them with the correct eoln marker for the
destination machine. Otherwise, it keeps its cotton-picking fingers
off the data stream and delivers it in byte (address) order, which
will be correct no matter what the endian-ness.
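
To make the distinction concrete, here is a minimal sketch (mine, not
anybody's actual FTP code) of what the receiving side does in ASCII
mode; in binary mode none of this rewriting happens and the bytes go
straight through in address order:

    #include <stdio.h>

    /* Illustrative sketch only: copy an ASCII-mode stream from 'in'
       to 'out', rewriting the network eoln (CR LF) as the local eoln,
       here assumed to be a bare LF as on a Unix host.  Binary mode
       would be a plain byte-for-byte copy with no such rewriting. */
    static void ascii_mode_copy(FILE *in, FILE *out)
    {
        int c;

        while ((c = getc(in)) != EOF) {
            if (c == '\r') {
                int next = getc(in);
                if (next == '\n') {
                    putc('\n', out);       /* CR LF -> local eoln   */
                } else {
                    putc('\r', out);       /* lone CR: pass through */
                    if (next != EOF)
                        putc(next, out);
                }
            } else {
                putc(c, out);
            }
        }
    }

    int main(void)
    {
        ascii_mode_copy(stdin, stdout);
        return 0;
    }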

>The file system in most operating systems (MS-DOS / Windows / Unix)
>are quite primitive, with very little information about the file
>type in the file system. VMS keeps some information about the file
>structure in the index file, but not necessary the information
>required for handling eoln or swapping issues.

#4. Primitiveness is irrelevant. Precisely my point is that at the
level of byte alignment, data type is irrelevant, unknown, and
unknowable. Whenever the source and destination architectures transfer
to byte- or bit-serial storage, or use a byte- or bit-serial transfer
method, even with multibyte transfer registers, all existing
controllers (to my knowledge) serialise the data _as if_ the transfer
were occurring over an indexed array of bytes, from most to least
significant _in the endian-ness of the host_, since this is _always_
known by the controller.

>In FTP this is quite easy, since the _user_ has to tell when to make a
>verbatim copy ("binary" mode) and when to convert the end of line
>convention ("ascii" mode). Apparently, in order to handle Unicode
>transfers, the user would have to tell if the data is "binary",
>"ascii" (including single byte code pages and UTF-8), UTF-16 or
>UTF-32.

#5. See comment #3.

>Apparently in a web browsing session when the user requests an FTP
>file download, some heuristics (e.g. file name extensions) are used to
>determine if a verbatim FTP transfer should be made or if some eoln
>processing has to be done.

#6. My understanding is that any Content-Type: text/plain;
charset=ASCII or unmarked MIME type data transfer over HTTP obeys
telnet rules, i.e., an eoln is CR LF. If you wish to use any UTF
encoding, you need to declare the charset explicitly (e.g.,
Content-Type: text/plain; charset=UTF-8) to prevent this happening.

>A much more complicated heuristics would have to be used to
>determine if the file is plain text in UTF-16 or UTF-32 format.
>
>The same problems still apply when disks from various computers are
>mapped into a single computer.

#7. Could be. I've never run into a situation where computers of two
different types shared the same controller: indeed, only in a
multi-processor architecture does it make sense to share a controller.

>For truly transparent file access, each file in each disk must carry
>some information if this is a non-plain text file ("binary), a plain
>text file, a UTF-16 file or a UTF-32 file. Unfortunately, the
>primitive file systems in the most popular operating systems
>(Windows/unixes) does not carry this information, so in my opinion,
>all UTF-16/UTF-32 should contain a BOM, to help the heuristics to
>determine the file type.

#8. No. Let's say two machines of different endian-ness share a disk
using separate (say, SCSI) controllers, or even share a (SCSI) bus.
In both cases, the controller interface delivers the bytes in the
correct order. The _controller_ knows the endian-ness of its host and
the data delivery format. Data type is irrelevant.

Moreover, even that type of scenario is uncommon. The usual case is
for mixed OSs and/or architectures to share a file server over a LAN.
In this case, the server interface adds a further layer of
compatibility, usually by automatically mapping file names and ASCII
text eolns on the fly, which it can do since it knows the source and
destination OS conventions. Once again, endian-ness is not a concern.

At 14:00 +0200 2002-04-12, Lars Kristan wrote:
>George W Gerrity wrote:
>> _All_ of these accessing methods are
>> either bit-serial or byte-serial, transmitting the most significant
>> bit of the most significant byte first, and the little/big-endian
>> storage in the RAM receiving buffers is done correctly by the target
>> machine.
>
>As for bits, it can be either most significant first or least significant
>first, but this is only important in serial communication. And
>fortunately, there are many reasons why we usually don't need to
>bother with the choice for this.

#9. True. But in any case, do you know of any serial format that
doesn't transmit most-significant first? I don't.

>As for bytes - well, which is the most significant byte?! Your
>statement is wrong there. Transmission is done by memory address,
>not by significance. Low to high, of course.

#10. Not quite correct, but see below.

>The fact that processor architectures store data types larger that 8
>bits in two ways is what causes the problem here. And if it wasn't
>for performance, the byte order could be standardized.

#11. All transmission, whether to a network controller or to a disk or
tape controller (or even to RAM, for that matter), involves a (possibly
hidden) hardware register, which may be one or many bytes in size.
Transfer to or from that H/W register is by memory address, as it
must be, and if we imagined an identical controller used for machines
of different endian-ness, then of course there would be a horrible
stuffup. However, underneath this H/W register is a bit or byte
stream dictated by the structure of the storage or transmission
medium. My understanding is that the controller, knowing the
endian-ness of the host, maps the hardware register in such a way
that the transfer order is the same as if byte addressing _on the
host_ were used, i.e., storage (or transmission) order is independent
of the software or hardware buffer size used in the transfer, and of
endian-ness. If this were not so, then even two different programs on
the same machine and OS, accessing data with different sized software
buffers, would be unable to read each other's data. Moreover, two
machines with different word sizes but the same architecture -- as is
the case for most CPU families today -- would be unable to share a
binary file on a shared disk. To my knowledge, this never happens,
because the controllers ensure correct delivery order regardless of
CPU chunk size.
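
A toy demonstration of that last point on a single machine (just an
illustration I knocked together, not controller code): two copies of
the same data, one made word-at-a-time and one byte-at-a-time, always
contain the same byte sequence in address order, so programs using
different buffer or register sizes on the same host can read each
other's data. The sequence itself differs between big- and
little-endian hosts, of course.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint32_t words[2] = { 0x04030201u, 0x08070605u };
        uint32_t wordcopy[2];
        unsigned char via_bytes[8];
        size_t i;

        /* "transfer" the buffer word-at-a-time */
        for (i = 0; i < 2; i++)
            wordcopy[i] = words[i];

        /* "transfer" the same buffer byte-at-a-time, in address order */
        for (i = 0; i < sizeof words; i++)
            via_bytes[i] = ((const unsigned char *)words)[i];

        /* both transfers deliver the identical byte sequence */
        assert(memcmp(wordcopy, via_bytes, sizeof words) == 0);

        /* prints 01 02 on a little-endian host, 04 03 on a big-endian one */
        printf("first two bytes on this host: %02x %02x\n",
               via_bytes[0], via_bytes[1]);
        return 0;
    }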

I suppose a simple test could demonstrate this. Both Windows and Macs
use IEEE standard floating-point numbers. Write a C program that
generates a file containing an array of random 4-byte FP numbers on a
Windows machine, then FTP the file as binary data across an Ethernet
to a Macintosh. I believe that a simple C print-out program will
generate the same text (subject to rounding error) on both machines.
Unfortunately, I don't have access to a Windows machine at the moment,
so someone else would have to do it for us.
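
Something along these lines would do (only a sketch; the file name
and count are arbitrary, and it assumes both machines use 4-byte
IEEE 754 floats):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define COUNT 100                /* number of 4-byte FP values (arbitrary) */
    #define FNAME "fptest.bin"       /* file name is arbitrary too             */

    /* Sketch of the proposed test.  Run "fptest write" on the Windows
       machine, then "fptest read" there to capture its text output;
       FTP fptest.bin across in binary mode; run "fptest read" on the
       Macintosh and compare the two text outputs. */
    int main(int argc, char **argv)
    {
        float buf[COUNT];
        FILE *f;
        int i;

        if (argc == 2 && strcmp(argv[1], "write") == 0) {
            srand(12345);                          /* fixed seed */
            for (i = 0; i < COUNT; i++)
                buf[i] = (float)rand() / (float)RAND_MAX;
            f = fopen(FNAME, "wb");
            if (f == NULL || fwrite(buf, sizeof(float), COUNT, f) != COUNT)
                return 1;
            fclose(f);
        } else if (argc == 2 && strcmp(argv[1], "read") == 0) {
            f = fopen(FNAME, "rb");
            if (f == NULL || fread(buf, sizeof(float), COUNT, f) != COUNT)
                return 1;
            fclose(f);
            for (i = 0; i < COUNT; i++)
                printf("%g\n", buf[i]);
        } else {
            fprintf(stderr, "usage: fptest write|read\n");
            return 1;
        }
        return 0;
    }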

At 08:52 -0600 2002-04-12, John H. Jenkins wrote:
>On Thursday, April 11, 2002, at 07:38 PM, George W Gerrity wrote:
>
>>"Considering the difficulty af actually getting access to a file in
>>such a manner that the 'endian-ness' of the computer architecture
>>is NOT transparent, why do we even need a byte-order mark?"
>>
>
>Actually, this isn't at all difficult. Macs read, for example, both
>PC and Mac formatted devices such as floppies (remember them?) and
>Zip disks.
> As it happens, the last bunch of floppies we bought at my house
>were preformatted for PC's, and being lazy we haven't reformatted
>them when we use them for Macs. A pure file on such a disk could
>easily be a PC file created by a Windows app or a Mac file created
>by a Mac app. There's no way of telling based on the media itself
>from which architecture the file came.

#12. Are you agreeing with me, then? I do this all the time at work,
and the transfer is completely compatible.

At 09:28 -0700 2002-04-12, Markus Scherer wrote:
>George W Gerrity wrote:
>
>To expand on this, imagine there is a text file in some encoding on
>some medium created by a little-endian machine (say a DEC Vax or a
>Macintosh 68000), and it is to be accessed on a big-endian machine
>(any Intel 8080 -- Pentium architecture). Unless the two CPUs are
>sharing the same RAM
>
>(Doug set the endiannesses straight.)
>
>in order to share the file data in that RAM, the data will have to
>be accessed by reading some storage medium, such as mag tape, floppy
>disc, hard disc, CD-ROM, etc, or by some file transfer method on a
>network. _All_ of these accessing methods are either bit-serial or
>byte-serial, transmitting the most significant bit of the most
>significant byte first, and the little/big-endian storage in the RAM
>receiving buffers is done correctly by the target machine. True, the
>low-level programming in
>
>
>Well, no, the target machine cannot 'magically do it correctly',
>that's why this is an issue not only for Unicode but for all
>protocols and file formats that use 16-bit-and-larger units.
>The source machine byte-serializes such units some way, and if there
>is no way to tell the byte order (by protocol, format definition, or
>flag in the byte stream) then the target machine may get garbage.

#13. Again, I have been programming on multiple architectures for
some time, and this problem is hard to find. The reason that the
target machine can read the data is that its hardware interface knows
the (internal standard) serialised byte order on the media, and the
interface arranges these bytes correctly for the endian-ness of its
host machine, no matter which direction the data is transferred. This
is true even when transferring from the CPU to RAM on an internal
bus. I remember the detail that both DEC and Motorola went into to
explain to designers how H/W transfer registers (accessed as memory
in this case) needed to be constructed to ensure byte-mapping
consistency for all hardware-supported data sizes. As mentioned
above, if this did not occur correctly, then even machines with the
same basic architecture but different buffer sizes would be unable to
share a disc drive, for instance.

At 03:26 +0900 2002-04-13, Dan Kogai wrote:
>On Friday, April 12, 2002, at 10:38 , George W Gerrity wrote:
>>To expand on this, imagine there is a text file in some encoding on
>>some medium created by a little-endian machine (say a DEC Vax or a
>>Macintosh 68000), and it is to be accessed on a big-endian machine
>>(any Intel 8080 -- Pentium architecture).
>
> I KNOW someone will pick this nit sooner or later but since I
>recently implemented all UTF-(16|32)(LE|BE)? for Encode, a perl
>module (also the biggest in size thanks to CJK Unification :) comes
>with every Perl 5.8) that does all the Unicode-(from|to)-X
>transcoding, it would be appropriate for me to do so.
> Your endianness for MC 680x0 and IA-32 is upside down. 68k is in
>network byte order and IA-32 is in VAX byte order.

#14. See #1.

>Here is a C one-liner that tells the endianness. 4 for BE and 1 for LE.
>
> int main(){ int e=0x04030201 ; printf("%d\n", *((char *)&e)); }
>
>And of course, in perl one-liner.
>
> perl -e 'print pack("C", unpack("L", "1234")), "\n"'

#15. Have you checked that this code ever gets called except for a
test file you created? Where did you get your test data? My real
point is that I wouldn't know how to get test data without generating
it artificially.
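
(For the record, here is the same probe padded out so that it
compiles cleanly on its own; I take it this is what Dan intends.)

    #include <stdio.h>

    /* Prints 4 on a big-endian host and 1 on a little-endian one, by
       looking at the byte stored at the lowest address of the 32-bit
       value 0x04030201. */
    int main(void)
    {
        int e = 0x04030201;
        printf("%d\n", *((char *)&e));
        return 0;
    }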

>
>>I acknowledge that the BOM _can_ be used to differentiate between
>>various encodings -- UTF-8, UTF-16, UTF-32, non-Unicode -- but
>>then, that has _nothing_ to do with byte order. Perhaps it should
>>be renamed?
>
>It definitely has A LITTLE to do -- If BOM is the opposite of the
>endianness of your computer, flip the bytes before doing anything
>further. It does not say anything about the endianness of her/his
>machine where the data is originated because any computer can choose
>to prepend both versions of the BOM, however.

#16. Again, it is obvious. What isn't obvious (to me, at least) is
whether in real life this part of your code will _ever_ be exercised.
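
For concreteness, the check Dan describes amounts to something like
this (my sketch only, assuming the UTF-16 code units have already
been read into a buffer of 16-bit units):

    #include <stddef.h>
    #include <stdint.h>

    /* If the first code unit of a UTF-16 buffer reads back as the
       byte-swapped BOM (0xFFFE instead of 0xFEFF), the data was
       serialised in the opposite byte order to this host's, so every
       unit is swapped before anything further is done.  With no BOM
       at all, nothing here can tell the order. */
    static void fix_utf16_order(uint16_t *units, size_t count)
    {
        size_t i;

        if (count == 0 || units[0] != 0xFFFEu)
            return;             /* native order, or no BOM to go on */
        for (i = 0; i < count; i++)
            units[i] = (uint16_t)((units[i] >> 8) | (units[i] << 8));
    }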

>FYI Encode module uses BE BOM when it encodes (from perl's native
>UTF-8) to UTF-16 or UTF-32 with no endianness specified, even on my
>FreeBSD box which is LE. So far there is no such option to make
>BOMmed UTF-16 or UTF-32 with little endian BOM. This decision has
>made the code much simpler and should not affect usability a bit.
>But if you don't like it, this weekend is the last chance to say so
>because code freeze is coming!
>
>Dan the BOMeed Man

George


