[OT] RE: MS/Unix BOM FAQ again (small fix)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Apr 13 2002 - 09:55:43 EDT


> George W Gerrity wrote:
> >As for bytes - well, which is the most significant byte?! Your
> >statement is wrong there. Transmission is done by memory address,
> >not by significance. Low to high, of course.
>
> #10. Not quite correct, but see below.
>
> >The fact that processor architectures store data types larger than 8
> >bits in two ways is what causes the problem here. And if it wasn't
> >for performance, the byte order could be standardized.
> #11. All transmission, either to a network controller or to a disk or
> tape controller (or even RAM, for that matter), involves a (possibly
> hidden) hardware register, which may be one or many bytes in size.
> Transfer to or from that H/W register is by memory address, as it
> must be, and if we imagined an identical controller used for machines
> of different endian-ness, then of course there would be a horrible
> stuffup. However, underneath this H/W register is a bit or byte
> stream due to the structure of the storage or transmission medium. My
> understanding is that the controller, knowing the endian-ness of the
> host, maps the hardware register in such a way that transfer order
> would be the same as if byte addressing _on the host_ were used, ie,
> storage (or transmission) order is independent of transmitted soft-or
> hardware buffer size, or endian-ness. If this were not so, then even
> two different programs on the same machine and OS, accessing data
> with different sized software buffers, would be unable to read each
> other's data. Moreover, two machines with different word sizes but
> the same architecture -- as is the case for most CPU families today
> -- would be unable to share a binary file on a shared disk. To my
> knowledge, this never happens because the controllers ensure correct
> delivery order regardless of CPU chunk size.
>
> I suppose a simple test could demonstrate this. Both Windows and Macs
> use IEEE standard floating-point numbers. If you write a C program to
> generate an array file of random 4-byte FP numbers on a Windows
> machine, then FTP it as binary data across an Ethernet to a
> Macintosh. I believe that a common C printout program will generate
> the same text (subject to rounding error) on both machines.
> Unfortunately, I haven't access to a windows machine at the moment,
> so someone else would have to do it for us.
>
The test you are proposing is something that happens every day, without
you even knowing about it. I don't know about IEEE standard floating-point
numbers; perhaps the test even succeeds with those. But floating-point
numbers are rarely used in communication, because, oh well, just because.
Why not a simple word, or a double word, as in Dan's example? Before we go
on, let me just tell you: the test fails.

Anyway, I do have access to both types of machines, and sometimes I also
need to fix bugs in the code whose job is to make sure end users don't have
to bother with the endian-ness of their machines. Basically, there are two
ways of doing it:
A - Standardize the format (as for network communication, see 'man htonl')
B - Write the information about the format with the data (like the BOM, or
"II" vs "MM" in TIFF, ...)

While A is simple and straightforward, it implies a performance penalty
for one architecture or the other. This is particularly undesirable in
data-intensive applications. Large images definitely fall in this category,
and if ALL text is to be Unicode, then so does text (not just plain text
files, but any and all text data).

Your assumption that a hardware register is involved in all I/O is also
wrong. How about DMA (Direct Memory Access)? And if a register is actually
used for I/O, it definitely does not set things straight, because care is
taken that the processor architecture (endian-ness) does not affect the I/O
in any way. Hence, transmission is ALWAYS done by memory address. Your
thinking would be OK if a 32-bit machine communicated in 32-bit entities.
Well, the I/O channel may be 32 bits wide, but that is just for performance.
In the end, it is all about bytes. Otherwise you wouldn't be able to read
the same files on a 16-bit machine...

Anyway, I was hoping the discussion would end long before now:

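#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>   /* POSIX; with MSVC use <io.h> and the _O_* flag names */

/* The dumps below assume a 2-byte short and a 4-byte long. */
struct endian_test_tag
{
    char abc8[4];
    short abc16[4];
    long abc32[4];
} endian_struct =
{
    {'A', 'B', 'C', 0},
    {65, 66, 67, 0},
    {65, 66, 67, 0}
};

int main(void)
{
    int fh;
    int nbytes;

    printf("sizeof(endian_struct) = %d\n", (int)sizeof(endian_struct));
    fh = open("abc.out", O_CREAT | O_TRUNC | O_RDWR, 0666);
    if (fh < 0)
        return 1;
    /* Write the struct exactly as it lies in memory, lowest address first. */
    nbytes = (int)write(fh, &endian_struct, sizeof(endian_struct));
    printf("%d bytes written.\n", nbytes);
    close(fh);
    return 0;
}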

On an LE machine (Intel):

00000000 41 42 43 00 41 00 42 00 ABC.A.B.
00000008 43 00 00 00 41 00 00 00 C...A...
00000010 42 00 00 00 43 00 00 00 B...C...
00000018 00 00 00 00 ....

On a BE machine:

00000000 41 42 43 00 00 41 00 42 ABC..A.B
00000008 00 43 00 00 00 00 00 41 .C.....A
00000010 00 00 00 42 00 00 00 43 ...B...C
00000018 00 00 00 00 ....

You were proposing that some register would take care of that. Suppose
that, on the LE machine above, a 32-bit register "set things straight" by
reversing each group of four bytes. You would get:

00000000 00 43 42 41 00 42 00 41 .CBA.B.A
00000008 00 00 00 43 00 00 00 41 ...C...A
00000010 00 00 00 42 00 00 00 43 ...B...C
00000018 00 00 00 00 ....

As you can see, the 32-bit values now match the BE dump, but the chars and
the shorts are scrambled. Endian-ness is a problem for each data type
specifically, so there is no general remedy.

Lars Kristan


