Endian Checker [Was: Re: MS/Unix BOM FAQ again (small fix)]

From: Dan Kogai (dankogai@dan.co.jp)
Date: Fri Apr 12 2002 - 14:26:09 EDT

Previous message: Tom Gewecke: "Re: Please help: Unicode sig in Hotmail"
In reply to: George W Gerrity: "Re: MS/Unix BOM FAQ again (small fix)"
Next in thread: Lars Kristan: "RE: MS/Unix BOM FAQ again (small fix)"

On Friday, April 12, 2002, at 10:38 , George W Gerrity wrote:
> To expand on this, imagine there is a text file in some encoding on
> some medium created by a little-endian machine (say a DEC Vax or a
> Macintosh 68000), and it is to be accessed on a big-endian machine (any
> Intel 8080 -- Pentium architecture).

I KNOW someone will pick this nit sooner or later, but since I recently
implemented all of UTF-(16|32)(LE|BE)? for Encode, the perl module that
does all the Unicode-(from|to)-X transcoding (it comes with every Perl 5.8
and is also the biggest in size, thanks to CJK Unification :), it would be
appropriate for me to do so.
Your endianness for MC 680x0 and IA-32 is upside down. 68k is in
network byte order (big-endian) and IA-32 is in VAX byte order
(little-endian). Here is a C one-liner that tells the endianness,
assuming a 32-bit int. It prints 4 for BE and 1 for LE.

#include <stdio.h>
int main(void){ int e=0x04030201; printf("%d\n", *((char *)&e)); return 0; }

And of course, as a perl one-liner. It prints the same 4 or 1.

perl -e 'print pack("C", unpack("L", "1234")), "\n"'

> I acknowledge that the BOM _can_ be used to differentiate between
> various encodings -- UTF-8, UTF-16, UTF-32, non-Unicode -- but then,
> that has _nothing_ to do with byte order. Perhaps it should be renamed?

It definitely has A LITTLE to do with it -- if the BOM is the opposite of
the endianness of your computer, flip the bytes before doing anything
further. It does not, however, say anything about the endianness of the
machine where the data originated, because any computer can choose to
prepend either version of the BOM.
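
To make the "flip the bytes" rule concrete, here is a minimal sketch in C.
It assumes the input is raw UTF-16 bytes that start with a BOM. The names
swap16 and utf16_bom_to_host are made up for this illustration and do not
come from Encode or any other library. The trick is to read the first code
unit in host order: if it comes out as 0xFEFF the data already matches the
host, and if it comes out as 0xFFFE every unit needs its bytes swapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap the two bytes of one UTF-16 code unit. */
static uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}

/*
 * Hypothetical helper: convert BOM-prefixed UTF-16 bytes to code units in
 * host byte order. The BOM itself is consumed, not copied.
 * Returns the number of code units written to out (at most out_cap),
 * or -1 if the input does not start with a BOM.
 */
static long utf16_bom_to_host(const unsigned char *bytes, size_t nbytes,
                              uint16_t *out, size_t out_cap)
{
    uint16_t bom;
    size_t i, n;
    int flip;

    if (nbytes < 2)
        return -1;

    memcpy(&bom, bytes, 2);      /* read the first unit in host order */
    if (bom == 0xFEFF)
        flip = 0;                /* BOM matches the host: nothing to do */
    else if (bom == 0xFFFE)
        flip = 1;                /* BOM is reversed: swap every unit */
    else
        return -1;               /* no BOM, so this sketch gives up */

    n = (nbytes - 2) / 2;
    if (n > out_cap)
        n = out_cap;

    for (i = 0; i < n; i++) {
        uint16_t u;
        memcpy(&u, bytes + 2 + 2 * i, 2);
        out[i] = flip ? swap16(u) : u;
    }
    return (long)n;
}
```

Note that the function never asks what the host's endianness is. The BOM
only tells you whether the data agrees with your machine, which is exactly
why it says nothing about the machine that wrote it.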

FYI, the Encode module uses a BE BOM when it encodes (from perl's native
UTF-8) to UTF-16 or UTF-32 with no endianness specified, even on my FreeBSD
box, which is LE. So far there is no option to produce BOMmed UTF-16 or
UTF-32 with a little-endian BOM. This decision has made the code much
simpler and should not affect usability a bit. But if you don't like
it, this weekend is the last chance to say so because code freeze is
coming!
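
For illustration only, here is roughly what "always write a BE BOM" looks
like in C. This is not the Encode source, just a sketch of the same policy.
The names put_be16 and utf16_encode_be_bom are invented for this example.
Because each code unit is written most significant byte first, the output
is identical on LE and BE hosts, which is where the simplification comes
from.

```c
#include <stddef.h>
#include <stdint.h>

/* Append one 16-bit code unit, most significant byte first. */
static size_t put_be16(unsigned char *out, size_t pos, uint16_t u)
{
    out[pos]     = (unsigned char)(u >> 8);
    out[pos + 1] = (unsigned char)(u & 0xFF);
    return pos + 2;
}

/*
 * Hypothetical helper: encode n code points as UTF-16 with a big-endian
 * BOM, whatever the host byte order is.
 * Returns the number of bytes written, or 0 if out is too small or a
 * code point is not encodable (a surrogate or above U+10FFFF).
 */
static size_t utf16_encode_be_bom(const uint32_t *cps, size_t n,
                                  unsigned char *out, size_t cap)
{
    size_t pos = 0, i;

    if (cap < 2)
        return 0;
    pos = put_be16(out, pos, 0xFEFF);          /* BOM: bytes FE FF */

    for (i = 0; i < n; i++) {
        uint32_t cp = cps[i];

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        if (cp < 0x10000) {
            if (cap - pos < 2)
                return 0;
            pos = put_be16(out, pos, (uint16_t)cp);
        } else {
            /* Outside the BMP: split into a surrogate pair. */
            if (cap - pos < 4)
                return 0;
            cp -= 0x10000;
            pos = put_be16(out, pos, (uint16_t)(0xD800 | (cp >> 10)));
            pos = put_be16(out, pos, (uint16_t)(0xDC00 | (cp & 0x3FF)));
        }
    }
    return pos;
}
```

A reader that honors the BOM gets the right text on any machine, so the
missing little-endian option only matters to consumers that ignore it.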

Dan the BOMeed Man

This archive was generated by hypermail 2.1.2 : Fri Apr 12 2002 - 12:52:42 EDT