Re: Byte Order Marks

From: DougEwell2@cs.com
Date: Tue Apr 10 2001 - 11:50:30 EDT


In a message dated 2001-04-10 3:04:09 Pacific Daylight Time,
tomas.mcguinness@cmg.nl writes:

> When looking at a document would it be safe to assume that if you found any
> of the following Byte Order Marks
> * 0xFFFE (UCS-2 Little Endian)
> * 0xFEFE (UCS-2 Big Endian)

should be 0xFEFF

> * 0xEFBBBF (UTF-8)
> That the document is encoded with that encoding format. That means that if I
> found the first 3 octets to be EF BB BF could I assume I am dealing with a
> UTF-8 Document.

That is usually a safe assumption and a good practice, except that if the
first two bytes are 0xFF 0xFE, you should check the next two to see if they
are 0x00 0x00 (which would signify little-endian UCS-4).
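A minimal sketch of that check in Python, for illustration only. The function
name sniff_signature and the table _SIGNATURES are hypothetical; the BOM
constants come from the standard codecs module. The four-byte signatures are
tested first so that FF FE 00 00 is not reported as UTF-16. This is a
heuristic: the same four bytes could in principle be a little-endian UTF-16
BOM followed by U+0000, and a file with no signature at all tells you nothing.

```python
import codecs

# Longer signatures come first, so FF FE 00 00 (UCS-4/UTF-32 little-endian)
# is matched before the two-byte FF FE (UTF-16 little-endian) prefix.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_signature(data: bytes):
    """Return (encoding_name, signature_length), or (None, 0) if no match.

    Hypothetical helper: it only inspects the leading bytes and makes no
    attempt to validate the rest of the document.
    """
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    sample = b"\xef\xbb\xbfhello"
    encoding, skip = sniff_signature(sample)
    if encoding is not None:
        # Decode the remainder, skipping the signature bytes.
        print(encoding, sample[skip:].decode(encoding))
```

The returned length lets the caller skip the signature before decoding the
rest of the data.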

Also, think in terms of UTF-16, not UCS-2.

> Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
> character sets use Byte Order Marks?

Good question. I have not heard of any.

To follow up, what about signatures that are not necessarily byte order
marks? UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful
for the purpose Tomás mentioned, to indicate the encoding. Do any other
character sets have such signatures?

-Doug Ewell
 Fullerton, California


