In a message dated 2001-04-10 3:04:09 Pacific Daylight Time, 
tomas.mcguinness@cmg.nl writes:
>  When looking at a document would it be safe to assume that if you found any
>  of the following Byte Order Marks 
>  *    0xFFFE (UCS-2 Little Endian)
>  *    0xFEFE (UCS-2 Big Endian)
That should be 0xFEFF.
>  *    0xEFBBBF (UTF-8)
>  That the document is encoded with that encoding format. That means that if I
>  found the first 3 octets to be EF BB BF could I assume I am dealing with a
>  UTF-8 Document.
That is usually a safe assumption and a good practice, except that if the 
first two bytes are 0xFF 0xFE, you should check the next two to see if they 
are 0x00 0x00 (which would signify little-endian UCS-4).
Also, think in terms of UTF-16, not UCS-2.
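To make the order of the checks concrete, here is a minimal sketch in Python. The function name sniff_bom and the returned encoding labels are my own choices for illustration, not anything from a standard; the BOM constants come from Python's codecs module. The point is only that the four-byte signatures must be tested before the two-byte ones, so that FF FE 00 00 is not taken for little-endian UTF-16.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UCS-4/UTF-32 little-endian) starts
# with the same two bytes as the UTF-16 little-endian BOM, FF FE.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding_label, signature_length), or (None, 0) if no signature.

    Hypothetical helper: it only inspects the leading bytes and makes no
    attempt to validate the rest of the document.
    """
    for bom, label in _SIGNATURES:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))
    print(sniff_bom(b"\xff\xfeA\x00"))
```

Note that this is still a guess: a UTF-16 little-endian document whose first character after the BOM is U+0000 would also begin FF FE 00 00, and a document with no signature at all tells you nothing this way.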
>  Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
>  character sets use Byte Order Marks?
Good question.  I have not heard of any.
To follow up, what about signatures that are not necessarily byte order 
marks?  UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful 
for the purpose Tomás mentioned, to indicate the encoding.  Do any other 
character sets have such signatures?
-Doug Ewell
 Fullerton, California