Re: BOM ambiguity? from Doug Ewell on 2012-07-14 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Sat, 14 Jul 2012 14:50:06 -0600

Stephan Stiller wrote:

> With that in mind, there is value in documenting, however briefly,
> that reading FF FE 00 00 is by itself technically ambiguous.

I have seen this documented many times, though I can't say for sure that
it was in official Unicode literature.
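
For anyone who wants to see the ambiguity concretely, here is a minimal Python sketch. It relies only on the standard "utf-16" and "utf-32" codecs, which consume a leading BOM; the variable names are just for illustration.

```python
# The same four bytes, read two ways.
data = b"\xff\xfe\x00\x00"

# As UTF-16: FF FE is the little-endian BOM, and 00 00 is a single U+0000.
as_utf16 = data.decode("utf-16")

# As UTF-32: FF FE 00 00 is the little-endian BOM, followed by no text.
as_utf32 = data.decode("utf-32")

print(repr(as_utf16))  # one NUL character
print(repr(as_utf32))  # empty string
```

Both decodes succeed, so the bytes alone do not say which reading was intended.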

Even though you can never flat-out guarantee that a plain-text
application won't use U+0000, the fact is that very few do. And UTF-32
files are almost never seen outside of laboratory environments. So
you're probably safe in assuming that a file starting with FF FE 00 00
is little-endian UTF-32, and that a file starting with any other
FF FE xx xx is little-endian UTF-16. If you want more assurance than
that, apply a "halfway decent heuristic" like this:

For a file to be little-endian UTF-32, the file size must be a multiple
of 4, and for each 4-byte chunk <aa bb cc dd>:

• aa bb must not be FE FF or FF FF (noncharacters U+xxFFFE, U+xxFFFF)
• cc must not be 11 through FF (nothing exists above U+10FFFF)
• dd must be 00 (same reason)
• (add your own checks)
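
The checks above, together with the simpler BOM rule, could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names looks_like_utf32le and guess_le_encoding are made up for this example, the whole file is assumed to be in memory as bytes, and none of the "add your own checks" (surrogates, for instance) are included.

```python
from typing import Optional


def looks_like_utf32le(data: bytes) -> bool:
    """Apply the per-chunk checks listed above (hypothetical helper)."""
    # The file size must be a multiple of 4.
    if len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        aa, bb, cc, dd = data[i:i + 4]
        # dd must be 00.
        if dd != 0x00:
            return False
        # cc must not be 11 through FF.
        if cc > 0x10:
            return False
        # aa bb must not be FE FF or FF FF.
        if bb == 0xFF and aa in (0xFE, 0xFF):
            return False
    return True


def guess_le_encoding(data: bytes) -> Optional[str]:
    """Guess between little-endian UTF-16 and UTF-32 from an FF FE BOM.

    Returns None when the data does not start with FF FE at all.
    """
    if not data.startswith(b"\xff\xfe"):
        return None
    # FF FE 00 00 suggests UTF-32; confirm with the chunk checks.
    if data[2:4] == b"\x00\x00" and looks_like_utf32le(data):
        return "utf-32-le"
    # Any other FF FE xx xx is taken to be UTF-16.
    return "utf-16-le"


utf32_sample = b"\xff\xfe\x00\x00" + "hi".encode("utf-32-le")
utf16_sample = b"\xff\xfe" + "hi".encode("utf-16-le")
utf16_with_nul = b"\xff\xfe" + "\x00AB".encode("utf-16-le")

print(guess_le_encoding(utf32_sample))
print(guess_le_encoding(utf16_sample))
print(guess_le_encoding(utf16_with_nul))
```

It is still only a heuristic. A UTF-16 file whose text is U+0000, "A", U+0000 has the bytes FF FE 00 00 41 00 00 00 and passes every check, so it would be guessed as UTF-32.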

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell

Received on Sun Jul 15 2012 - 20:20:24 CDT
