Stephan Stiller wrote:
> With that in mind, there is value in documenting, however briefly,
> that reading FF FE 00 00 is by itself technically ambiguous.
I have seen this documented many times, though I can't say for sure that
it was in official Unicode literature.
Even though you can never flat-out guarantee that a plain-text
application won't use U+0000, the fact is that very few do. And UTF-32
files are almost never seen outside of laboratory environments. So
you're probably safe in assuming that FF FE 00 00 is little-endian
UTF-32 and any other FF FE xx xx is little-endian UTF-16. If you
want more assurance than that, apply a "halfway decent heuristic" like
this:
For a file to be little-endian UTF-32, the file size must be a multiple
of 4, and for each 4-byte chunk <aa bb cc dd>:
• aa bb must not be FE FF or FF FF (noncharacters U+xxFFFE and U+xxFFFF)
• cc must not be 11 through FF (code points stop at U+10FFFF)
• dd must be 00
• (add your own checks)
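
For anyone who wants to try it, here is a minimal Python sketch of the checks above, not a definitive implementation. The function names looks_like_utf32le and guess_le_encoding are made up for illustration, and the surrogate test is just one possible "own check" that is not part of the list above. It assumes the whole file is already in memory as bytes.

```python
from typing import Optional


def looks_like_utf32le(data: bytes) -> bool:
    """Apply the per-chunk checks to every 4-byte chunk <aa bb cc dd>."""
    if len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        aa, bb, cc, dd = data[i:i + 4]
        if (aa, bb) in ((0xFE, 0xFF), (0xFF, 0xFF)):
            return False  # noncharacter U+xxFFFE or U+xxFFFF
        if cc > 0x10:
            return False  # would be beyond U+10FFFF
        if dd != 0x00:
            return False
        # Extra check (an assumption, not in the original list):
        # surrogate code points U+D800..U+DFFF are not valid in UTF-32.
        if cc == 0x00 and 0xD8 <= bb <= 0xDF:
            return False
    return True


def guess_le_encoding(data: bytes) -> Optional[str]:
    """Guess between little-endian UTF-32 and UTF-16 from the BOM."""
    if not data.startswith(b"\xff\xfe"):
        return None  # no little-endian BOM; out of scope for this sketch
    if data[2:4] == b"\x00\x00" and looks_like_utf32le(data):
        return "utf-32-le"
    return "utf-16-le"
```

A file that is only FF FE 00 00 still comes back as "utf-32-le", so the sketch narrows the ambiguity without removing it.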
--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Received on Sun Jul 15 2012 - 20:20:24 CDT