Re: BOM ambiguity? from Stephan Stiller on 2012-07-13 (Unicode Mail List Archive)

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 13 Jul 2012 19:32:49 -0700

>> So there is a BOM-ambiguity when a file starts with
>> FF FE
>> and then a couple of U+0000 characters, yes? Because this could be
>> either UTF-16 or UTF-32 under little-endianness. Has this been
>> pointed out and discussed beforehand?
>
> No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
> in your question concerning the meaning of "a file" and its contents.
>
> If "a file" is a byte stream interpreted as an LE Unicode 16-bit
> string, then:
> FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+0000, U+0482, U+0001>
>
> If "a file" is a byte stream interpreted as an LE Unicode 32-bit
> string, then:
> FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+10482>
>
> [...]

I appreciate the input, but I think it's not that simple. There are a
number of contexts where I know for sure that a file is a text file, and
I either know that it's Unicode or assume that it is because it starts
with one of the common byte sequences of the BOM.

With that in mind, there is value in documenting, however briefly, that
a file beginning with FF FE 00 00 is by itself technically ambiguous,
because a lot of software developers might not want to think much about
such things and would rather be told. I wish I could comment more here
about how it's done in practice, but I can't, because I don't even know
what the file format heuristics of various editors and Unix tools look
like; they're usually not documented.
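
To make the ambiguity concrete, here is a minimal Python sketch using the
byte sequence from the quoted example. It only shows that both the
UTF-16LE and the UTF-32LE BOM match the same prefix and that both
decodings succeed; it does not describe what any particular editor or
Unix tool does. The names candidate_encodings and code_points are
hypothetical helpers introduced for illustration.

```python
import codecs

# The byte sequence from the quoted example: FF FE 00 00 82 04 01 00
SAMPLE = bytes.fromhex("FFFE000082040100")

# BOM byte sequences from the standard library, longest first.
BOMS = [
    ("utf-32-le", codecs.BOM_UTF32_LE),  # FF FE 00 00
    ("utf-32-be", codecs.BOM_UTF32_BE),  # 00 00 FE FF
    ("utf-8", codecs.BOM_UTF8),          # EF BB BF
    ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
    ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
]


def candidate_encodings(data: bytes) -> list[str]:
    """Return every encoding whose BOM is a prefix of data (hypothetical helper)."""
    return [name for name, bom in BOMS if data.startswith(bom)]


def code_points(data: bytes, encoding: str) -> list[str]:
    """Decode data and list its code points in U+XXXX notation.

    The explicit -le/-be codecs keep the BOM as U+FEFF instead of
    stripping it, which matches the notation in the quoted example.
    """
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]


if __name__ == "__main__":
    for name in candidate_encodings(SAMPLE):
        print(name, code_points(SAMPLE, name))
```

Both candidates decode without error, so a detector that looks only at
the BOM has to break the tie by some policy of its own, for example by
preferring the longer match or by inspecting more of the data.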

Stephan
Received on Fri Jul 13 2012 - 21:35:14 CDT