>> So there is a BOM-ambiguity when a file starts with
>>     FF FE
>> and then a couple of U+0000 characters, yes? Because this could be 
>> either UTF-16 or UTF-32 under little-endianness. Has this been 
>> pointed out and discussed beforehand?
>
> No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
> in your question concerning the meaning of "a file" and its contents.
>
> If "a file" is a byte stream interpreted as an LE Unicode 16-bit 
> string, then:
> FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+0000, U+0482, U+0001>
>
> If "a file" is a byte stream interpreted as an LE Unicode 32-bit 
> string, then:
> FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+10482>
>
> [...]
I appreciate the input, but I don't think it's that simple. There are a 
number of contexts where I know for sure that a file is a text file, and 
where I either know that it's Unicode or assume so because it starts 
with one of the common byte sequences of the BOM.
With that in mind, there is value in documenting, however briefly, that 
the leading bytes FF FE 00 00 are by themselves technically ambiguous: 
they are either the UTF-32LE BOM, or the UTF-16LE BOM followed by 
U+0000. A lot of software developers would rather be told this than have 
to work it out themselves. I wish I could say more about how it's 
handled in practice, but I can't: I don't know what the file format 
heuristics of various editors and Unix tools look like, because they are 
usually not documented.
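
To make the point concrete, here is a rough Python sketch of what I 
mean. It decodes the example bytes from the quoted message both ways, 
and then runs a naive BOM check that reports every signature matching 
the start of the data. The helper name sniff_bom is my own invention for 
illustration; I am not claiming that any particular editor or Unix tool 
works this way.

```python
import codecs

# The example bytes from the quoted message.
data = bytes.fromhex("FF FE 00 00 82 04 01 00")

# Same bytes, two readings. The explicit little-endian codecs do not
# strip the BOM, so U+FEFF stays visible as the first code point.
as_utf16 = data.decode("utf-16-le")
as_utf32 = data.decode("utf-32-le")

def sniff_bom(prefix):
    """Return every encoding whose BOM matches the start of `prefix`.

    Hypothetical helper: a BOM-only check, with no further heuristics.
    """
    boms = [
        ("utf-8", codecs.BOM_UTF8),
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    ]
    return [name for name, bom in boms if prefix.startswith(bom)]

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in as_utf16])
    print([f"U+{ord(c):04X}" for c in as_utf32])
    # More than one candidate means the BOM alone does not decide.
    print(sniff_bom(data))
```

A tool that wants a single answer has to pick a tie-break rule, for 
example preferring the longer signature, and that choice is exactly the 
kind of thing I would like to see written down somewhere.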
Stephan
Received on Fri Jul 13 2012 - 21:35:14 CDT