From: Andrew West (andrewcwest@gmail.com)
Date: Mon Dec 28 2009 - 05:33:33 CST
2009/12/28 Doug Ewell <doug@ewellic.org>:
>
> Ā U+0100 LATIN CAPITAL LETTER A WITH MACRON
> in UTF-32BE: { 00 00 01 00 }
> in UTF-32LE: { 00 01 00 00 }
>
> 𐀀 U+10000 LINEAR B SYLLABLE B008 A
> in UTF-32BE: { 00 01 00 00 }
> in UTF-32LE: { 00 00 01 00 }
>
> Naturally you wouldn't have a whole string of these in real life, so the
> heuristic would work.
You can't make that assumption. Linear B users are much more likely to
use UTF-32 than other users, so a string of the above byte sequences
may be more likely to be a string of LINEAR B SYLLABLE B008 A
characters even though that is a far rarer character than LATIN
CAPITAL LETTER A WITH MACRON. So I can't see how the heuristic would
be able to know whether it was big-endian or little-endian in this
case.
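
To make the ambiguity concrete, here is a minimal Python sketch (the variable names are purely illustrative). It relies only on Python's standard `utf-32-le` and `utf-32-be` codecs, which neither write nor expect a BOM, and shows that the two encodings of the two characters are byte-for-byte mirror images of each other:

```python
# Four copies of each character from the example above.
latin = "\u0100" * 4         # LATIN CAPITAL LETTER A WITH MACRON
linear_b = "\U00010000" * 4  # LINEAR B SYLLABLE B008 A

# The BOM-less encodings coincide crosswise.
assert latin.encode("utf-32-le") == linear_b.encode("utf-32-be")
assert latin.encode("utf-32-be") == linear_b.encode("utf-32-le")

# So the same bytes decode cleanly under either byte order,
# giving two different but equally well-formed strings.
raw = latin.encode("utf-32-le")
as_le = raw.decode("utf-32-le")
as_be = raw.decode("utf-32-be")
assert as_le == latin
assert as_be == linear_b
```

Nothing in the bytes themselves favours one reading over the other; any preference has to come from outside the data.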
I've just tested the scenario with BabelPad and it autodetects a
string of U+0100 characters saved as UTF-32LE with no BOM as UTF-32BE
(i.e. a string of U+10000 characters), and autodetects a string of
U+10000 characters saved as UTF-32LE with no BOM as UTF-32BE (i.e. a
string of U+0100 characters). Has the heuristic failed? Probably,
because on Windows, all things being equal, little-endian should be
assumed rather than big-endian. (Of course, once you add a CR/LF to
the file the heuristic correctly autodetects both files as UTF-32LE.)
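
As a rough illustration of the behaviour described above, here is one way such a heuristic could be sketched in Python. This is not BabelPad's actual algorithm, which is not described here; the function name `guess_utf32_byte_order` and its `default` parameter are hypothetical. The sketch assumes the only evidence used is whether the data decodes to valid code points in each byte order, falling back to a platform default (little-endian here) when both succeed:

```python
def guess_utf32_byte_order(data: bytes, default: str = "le") -> str:
    """Guess the byte order of BOM-less UTF-32 data ("le" or "be").

    Hypothetical sketch: a byte order is rejected if decoding yields a
    value outside the Unicode range or a surrogate. When both orders
    decode cleanly, the caller-supplied default wins.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")

    valid = []
    for order in ("le", "be"):
        try:
            data.decode("utf-32-" + order)
        except UnicodeDecodeError:
            continue  # this byte order produces an invalid code point
        valid.append(order)

    if not valid:
        raise ValueError("data is not valid UTF-32 in either byte order")
    if len(valid) == 1:
        return valid[0]
    return default  # ambiguous: fall back to the assumed platform order


# Ambiguous: valid both ways, so the default decides.
ambiguous = ("\u0100" * 3).encode("utf-32-le")
guess_le_default = guess_utf32_byte_order(ambiguous)
guess_be_default = guess_utf32_byte_order(ambiguous, default="be")

# With CR/LF appended, the wrong byte order turns U+000D into
# 0x0D000000, which is outside the Unicode range, so only one
# reading survives.
with_crlf = ("\u0100" * 3 + "\r\n").encode("utf-32-le")
guess_crlf = guess_utf32_byte_order(with_crlf, default="be")
```

Under these assumptions a single CR or LF is enough to settle the question, because its byte-swapped value cannot be a valid code point.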
Andrew