From: Doug Ewell (dewell@adelphia.net)
Date: Mon Jan 12 2004 - 11:39:05 EST
Marco Cimarosti <marco dot cimarosti at essetre dot it> wrote:
>> In UTF-16 practically any sequence of bytes is valid, and since you
>> can't assume you know the language, you can't employ distribution
>> statistics. Twelve years ago, when most text was not Unicode and all
>> Unicode text was UTF-16, Microsoft documentation suggested a
>> heuristic of checking every other byte to see if it was zero, which
>> of course would only work for Latin-1 text encoded in UTF-16.
>
> I beg to differ. IMHO, analyzing zero bytes is a viable method for
> detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't
> quite care) that this method was suggested first by Microsoft: to me,
> it seems quite self-evident.
I was referring specifically to the technique of checking every other
byte for zero, not checking whether there were zeros at all. Certainly,
if your only choices are UTF-16 and encodings that do not use zero
bytes, the first zero byte answers the question.
This is a specific case of the "ME state" described by Li and Momoi: a
byte sequence (here the single byte 0x00) is found that could belong to
only one of the candidate encodings.
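To make the distinction concrete, here is a minimal sketch of the two
tests (my own illustration, in Python; this is not code from the
Microsoft documentation, and the function names are invented):

    def every_other_byte_zero(data: bytes) -> bool:
        # Microsoft's old heuristic: in UTF-16LE, Latin-1 text puts a
        # 0x00 high byte at every odd offset. It fails the moment the
        # text contains any character above U+00FF.
        return len(data) >= 2 and all(b == 0 for b in data[1::2])

    def contains_zero_byte(data: bytes) -> bool:
        # The weaker but more general test: a single 0x00 anywhere
        # rules out every candidate encoding that never uses zero
        # bytes, leaving UTF-16/UTF-32 (or binary data) -- an "ME
        # state" in Li and Momoi's terms.
        return 0 in data

For example, U+20AC U+0031 (a euro sign followed by "1") encodes in
UTF-16LE as AC 20 31 00: the every-other-byte test fails (the byte at
offset 1 is 0x20), but the zero-byte test still fires.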
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so
> the presence of zero bytes is a strong enough hint for UTF-16 (or
> UCS-2) or UTF-32.
Jon Hanna and Peter Kirk responded that U+0000 could occur in specific
types of text files used by certain applications, or in markup formats.
But it seems reasonable that in such cases, the process reading the file
would already know what format to expect.
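In the same spirit, the positions of the zero bytes even hint at which
wide encoding (and which byte order) is in use. Again, this is my own
illustrative sketch with arbitrary rules, not a production detector:

    def guess_wide_encoding(data: bytes) -> str:
        # Mostly-ASCII text leaves characteristic zero patterns:
        # UTF-32 yields runs of three zero bytes per character, while
        # UTF-16LE puts zeros at odd offsets and UTF-16BE at even
        # ones.
        if b'\x00\x00\x00' in data:
            return 'probably UTF-32'
        odd = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        even = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        if odd > even:
            return 'probably UTF-16LE'
        if even > odd:
            return 'probably UTF-16BE'
        return 'no verdict'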
> Of course, all this works only if the basic assumption that the file
> is a plain text file holds true: this method is not quite enough to
> tell text files apart from binary files.
Of course.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/