From: jon@hackcraft.net
Date: Mon Jan 12 2004 - 06:55:29 EST
Quoting Marco Cimarosti <marco.cimarosti@essetre.it>:
> Doug Ewell wrote:
> > In UTF-16 practically any sequence of bytes is valid, and since you
> > can't assume you know the language, you can't employ distribution
> > statistics. Twelve years ago, when most text was not Unicode and all
> > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> > of checking every other byte to see if it was zero, which of course
> > would only work for Latin-1 text encoded in UTF-16.
>
> I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
> BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
> this method was suggested first by Microsoft: to me, it seems quite
> self-evident.
>
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so the
> presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> UTF-32.
False positives can be caused by U+0000 (most often encoded as a single 0x00
octet), which some applications do use in text files. Hence you need to look
for sequences where a null octet occurs at every other position, which
increases the risk of false negatives:
False negatives can be caused by text that doesn't contain any characters from
the Latin-1 range (U+0000..U+00FF), since only those code points have a zero
high octet in UTF-16.
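To make that concrete, here is a minimal sketch in Python; the function name,
sample size and thresholds are my own illustrative choices, not anything
standardised:

    def sniff_utf16(data, sample_size=4096):
        """Guess BOM-less UTF-16 byte order from the pattern of null octets.

        Hypothetical sketch: assumes the text is mostly Latin-1-range
        characters, so one octet of each 16-bit code unit is zero.
        Returns 'utf-16-le', 'utf-16-be', or None.
        """
        sample = data[:sample_size]
        if len(sample) < 2:
            return None
        even = sample[0::2]   # octets at even offsets
        odd = sample[1::2]    # octets at odd offsets
        even_nulls = even.count(0) / len(even)
        odd_nulls = odd.count(0) / len(odd)
        # Thresholds are arbitrary; a stray U+0000 in a single-byte file
        # puts nulls at scattered offsets and fails both tests below.
        if odd_nulls > 0.9 and even_nulls < 0.1:
            return 'utf-16-le'   # low octet first; zero high octets fall on odd offsets
        if even_nulls > 0.9 and odd_nulls < 0.1:
            return 'utf-16-be'   # high (zero) octet first, on even offsets
        return None

Note that looking at the two phases separately is what filters out the
false-positive case above: isolated U+0000 octets don't line up on a single
phase, so neither test fires.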
The method can be used reliably with text files that are guaranteed to contain
a large proportion of Latin-1-range characters - in particular files for which
certain ASCII characters are given an application-specific meaning; for
instance XML and HTML files, comma-delimited files, tab-delimited files,
vCards and so on. It can be particularly reliable in cases where certain ASCII
characters will always begin the document (e.g. XML), as in the sketch below.
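For formats with a fixed opening, the first few octets are enough. Here is a
sketch along the lines of Appendix F of the XML 1.0 specification; the table
and names are illustrative, not from any library:

    # '<?xml' has a fixed, encoding-dependent octet pattern at offset 0.
    XML_SIGNATURES = [
        (b'\x00\x00\x00<', 'utf-32-be'),
        (b'<\x00\x00\x00', 'utf-32-le'),
        (b'\x00<\x00?', 'utf-16-be'),
        (b'<\x00?\x00', 'utf-16-le'),
        (b'<?xml', 'utf-8'),  # or any other ASCII-compatible encoding
    ]

    def sniff_xml_encoding(data):
        """Guess the encoding family of a BOM-less XML document
        from its first octets. Returns None if nothing matches."""
        for signature, encoding in XML_SIGNATURES:
            if data.startswith(signature):
                return encoding
        return None

The same idea carries over to HTML's leading '<', vCard's 'BEGIN:VCARD' and so
on: a known ASCII prefix pins down exactly where the zero octets must fall.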
-- 
Jon Hanna <http://www.hackcraft.net/>
*Thought provoking quote goes here*