From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 13:36:12 CDT
Philippe Verdy writes:
> Note that the statistics also depend on the language actually used. The
> statistics for English will be quite different with Italian or French, and
> in some cases it will be hard to decide between ISO-8859-1 and ISO-8859-2
> for some Nordic or Baltic languages.
I've found that English is the "Great Corrupter" when it comes to
training these things: not only are English words found everywhere,
but English has borrowed (or had the "borrowing" thrust upon it, not
that I'm bitter or anything ;-) so much from Germanic and Romance
languages over the last 1000 years that English can be easily confused
with French, Italian, or Dutch. Again, in my experience.
> - some webservers are labelling all pages with ISO-8859-1 even though it is
> another encoding or a UTF. Encoding exceptions are detected by the fact that
> HTML does not allow using some controls (but Internet Explorer silently
> accepts C1 controls in ISO-8859-1 as if they were in fact valid Windows-1252
> characters)
This happens *all* the time. I constantly encounter pages that are
labeled as ISO-8859-1 (actually usually CP1252) and indeed, if you
just look at the byte values, are valid Latin 1 (or even just
US-ASCII). However, the content is encoded in HTML escapes, and is
actually Arabic or Persian. Hence you have to do the detection in a
couple of steps, since the presence of these entities (remember, an
X?HTML page can include any character regardless of the declared
"primary" encoding) opens up all of Unicode. A workable heuristic runs
along these lines: "If the page says it is (or detects as) Latin 1 (or
some variant of it), and it has some largish number of contiguous HTML
entities, transcode the whole thing into Unicode with the SGML entities
expanded, then run your language ID again." This assumes, of course,
that you are interested in identifying the language; doing so is almost
necessary if you want to differentiate the ISO-8859-n versions.
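
A minimal sketch of that heuristic in Python, under stated assumptions:
the function name, the "four or more back-to-back entities" definition
of a run, and the threshold of three runs are all hypothetical choices
made for illustration, not tuned values. The language identifier itself
is not shown; the caller would re-run it on the returned text.

```python
import html
import re

# A "run" here is 4 or more character references back to back.
# Both the run length and the run-count threshold are illustrative guesses.
ENTITY_RUN = re.compile(
    r"(?:&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);){4,}"
)


def expand_if_entity_heavy(raw, declared="cp1252", min_runs=3):
    """Decode a page labeled as Latin 1 / CP1252 and, if it contains at
    least `min_runs` runs of contiguous entities, expand the entities.

    Returns (text, expanded). When `expanded` is True the caller should
    run language identification again on `text`.
    """
    text = raw.decode(declared, errors="replace")
    runs = ENTITY_RUN.findall(text)
    if len(runs) >= min_runs:
        # html.unescape resolves numeric and named character references.
        return html.unescape(text), True
    return text, False


# Pure US-ASCII bytes that actually carry an Arabic word three times.
page = (b"<p>&#1605;&#1585;&#1581;&#1576;&#1575; "
        b"&#1605;&#1585;&#1581;&#1576;&#1575; "
        b"&#1605;&#1585;&#1581;&#1576;&#1575;</p>")
text, expanded = expand_if_entity_heavy(page)
```

A page with only the odd isolated entity, such as "Fish &amp; chips",
falls below the threshold and is returned with its entities untouched.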
> - and some servers are labelling all with UTF-8 despite the texts are
> encoded with ISO-8859-1 (Exceptions occur when the UTF-8 encoding
> requirements are not respected within the document body, so if there is no
> leading BOM, IE tries to guess an alternate charset or displays square
> boxes, depending on user preferences or manual selection in the browser).
I've also seen misconfigured Apache 2 servers sending HTTP response
headers with a different encoding than that specified in the page,
usually to the detriment of all involved.
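
Spotting that kind of disagreement is easy to sketch, assuming you
already have the Content-Type header value and the first chunk of the
body in hand. The function names and the 1024-byte scan window below
are hypothetical, and the regular expression is a rough stand-in for a
real HTML parser.

```python
import codecs
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form with
# content="text/html; charset=...". Deliberately loose.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def _canon(name):
    """Map aliases (latin1, ISO-8859-1, ...) to Python's codec name."""
    if name is None:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def charset_conflict(content_type, body, scan=1024):
    """Return (header_charset, meta_charset) if both are present and
    disagree, otherwise None."""
    msg = Message()
    msg["Content-Type"] = content_type
    from_header = _canon(msg.get_content_charset())

    match = META_CHARSET.search(body[:scan])
    from_meta = _canon(match.group(1).decode("ascii")) if match else None

    if from_header and from_meta and from_header != from_meta:
        return from_header, from_meta
    return None


conflict = charset_conflict(
    "text/html; charset=UTF-8",
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head>',
)
```

Which label to trust once they disagree is a separate question; this
only flags the mismatch so the detection steps can take over.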
-tree
-- 
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)