From: Doug Ewell (doug@ewellic.org)
Date: Mon Feb 08 2010 - 20:43:57 CST
Mark Davis ☸ wrote:
> There are really two methodologies in question.
>
> 1. Accept the charset tagging without question.
> 2. Use charset detection, which uses a number of signals. The primary
> signal is a statistical analysis of the bytes in the document, but the
> charset tagging is taken into account (and can sometimes make a
> difference).
>
> The issue is which of these, on balance, produces better results for
> web pages and other documents. With fairly exhaustive side-by-side
> comparisons across encodings, it is clear that #2 does, overwhelmingly.
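[A minimal sketch of methodology #2 in Python, assuming the chardet
library supplies the statistical byte analysis; chardet.detect() is a
real call, but letting the tag win only below a 0.5 confidence
threshold is an illustrative assumption, not how any particular
detector actually weighs its signals:

    import chardet

    def sniff_charset(data, declared=None):
        # Primary signal: statistical analysis of the bytes.
        guess = chardet.detect(data)
        detected, confidence = guess["encoding"], guess["confidence"]
        # Secondary signal: the charset tag. Here it may override only
        # when the statistical evidence is weak (0.5 is illustrative).
        if declared and (detected is None or confidence < 0.5):
            return declared
        return detected or declared or "windows-1252"

So sniff_charset(body, declared=http_charset) would follow the
statistics for a clearly UTF-8 page even if the header said 8859-1.]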
What about option 1½: use charset detection, assisted by the charset
tagging. That is, if the content is valid UTF-8 or UTF-16, or something
else unambiguous like GB18030, ignore the tagging and trust the
detection algorithm fully. But if the algorithm shows that the content
could reasonably be any of ISO 8859-1, -2, or -15, and it is tagged as
8859-2, trust the tag. Just a thought.
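[A sketch of that option 1½ under the same chardet assumption. Strict
decoding is a real validity test, and chardet.detect_all() really does
return candidates ranked by confidence, but the 0.1 "reasonably close"
gap is an invented tiebreak threshold:

    import chardet

    def hybrid_charset(data, tag=None):
        # Valid UTF-8 is a near-unambiguous signal: trust the
        # detection fully and ignore the tag.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        # Otherwise rank the remaining candidates by confidence.
        candidates = chardet.detect_all(data)
        best = candidates[0]
        # If the tag names one of the statistically close candidates
        # (say, several 8859 variants), trust the tag.
        close = [c["encoding"].lower() for c in candidates
                 if c["encoding"]
                 and best["confidence"] - c["confidence"] < 0.1]
        if tag and tag.lower() in close:
            return tag
        return best["encoding"] or tag or "windows-1252"

Note that only UTF-8 is tested by strict decoding here: GB18030 (and
BOM-less UTF-16) will happily decode almost any byte stream, so an
"unambiguous" test for those needs real heuristics, not a try/except.]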
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ http://is.gd/2kf0s