From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Feb 08 2010 - 13:06:15 CST
It is unclear exactly what point you are trying to make.
There are really two methodologies in question.
1. Accept the charset tagging without question.
2. Use charset detection, which uses a number of signals. The primary
signal is a statistical analysis of the bytes in the document, but the
charset tagging is taken into account (and can sometimes make a difference).
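
To make the difference between the two methodologies concrete, here is a minimal sketch of approach #2 in Python. It is an illustration only, not the detector discussed in this thread. The candidate list, the Polish letter set used as the "statistical" signal, and the names detect_charset, statistical_score and declared_bonus are all assumptions made for the example. A real detector would use much richer byte and language statistics.

```python
import codecs

# Hypothetical candidate list; a real detector would consider many more encodings.
CANDIDATES = ("utf-8", "iso-8859-1", "iso-8859-2", "iso-8859-15")

# Assumption: the text is expected to be Polish, so these letters count as
# evidence of a correct decoding. This stands in for real statistical models.
EXPECTED_LETTERS = frozenset("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def canonical(name):
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(name).name
    except (LookupError, TypeError):
        return None


def statistical_score(text):
    """+1 for each expected non-ASCII letter, -1 for any other non-ASCII character."""
    return sum(1 if ch in EXPECTED_LETTERS else -1 for ch in text if ord(ch) > 127)


def detect_charset(data, declared=None, declared_bonus=2):
    """Pick the candidate with the best score; the declared charset is one signal."""
    declared_name = canonical(declared) if declared else None
    best, best_score = None, None
    for candidate in CANDIDATES:
        try:
            text = data.decode(candidate)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = statistical_score(text)
        if declared_name is not None and canonical(candidate) == declared_name:
            score += declared_bonus  # the tag nudges the result but does not decide it
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best


# Polish text sent as ISO-8859-2 but mislabelled as ISO-8859-1.
sample = "Zażółć gęślą jaźń".encode("iso-8859-2")
print(detect_charset(sample, declared="iso-8859-1"))  # iso-8859-2

# Pure ASCII gives no statistical evidence, so the declaration decides.
print(detect_charset(b"plain ASCII text", declared="iso-8859-15"))  # iso-8859-15
```

Approach #1 would amount to returning the declared value unchanged. In the sketch, the declaration only adds a small bonus, so it decides the outcome when the bytes themselves give little evidence.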
The issue is which of these, on balance, produces better results for
web pages and other documents. With pretty exhaustive side-by-side
comparisons of encodings, it is clear that #2 does, overwhelmingly.
Of course, the less the contents of the document look like "real text", the
more likely that #2 will produce incorrect results. But we have to look
at the approach that produces the best results overall.
Mark
On Mon, Feb 8, 2010 at 09:40, Andreas Prilop <prilop4321@trashmail.net> wrote:
> On Fri, 29 Jan 2010, Mark Davis wrote upside-down:
>
> > It is encodings determined by a detection algorithm.
>
> This is so stupid!
>
> The results can be seen here:
> http://groups.google.co.uk/group/pl.test/msg/1fa7fa753aad46a2
>
> Special characters are often messed up in groups.google
> because your stupid algorithm takes ISO-8859-1 when the
> message is actually ISO-8859-2 or ISO-8859-15 or whatever.
>
> http://groups.google.co.uk/group/pl.test/msg/359af83289a00e8e
>
> > The declarations for encodings (and language)
> > are far too unreliable to be depended on.
>
> Unreliable is a guy who doesn't even know how to quote.
>
>