From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Jul 01 2010 - 16:49:07 CDT
On 7/1/2010 11:29 AM, John Burger wrote:
> Andreas Prilop wrote:
>
>>> The problem with slavishly following the charset parameter is
>>> that it is often incorrect.
>>
>> I wonder how you could draw such a conclusion. In order to make
>> such a statement, there must be some other (god-given?) parameter,
>> which is the "real charset".
>
>
> If you have never encountered a web page in which the charset
> parameter encoded in the page (or in the HTTP headers) did not
> accurately reflect the "real charset", as indicated by the actual data
> in the page, then your experience differs sharply from mine, and from
> everyone else I have ever met.
>
Let's unravel this.
First, there are qualitative vs. quantitative arguments. Yes, mis-tagging
occurs (for all the reasons Shawn gave in his reply). But Andreas' point
was that for languages needing more than ASCII, there's a nice
corrective: if many (most) viewers now base their display on the charset,
then more documents would be expected to be correctly tagged for those
types of text, because such text tends to degrade dramatically otherwise,
and users (authors) would take action to correct the situation. The
example of this is reading a text as 8859-1 when it is 8859-2 (Eastern
European).
This is different from the issue of selecting the correct charset when
the choice only affects some special symbols (copyright, punctuation
marks, the euro sign). In these cases, the text degrades in much more
subtle ways, and usually remains readable. I would expect the incidence
of mis-tagging in such a situation to be higher. The example for this is
reading a text as 8859-1 when it was 1252 (the Windows code page with
extra characters not in ISO 8859-1 - Shawn mentioned this case as well).
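To make the two degradation modes concrete, here is a small Python
sketch; the sample strings are my own illustrations, not taken from any
real page:

    # Case 1: ISO 8859-2 text read as ISO 8859-1 - dramatic degradation.
    polish = "za\u017c\u00f3\u0142\u0107"   # "zażółć", Polish letters
    raw = polish.encode("iso8859_2")
    print(raw.decode("iso8859_1"))          # prints "za¿ó³æ" - unreadable

    # Case 2: Windows-1252 text read as ISO 8859-1 - subtle degradation.
    text = "price \u2013 \u20ac5"           # en dash and euro sign
    raw = text.encode("cp1252")
    print(raw.decode("iso8859_1"))          # dash and euro turn into C1
                                            # controls; rest stays readable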
If I were to design a charset-verifier, I would distinguish between
these two cases. If something came tagged with a region-specific
charset, I would honor that, unless I found strong evidence of the "this
can't be right" nature. In some cases, collecting such evidence would
require significant statistics. The rule here should be "do no harm";
that is, destroying a document by incorrectly changing a true charset
should receive a much higher penalty than failing to detect a broken
charset. (That way, you don't penalize people who live by the rules :).
When it comes to a document tagged with 8859-1, I might relax this
slightly, as that tag is one of the common default tags and is more
likely to have been applied blindly.
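A hypothetical sketch of that asymmetric rule in Python - the function
name, thresholds, and confidence score are illustrative assumptions on
my part, not a worked-out design:

    # Keep the declared charset unless detection clears a high bar.
    # The bar is relaxed for iso-8859-1, a tag often applied blindly.
    OVERRIDE_BAR = 0.95      # strong evidence needed in general
    DEFAULT_TAG_BAR = 0.80   # relaxed bar for the common default tag

    def choose_charset(declared, detected, confidence):
        if detected == declared:
            return declared
        bar = (DEFAULT_TAG_BAR if declared.lower() == "iso-8859-1"
               else OVERRIDE_BAR)
        # Mislabeling a correctly tagged document costs far more than
        # missing a broken tag, so trust the declared charset by default.
        return detected if confidence >= bar else declared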
When it comes to deciding whether something is the Windows code page or
the true ISO charset, the bar can be set lower - the Windows code page
is usually a superset of the ISO charset, and detecting any characters
that exist only in the superset should trigger a reassignment. Unlike
the other case, the "penalties" for getting this wrong are much less
severe.
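For the 8859-1/1252 pair, the test can be as simple as looking for
bytes in the 0x80-0x9F range, which are C1 control codes in ISO 8859-1
but printable characters (euro sign, curly quotes, dashes) in
Windows-1252. A minimal sketch, with an assumed function name:

    def promote_latin1_to_cp1252(raw, declared):
        # Bytes 0x80-0x9F are almost never intentional C1 controls in
        # web text, but are common cp1252 punctuation and symbols.
        if declared.lower() == "iso-8859-1":
            if any(0x80 <= b <= 0x9F for b in raw):
                return "windows-1252"
        return declared

    # An en dash (0x96 in cp1252) triggers the reassignment:
    promote_latin1_to_cp1252(b"price \x96 5", "iso-8859-1")
    # -> "windows-1252"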
A./
This archive was generated by hypermail 2.1.5 : Thu Jul 01 2010 - 16:52:06 CDT