From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Jul 01 2010 - 16:49:07 CDT
On 7/1/2010 11:29 AM, John Burger wrote:
> Andreas Prilop wrote:
>
>>> The problem with slavishly following the charset parameter is
>>> that it is often incorrect.
>>
>> I wonder how you could draw such a conclusion. In order to make
>> such a statement, there must be some other (god-given?) parameter,
>> which is the "real charset".
>
>
> If you have never encountered a web page in which the charset 
> parameter encoded in the page (or in the HTTP headers) did not 
> accurately reflect the "real charset", as indicated by the actual data 
> in the page, then your experience differs sharply from mine, and from 
> everyone else I have ever met.
>
Let's unravel this.
First, there are qualitative vs. quantitative arguments. Yes, mis-tagging 
occurs (for all the reasons Shawn gave in his reply). But Andreas' point 
was that for languages needing more than ASCII, there's a nice 
corrective: if many (most) viewers now base their display on the charset, 
then more documents would be expected to be correctly tagged for those 
types of text, because they tend to degrade dramatically otherwise and 
users (authors) would take action to correct the situation. The example 
of this is reading a text as 8859-1 when it is really 8859-2 (Eastern 
European).
This is different from the issue of selecting the correct charset when 
the mismatch only affects some special symbols (copyright, punctuation 
marks, the euro sign). In these cases, the text degrades in much more 
subtle ways, and usually remains readable. I would expect the incidence 
of mis-tagging in such a situation to be larger. The example for this is 
reading a text as 8859-1 when it was really 1252 (the Windows code page 
with extra characters not in ISO 8859-1 - Shawn mentioned this case as 
well).
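To make the two degradation modes concrete, here is a minimal Python 
sketch; the byte strings are made-up illustrations, not taken from any 
real page:

    east = bytes([0xB1, 0xE6])            # "ąć" if the text is really 8859-2
    print(east.decode("iso8859-2"))       # ąć  -- the intended reading
    print(east.decode("iso8859-1"))       # ±æ  -- dramatic, obviously broken

    west = b"a \x93quoted\x94 word"       # curly quotes if really windows-1252
    print(west.decode("cp1252"))          # a “quoted” word -- the intended reading
    print(repr(west.decode("iso8859-1"))) # 'a \x93quoted\x94 word' -- invisible
                                          # C1 controls; the text stays readable,
                                          # so the error is subtle

Read as 8859-1, the Eastern European text turns into obvious gibberish; 
the 1252 document merely loses a few punctuation characters.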
If I were to design a charset-verifier, I would distinguish between 
these two cases. If something came tagged with a region-specific 
charset, I would honor that, unless I found strong evidence of the "this 
can't be right" nature. In some cases, collecting such evidence would 
require significant statistics. The rule here should be "do no harm", 
that is, destroying a document by incorrectly changing a true charset 
should receive a much higher penalty than failing to detect a broken 
charset. (That way, you don't penalize people who live by the rules :).
When it comes to a document tagged with 8859-1, I might relax this 
slightly, as that tag is one of the common default tags and is more 
likely to have been applied blindly.
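As a rough sketch of that asymmetric rule (Python; the evidence scorer 
is left as a caller-supplied parameter, since collecting such evidence 
may require significant statistics, and the two thresholds are made-up 
illustrations, not calibrated values):

    from typing import Callable

    def verify_charset(data: bytes, declared: str, candidate: str,
                       evidence_against: Callable[[bytes, str], float]) -> str:
        # "Do no harm": overriding a declared, region-specific charset
        # needs overwhelming evidence, because wrongly changing a true
        # charset destroys the document, while missing a broken tag
        # merely leaves it as broken as it already was.
        # iso8859-1 is a common blind default, so its bar sits lower.
        threshold = 0.80 if declared.lower() == "iso-8859-1" else 0.95
        score = evidence_against(data, declared)  # 0..1: "this can't be right"
        return candidate if score >= threshold else declared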
When it comes to deciding whether something is a Windows code page or a 
true ISO charset, the bar can be set lower - the Windows code page is 
usually a superset of the corresponding ISO charset, and detecting any 
characters that exist only in the superset should trigger a 
reassignment. Unlike the other case, the "penalties" for getting this 
wrong are much less severe.
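Concretely, for the 8859-1/1252 pair: bytes 0x80-0x9F are C1 control 
codes under ISO 8859-1 but printable characters (curly quotes, the em 
dash, the euro sign) in Windows-1252, so a single such byte is enough to 
justify reassignment. A sketch:

    def reassign_to_windows(data: bytes, declared: str) -> str:
        # 0x80-0x9F: C1 controls in ISO 8859-1, printable in windows-1252.
        # Legitimate 8859-1 text essentially never contains C1 controls,
        # and reading them as 1252 is harmless even if it did, so any
        # such byte safely triggers the reassignment.
        if declared.lower() in ("iso-8859-1", "latin-1", "latin1"):
            if any(0x80 <= b <= 0x9F for b in data):
                return "windows-1252"
        return declared

    print(reassign_to_windows(b"he said \x93hi\x94", "ISO-8859-1"))
    # -> windows-1252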
A./