Re: Frequent incorrect guesses by the charset autodetection in IE7

From: James Kass (jameskass@att.net)
Date: Thu Jul 13 2006 - 22:43:19 CDT

    Philippe Verdy wrote,

    > > The autodetection mechanism may be broken, but it can't really be blamed
    > > for breaking the HTML code and structure. Without a character set
    > > declaration, the HTML code is already broken. No HTML validator should
    > > pass such a page.
    >
    > Why that? the HTML code is correct, except when parsed with a multibyte charset,
    > which should not occur as this is not declared, and also which should be
    > detected by the heuristic mechanism when it attempts to identify the charset.
    >
    > Note that the page does not specify the dtd version, this is then to be parsed
    > valid according to legacy HTML 3.2, and without the charset specification, an
    > ISO 8859-based charset should be used. Using ISO 8859 makes no parsing error.
    > Give me only one sentence in the HTML specs that says that the charset
    > indication is mandatory! In legacy HTML 3.2, ISO 8859-1 is even a charset whose
    > support is required, as confirmed in the normative DTDs, and the normative list
    > of named entities.

    The W3C only recommends putting the "charset" information in a meta
    tag; it does not make it mandatory.

    It should be, though. How can a parser parse if it doesn't know
    which character set to use?
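
    As a rough illustration of what a parser has to go on, here is a
    minimal Python sketch that looks for a declared charset in the first
    bytes of a page. The helper name declared_charset and the 1024-byte
    scan limit are assumptions made for this example, not anything taken
    from a browser or validator; real user agents also consult the HTTP
    Content-Type header and any byte order mark.

```python
import re

# Hypothetical helper for illustration only: find a charset declaration
# inside a <meta> tag near the start of an HTML byte stream.
_CHARSET_RE = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, scan_limit: int = 1024):
    """Return the declared charset name in lowercase, or None if absent."""
    match = _CHARSET_RE.search(raw[:scan_limit])
    if match is None:
        # No declaration: the parser is left to guess (autodetection).
        return None
    return match.group(1).decode("ascii").lower()


labelled = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head><body></body></html>'
)
unlabelled = b"<html><head><title>Croix-Rouge</title></head></html>"

labelled_result = declared_charset(labelled)
unlabelled_result = declared_charset(unlabelled)
```

    When the function returns None, as it does for the second sample,
    whatever happens next is a guess.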

    In the case of the French Red Cross page, the W3C HTML validator
    detects the character set as ISO-8859-15 and reports many errors
    in the HTML. Manually overriding the ISO-8859-15 and making the
    validator parse the web page as ISO-8859-1 still produces the same
    serious errors in the HTML code of that page.
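
    One plausible reason the override changes nothing: ISO-8859-1 and
    ISO-8859-15 assign the same characters to every ASCII byte, so the
    markup itself reads the same under either one. The Python sketch
    below compares the two using the standard codecs; the function name
    differing_bytes is made up for this example.

```python
def differing_bytes(enc_a: str, enc_b: str):
    """List the single-byte values that decode differently in two charsets."""
    diffs = []
    for value in range(256):
        single = bytes([value])
        if single.decode(enc_a) != single.decode(enc_b):
            diffs.append(value)
    return diffs


diffs = differing_bytes("iso-8859-1", "iso-8859-15")

# Markup characters such as < > " = / are all below 0x80.
markup_affected = any(value < 0x80 for value in diffs)
```

    The two charsets differ at only eight byte values, all above 0x7F
    (the euro sign at 0xA4 is the best-known one), so switching between
    them cannot repair or introduce errors in tags and attributes.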

    It would be interesting to see if correcting all the HTML errors
    would enable MSIE 7 beta to correctly auto-detect the character set.

    Quoting Chris Lilley (of w3.org)
    ( http://lists.xml.org/archives/xml-dev/199904/msg00081.html )
     "But autodetection should not be required; users can label their
    documents correctly."

    Best regards,

    James Kass

    (off topic - with regards to AT&T blocking e-mail from Orange, for
    many reasons I will be looking for a new ISP and regret that AT&T
    blocks certain incoming messages. Because of the tremendous amount
    of spam messages coming here, I was finally forced to use AT&T's
    spam filter. This spam filter is not user-configurable. It is
    surprising to hear that it blocks incoming valid messages such as
    yours while still allowing all kinds of 419 scam letters through.)


