Re: Win IE 7b2 and UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 15 2006 - 09:40:49 CDT

    From: "Keutgen, Walter" <walter.keutgen@be.unisys.com>
    > In this case clearly the server owners, authoring tool providers and authors
    > are to blame. Is it really so difficult to comply with the HTML standard
    > and tag correctly?

    Tagging is not always possible, notably on personal homepages hosted by the user's ISP on a shared server that does not allow setting server-side metadata files. So the HTTP headers are fixed for all HTML pages, whatever their content.

    Also, users don't always know what to do, and they often create pages using ISO-8859-1 (or Windows-1252) simply because it's the default encoding. And trying to change the page encoding to UTF-8 does not always work on these shared servers, which interpret things like server-side includes with only a single encoding.

    So the server sets no HTTP header for HTML pages other than the "text/html" MIME type, without specifying the charset (and the server also forbids attempts to change it in PHP pages, for various security reasons).

    This situation is quite common. So there are many pages hosted by servers that set a single HTTP header for all HTML pages, and the pages are created in ISO-8859-1 without any additional charset tagging.

    This is the case where IE will behave incorrectly, and it happens much more often than the case of pages created with broken UTF-8, which I have still never seen. If these ISO-8859 encoded pages are now interpreted as UTF-8 by a too liberal decoder in IE, it breaks those pages.

    This is caused by the broken charset autodetection in IE, which selects UTF-8 even though the page is definitely not valid UTF-8 but a completely valid ISO-8859 encoded page. Nothing in the page indicates that UTF-8 is used, so UTF-8 should not be selected by IE.
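
    As a minimal sketch of what a stricter autodetection could look like: UTF-8 is selected for an untagged page only when the whole byte stream is well-formed UTF-8. The function name detect_charset and the windows-1252 fallback are assumptions made for illustration, not IE's actual algorithm.

```python
from typing import Optional


def detect_charset(body: bytes, declared: Optional[str] = None) -> str:
    """Pick a charset for a page (hypothetical helper, not IE's algorithm)."""
    if declared:
        # An explicit HTTP or <meta> declaration always wins.
        return declared
    try:
        # Python's UTF-8 codec is strict by default: any malformed
        # sequence raises UnicodeDecodeError.
        body.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed legacy fallback for untagged Western European pages.
        return "windows-1252"
    return "utf-8"


# An untagged French page saved in ISO-8859-1 is not well-formed UTF-8.
legacy_page = "<p>Été à la française</p>".encode("iso-8859-1")
print(detect_charset(legacy_page))
```

    A pure ASCII page is reported as UTF-8 here, which is harmless because ASCII bytes decode identically in both encodings.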

    All of the above is the first point against this liberal mode. It does not help users and is more damaging than helpful. I spoke about French users because that is what I have seen or experienced (French pages displaying Han ideographs without any justifiable reason, given that the pages are definitely not UTF-8 pages), but it may affect many other European languages as well. The correction for this point lies in the charset autodetection algorithm of IE.
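
    To illustrate how a French ISO-8859-1 page can end up showing Han ideographs, here is a sketch of a decoder that trusts the leading byte and never checks the trailing bytes. It is a hypothetical model of the liberal behaviour described above (the name liberal_utf8_decode is invented), not IE's actual code.

```python
def liberal_utf8_decode(data: bytes) -> str:
    """Decode as UTF-8, trusting lead bytes and masking trail bytes unchecked."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif 0xC0 <= lead <= 0xDF:
            extra, cp = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            extra, cp = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            extra, cp = 3, lead & 0x07
        else:
            extra, cp = 0, 0xFFFD  # stray trail byte or impossible lead byte
        trail = data[i + 1 : i + 1 + extra]
        if len(trail) < extra:
            out.append("\ufffd")  # input ended in the middle of a sequence
            break
        for t in trail:
            # The flaw: the low 6 bits are used without checking
            # that the byte is really in the range 0x80..0xBF.
            cp = (cp << 6) | (t & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            cp = 0xFFFD  # not a Unicode scalar value
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


# "é" is the single byte 0xE9 in ISO-8859-1, which looks like the lead
# byte of a 3-byte UTF-8 sequence and swallows the next two characters.
text = liberal_utf8_decode("café au lait".encode("iso-8859-1"))
print(f"U+{ord(text[3]):04X}")  # U+9821
```

    The bytes E9 20 61 ("é a") collapse into the single code point U+9821, which lies in the CJK Unified Ideographs block, while a strict decoder would reject the sequence because 0x20 is not a trailing byte.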

    The second point against this liberal mode is security: malicious pages can attempt to use this bug to bypass security checks (such as filters for forbidden characters) in active components running on the client host or on a server, by using explicit UTF-8 charset tagging even though the page was created with invalid UTF-8. I really consider this a bug. Authors of security checks on UTF-8 data may know the rules about non-shortest sequences from the old RFC definition of UTF-8 and may test for them, but few know that UTF-8 trailing bytes may have invalid but working encodings, because IE only tests the leading bytes.
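
    A small sketch of this kind of bypass, under stated assumptions: the filter passes_naive_filter is a hypothetical check that forbids a literal "<" and the well-known non-shortest form C0 BC, and the arithmetic models a decoder that masks the trailing byte without validating it. Neither is taken from a real product.

```python
def passes_naive_filter(raw: bytes) -> bool:
    """Hypothetical check: forbid '<' and its non-shortest form C0 BC."""
    return b"<" not in raw and b"\xc0\xbc" not in raw


def is_well_formed_utf8(raw: bytes) -> bool:
    """Strict check: Python's decoder validates lead and trail bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Lead byte C0 followed by 0x7C ('|'), which is not a valid trail byte.
payload = b"\xc0\x7cscript>"

# What a decoder that only tests the lead byte would compute for the
# first two bytes: the low 6 bits of 0x7C are 0x3C, i.e. '<'.
lead, trail = payload[0], payload[1]
smuggled = chr(((lead & 0x1F) << 6) | (trail & 0x3F))

print(passes_naive_filter(payload))   # True: the filter sees nothing forbidden
print(smuggled == "<")                # True: the liberal decoder yields '<'
print(is_well_formed_utf8(payload))   # False: a strict decoder rejects it
```

    Rejecting any input that is not well-formed UTF-8 before applying the character filter closes this gap without having to enumerate every invalid variant.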



    This archive was generated by hypermail 2.1.5 : Mon May 15 2006 - 09:45:45 CDT