From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 17 2003 - 12:48:41 EST
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: "'Pim Blokland'" <pblokland@planet.nl>; "Unicode mailing list"
<unicode@unicode.org>
> Pim Blokland wrote:
> > Not only that, but the process making the mistake of thinking it is
> > UTF-8 also makes the mistake of not generating an error for
> > encountering malformed byte sequences,
>
> BTW, this process has a name: "Internet Explorer".
Don't blame IE too much for attempting to interpret the text as UTF-8: the
page is explicitly tagged with a UTF-8 charset. Granted, IE should stop using
this erroneous charset tag as soon as it sees a violation of the UTF-8 rules,
and should fall back to its "automatic selection" instead. It is also true
that IE still applies the legacy UTF-8 decoding, which allowed interpreting
non-shortest-form sequences.
I think this bug no longer occurs in recent updates of IE, notably since
MSHTML.DLL was corrected to close the security hole by no longer
interpreting non-shortest-form sequences. If IE really wants to keep some
compatibility, it could accept the CESU-8 encoding, but only as a possible
choice for its "automatic selection" of charsets, or display a visible
replacement character (such as a narrow white box) for invalid sequences
(which could be handled internally as if they represented U+FFFF).
But if the user forces UTF-8 decoding in the GUI, IE should still not accept
any invalid UTF-8 sequence: it should either interpret each one as an invalid
character like U+FFFF or, even better, disable this UTF-8 choice in the user
interface.
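The behaviour argued for above can be sketched in a few lines of Python. This
is only an illustration, not what IE does: the function name `decode_page`,
the `forced_utf8` flag and the ISO-8859-1 fallback (standing in for the
"automatic selection") are assumptions of mine. Note also that Python's
decoder substitutes U+FFFD, not U+FFFF, for invalid sequences.

```python
def decode_page(data: bytes, forced_utf8: bool = False,
                fallback: str = "iso-8859-1"):
    """Decode a page that is labelled UTF-8, never accepting invalid sequences.

    Returns (text, encoding_actually_used).
    """
    try:
        # Strict decoding rejects non-shortest forms such as C0 AF.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        if forced_utf8:
            # The user insisted on UTF-8: show a visible replacement
            # character (U+FFFD in Python) for each invalid sequence.
            return data.decode("utf-8", errors="replace"), "utf-8"
        # Otherwise drop the erroneous label and fall back. A real browser
        # would run charset detection here; this sketch assumes ISO-8859-1.
        return data.decode(fallback), fallback


if __name__ == "__main__":
    print(decode_page(b"caf\xe9"))                      # mislabelled Latin-1
    print(decode_page(b"\xc0\xaf", forced_utf8=True))   # non-shortest form of "/"
```

The important point is that the non-shortest form C0 AF never comes out as
"/", whichever branch is taken.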
So this is really an effect of the collision of multiple Unicode violations,
both in the User-Agent interpreting the coded strings, and in the content of
the page, incorrectly labelled UTF-8 when it is not (here: complain to your
web page designer, or blame yourself if you created this page with invalid
meta-tags).
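A page author can catch this kind of mislabelling before publishing. The
sketch below is a rough check under my own assumptions: the regular
expression is a simplification rather than a real HTML parser, and the names
`declared_charset` and `label_matches_content` are hypothetical. It only
proves that the bytes are *invalid* for the declared charset; a label such as
ISO-8859-1 accepts every byte sequence and so always passes.

```python
import codecs
import re

# Simplified pattern: finds charset=... inside the first matching <meta> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(html: bytes):
    """Return the charset named in a <meta> tag, lowercased, or None."""
    m = META_CHARSET.search(html)
    return m.group(1).decode("ascii").lower() if m else None


def label_matches_content(html: bytes) -> bool:
    """True if the page bytes decode strictly under the declared charset."""
    name = declared_charset(html)
    if name is None:
        return True  # nothing declared, so nothing to contradict
    try:
        codecs.lookup(name)
    except LookupError:
        return False  # label names an encoding Python does not know
    try:
        html.decode(name, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    head = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    print(label_matches_content(head + b"<p>caf\xe9</p>"))      # Latin-1 bytes
    print(label_matches_content(head + b"<p>caf\xc3\xa9</p>"))  # real UTF-8
```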
Beware, when editing a UTF-8 page that explicitly includes the UTF-8 charset
meta-tag, that your editor does not save it as ISO-8859-1 merely because it
thinks this will save storage space...
There are also some bogus "web site optimizers" that perform this kind of
encoding optimization (in addition to removing unnecessary spaces and new
lines, or compressing/obfuscating the JavaScript code and CSS stylesheet
class names) and don't take care to change the value of this meta-tag...
The internal encoding of a text file should never be changed automatically:
without an explicit request from the user, it requires confirmation and
logging of the actions taken.
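For what it's worth, here is a minimal sketch of how a well-behaved tool might
re-encode a page. Everything in it is my assumption (the function name
`reencode_page`, the `confirmed` flag, the simplified regular expression): it
refuses to act without confirmation, rewrites the charset label together with
the bytes, and logs what it did.

```python
import logging
import re

log = logging.getLogger("reencode")

# Simplified pattern: the charset value inside the first <meta> tag.
CHARSET_LABEL = re.compile(
    r"""(<meta[^>]*charset\s*=\s*["']?)([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def reencode_page(data: bytes, src: str, dst: str,
                  confirmed: bool = False) -> bytes:
    """Convert a page from src to dst, keeping its meta-tag in sync."""
    if not confirmed:
        raise PermissionError("re-encoding requires explicit user confirmation")
    # Strict decoding: fail instead of guessing if the source is mislabelled.
    text = data.decode(src, errors="strict")
    # Update the label so it matches the bytes that will be written.
    text = CHARSET_LABEL.sub(lambda m: m.group(1) + dst, text, count=1)
    # Strict encoding: fail if dst cannot represent some character.
    out = text.encode(dst, errors="strict")
    log.info("re-encoded page from %s to %s (%d -> %d bytes)",
             src, dst, len(data), len(out))
    return out


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    page = '<meta charset="utf-8"><p>caf\u00e9</p>'.encode("utf-8")
    print(reencode_page(page, "utf-8", "iso-8859-1", confirmed=True))
```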