As the person who implemented UTF-8 checking for http://validator.w3.org,
I beg to disagree. In order to validate correctly, the validator has
to make sure it correctly interprets the incoming byte sequence as
a sequence of characters. For this, it has to know the character
encoding. As an example, there are many files in iso-2022-jp or
shift_jis that are perfectly valid as such, but will get rejected
by some tools because they contain bytes that correspond to '<' in
ASCII as part of a double-byte character.
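To make that concrete, here is a small sketch I put together for this
mail (in Python, my choice here; it is not validator code). It simply
lists a few characters whose iso-2022-jp encoding happens to contain
the byte 0x3C, i.e. ASCII '<', so a tool that scans the raw bytes for
'<' without decoding the declared charset first will trip over them:

    # Illustration only: find characters whose iso-2022-jp bytes
    # include 0x3C ('<' in ASCII) inside a double-byte character.
    found = []
    for cp in range(0x3000, 0xA000):        # rough CJK range
        ch = chr(cp)
        try:
            raw = ch.encode("iso2022_jp")
        except UnicodeEncodeError:
            continue                        # not in JIS X 0208
        if b"<" in raw:                     # 0x3C appears in the encoding
            found.append((ch, raw))
            if len(found) >= 5:
                break
    for ch, raw in found:
        print("U+%04X %s -> %r" % (ord(ch), ch, raw))

Run as is, this prints characters such as U+30FC (the katakana
prolonged sound mark, JIS 0x213C), whose second byte is exactly 0x3C.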
So the UTF-8 check is just to make sure we validate something
reasonable, and to avoid GIGO (garbage in, garbage out).
Of course, this cannot be avoided completely; the validator
has no way to check whether something that is sent in as
iso-8859-1 would actually be iso-8859-2. (Humans can check
by looking at the source.)
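To illustrate (again just a quick Python sketch, not part of the
validator): the very same byte is valid in both encodings, it just
means a different letter, so nothing in the byte stream can tell the
two apart.

    # Illustration only: one byte, two equally "valid" readings.
    data = bytes([0xE6])
    print(data.decode("iso-8859-1"))   # 'æ' in Latin-1
    print(data.decode("iso-8859-2"))   # 'ć' in Latin-2

Only a human who reads the decoded text can say which one was intended.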
Regards, Martin.
At 12:26 01/12/14 -0800, James Kass wrote:
>There is so much text on the web using many different
>encoding methods. Big-5, Shift-JIS, and similar encodings
>are fairly well standardised and supported. Now, in addition
>to UTF-8, a web page might be in UTF-16 or perhaps even
>UTF-32, eventually. Plus, there's a plethora of non-standard
>encodings in common use today. An HTML validator should
>validate the mark-up, assuring an author that (s)he hasn't
>done anything incredibly dumb like having two </title>
>tags appearing consecutively. Really, this is all that we should
>expect from an HTML validator. Extra features such as
>checking for invalid UTF-8 sequences would probably be most
>welcome, but there are other tools for doing this which an
>author should already be using.
>
>Best regards,
>
>James Kass.
>