Re: Clean and Unicode compliance

From: James Kass (jameskass@worldnet.att.net)
Date: Sun Dec 16 2001 - 23:50:24 EST


Martin Duerst wrote,

> As the person who implemented UTF-8 checking for http://validator.w3.org,
> I beg to disagree. In order to validate correctly, the validator has
> to make sure it correctly interprets the incoming byte sequence as
> a sequence of characters. For this, it has to know the character
> encoding. As an example, there are many files in iso-2022-jp or
> shift_jis that are perfectly valid as such, but will get rejected
> by some tools because they contain bytes that correspond to '<' in
> ASCII as part of a double-byte character.
>

Excellent example. It hadn't occurred to me that, in certain
encodings, the byte for the less-than sign can turn up inside a
multi-byte character.
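
A minimal sketch of the problem, assuming Python and its standard
"iso2022_jp" codec. The helper names and the sample bytes are
hypothetical, made up for illustration; they are not taken from the
W3C validator. The sample wraps one double-byte character whose two
bytes are both 0x3C (the ASCII code for '<') in a pair of <p> tags,
then counts '<' before and after decoding.

```python
# Escape sequences that ISO-2022-JP uses to switch character sets.
ESC_JIS_X_0208 = b"\x1b$B"  # two bytes per character from here on
ESC_ASCII = b"\x1b(B"       # back to one byte per character


def count_lt_in_bytes(raw: bytes) -> int:
    """Count 0x3C bytes, as an encoding-unaware tool might."""
    return raw.count(b"<")


def count_lt_in_text(raw: bytes, encoding: str) -> int:
    """Decode first, then count real '<' characters."""
    return raw.decode(encoding).count("<")


# Hypothetical sample: <p>, one double-byte character encoded as
# the byte pair 0x3C 0x3C, then </p>.
SAMPLE = b"<p>" + ESC_JIS_X_0208 + b"\x3c\x3c" + ESC_ASCII + b"</p>"

if __name__ == "__main__":
    print("byte-level '<' count:", count_lt_in_bytes(SAMPLE))
    print("character-level '<' count:", count_lt_in_text(SAMPLE, "iso2022_jp"))
```

The byte-level count is higher than the character-level count because
two of the 0x3C bytes belong to a single double-byte character rather
than to markup.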

HTML validators need to be aware of the encoding used in the
file. Based on your comments and other comments in this thread,
I concede the point. A validator should verify that the plain
text portion of an HTML file is properly encoded and well formed.
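
As a rough illustration of that kind of check, here is a sketch that
assumes Python's strict decoding is an acceptable stand-in for a
well-formedness test. The function name is hypothetical, and this is
not how any particular validator is implemented.

```python
def is_well_formed(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes cleanly in the declared encoding."""
    try:
        raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<p>caf\u00e9</p>".encode("utf-8"), "utf-8"))
    # 0xC0 0xAF is an overlong form that strict UTF-8 decoding rejects.
    print(is_well_formed(b"<p>\xc0\xaf</p>", "utf-8"))
```

A real validator would also need to determine the declared encoding
(HTTP header, meta element) before a check like this can be applied.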

Best regards,

James Kass.


