From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 14:21:53 CDT
Philippe Verdy writes:
> This is absolutely not needed for a charset detector (i.e. the detection of
> the encoding used to serialize the text). HTML escapes are perfectly valid
> in HTML, and even if they refer to non Latin-1 characters, this does not
> change the fact that the page remains encoded in ISO-8859-1.
>
> You don't need to take HTML escapes into account with regard to which
> encoding is used, because these escapes are independent of the actual
> encoding used.
Agreed. But if you are interested in the language of the page as well
as the encoding, which some applications do care about, then you have
to take these escapes into account. And, as I said, building a model
that accounts for language as well as encoding can help differentiate
the various Latin-n versions.
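
To make the distinction concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular detector works: the function name text_for_language_id and the sample bytes are made up for the example, and the charset is assumed to have been detected already. The charset decision looks only at the raw bytes, while the text handed to a language identifier has its escapes expanded first.

```python
import html

def text_for_language_id(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Decode with the (already detected) charset, then expand HTML escapes.

    Hypothetical helper for illustration only.
    """
    return html.unescape(raw.decode(charset))

# The bytes are pure ASCII, so the page is validly ISO-8859-1 encoded,
# yet most of the visible text is Cyrillic once the escapes are expanded.
raw = b"<p>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;, &eacute;t&eacute;</p>"

text = text_for_language_id(raw)

# Characters that cannot be represented in Latin-1 at all; a language
# model sees these only if the escapes were taken into account.
non_latin1 = [ch for ch in text if ord(ch) > 0xFF]
print(text)
print(len(non_latin1), "characters outside Latin-1")
```

The charset answer is the same with or without the escapes; only the input to the language model changes.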
-- 
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)