From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Feb 08 2010 - 23:41:03 CST
"Doug Ewell" wrote:
> What about option 1½: Use charset detection, assisted by the charset
> tagging. That is, if the content is valid UTF-8 or UTF-16, or something
> else unambiguous like GB18030, ignore the tagging and trust the
> detection algorithm fully. But if the algorithm shows that it could
> reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
> trust the tag. Just a thought.
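Just to make that proposal concrete, here is a rough sketch of the decision logic in Python (a toy of my own, not any real detector; GB18030 and the other unambiguous cases are omitted for brevity):

    def utf8_valid(data):
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def choose_charset(data, declared=None):
        # Step 1: structurally unambiguous content wins outright; ignore the tag.
        if data.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "UTF-16"
        if utf8_valid(data):
            return "UTF-8"
        # Step 2: detection alone cannot separate ISO 8859-1/-2/-15 (or their
        # Windows counterparts), so a plausible declared tag breaks the tie.
        latin_family = {"ISO-8859-1", "ISO-8859-2", "ISO-8859-15",
                        "windows-1250", "windows-1252"}
        if declared in latin_family:
            return declared
        # Step 3: no usable tag; fall back to the most common default.
        return "windows-1252"

    print(choose_charset("naïve".encode("latin-1"), "ISO-8859-2"))  # ISO-8859-2
    print(choose_charset("naïve".encode("utf-8"), "ISO-8859-2"))    # UTF-8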
One common cause of unreliable identification of ISO 8859-1 or -2 is that they are frequently replaced by their Windows-1252 or Windows-1250 "extensions".
Including those common replacements (notably since they have now been approved in HTML5) suggests that these equivalences should simply be accepted. Who actually uses the C1 controls in HTML? Only one C1 control (the one used as a newline) is standard in HTML 3/4, and it has been a very long time since I last saw a page using it; apparently it only came from IBM systems through automatic conversion from some EBCDIC variant, and even those systems now support the ISO 8859 charsets by ignoring the differences between newlines, so they accept CR/LF equally.
If the algorithm treats the ISO 8859-x tag as unreliable because the page contains some Windows-125x characters (in the code range 0x80-0x9F), it is probably wrong: assume Windows-125x instead, and use that as the secondary indicator (after the statistical estimation heuristic).
Some characters are also good indicators that an ISO 8859-x (or Windows-125x) charset is present, the most frequent being NBSP (U+00A0), which now appears in a great many pages (notably within empty table cells used for page layout). Its presence immediately distinguishes the 8-bit charsets (ISO 8859-x, Windows-125x) from the UTFs and from other reliably detectable encodings such as GB18030 in China, or even the JIS variants in Japan and the KSC variants in South Korea.
But other indicators are also important: statistics based on isolated characters alone will not be reliable enough. For example, the detection of NBSP is reliable within specific contexts, such as after the ">" ending an HTML tag, between a letter and some common punctuation marks, or between digits.
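A detector could score those contexts directly on the raw bytes; something like this sketch, which only covers the few cases just mentioned:

    import re

    # Count how many raw 0xA0 bytes sit where an NBSP plausibly belongs: right
    # after the ">" closing a tag, or between a word character and a digit or
    # punctuation mark.  A high ratio points to an 8-bit charset; in UTF-8 the
    # same NBSP would appear as the two bytes 0xC2 0xA0 instead.
    NBSP_CONTEXT = re.compile(rb">\xa0|\w\xa0[\d;:!?\xbb]")

    def nbsp_context_ratio(data):
        total = data.count(0xA0)
        if total == 0:
            return 0.0
        return len(NBSP_CONTEXT.findall(data)) / total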
Does Google use such a context-based heuristic to improve its detector? That is, does it look for ordered pairs or triplets of bytes, and does it adjust its statistical thresholds based on the exposed document MIME type (HTML, CSS, JavaScript, or plain text), which should be an indicator reliable enough to always trust?
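For what it's worth, the kind of byte-pair statistic I have in mind is nothing more than this; the frequency tables themselves would have to be trained per charset and per MIME type, and none are included here:

    from collections import Counter

    def bigram_score(data, bigram_freq):
        # Average trained frequency of the document's byte pairs under one
        # candidate charset; the candidate with the highest score wins.
        pairs = Counter(zip(data, data[1:]))
        total = sum(pairs.values())
        if total == 0:
            return 0.0
        return sum(bigram_freq.get(p, 0.0) * n for p, n in pairs.items()) / total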
XML is normally not ambiguous (its autodetection algorithm is fully specified, for US-ASCII and the UTFs only) and should not even need a custom detector. But this assumption may fail, notably for various syndicated RSS feeds built by poorly configured PHP-based sites that don't use any XML-based DOM but simply concatenate strings that mostly look like valid XML with the correct schema. Google probably has statistics about such errors, and I don't know how RSS readers cope with them; maybe there are Windows-125x exceptions there too.
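For reference, the specified XML detection really is that mechanical; a trimmed version (the UCS-4 and EBCDIC rows of the spec's table are omitted here) looks like this:

    # The first bytes of an XML entity identify the encoding family before the
    # encoding= pseudo-attribute is even parsed (XML 1.0, Appendix F).
    def sniff_xml_encoding(head):
        if head.startswith(b"\xef\xbb\xbf"):
            return "UTF-8"               # UTF-8 BOM
        if head.startswith(b"\xfe\xff"):
            return "UTF-16BE"            # UTF-16 BOM, big-endian
        if head.startswith(b"\xff\xfe"):
            return "UTF-16LE"            # UTF-16 BOM, little-endian
        if head.startswith(b"\x00<\x00?"):
            return "UTF-16BE"            # "<?" without a BOM, big-endian
        if head.startswith(b"<\x00?\x00"):
            return "UTF-16LE"            # "<?" without a BOM, little-endian
        if head.startswith(b"<?xm"):
            return "ASCII-compatible"    # read encoding= from the declaration
        return "UTF-8"                   # no declaration: the default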
What about the more specific encodings used in AJAX requests (for example JSON-formatted data, commonly used instead of XML)? Is a charset detector applied to those requests in Chrome or Chromium, does it use specific heuristics, and is there a way to disable it completely and force the indicated charset to be used, or to return a decoding error if that charset turns out to be wrong?
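At the application level, the strict behaviour I'd like is trivial to express, for comparison (this is plain application-side code, not anything Chrome itself exposes):

    import json

    def parse_json_strict(payload, declared_charset="utf-8"):
        # No guessing: decode with the declared charset only, and let the
        # UnicodeDecodeError surface if the payload does not match it.
        return json.loads(payload.decode(declared_charset, errors="strict"))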
What if the MIME type is also wrong or unknown (or is an unknown alias)? Will Chrome handle the content as if it were plain text (for example if the exposed MIME type still matches "text/*", but not "image/*" or "application/*")? Is there a MIME type detector in that case?