From: Doug Ewell (doug@ewellic.org)
Date: Mon Feb 08 2010 - 20:43:57 CST
Mark Davis ☸ wrote:
> There are really two methodologies in question.
>
> 1. Accept the charset tagging without question.
> 2. Use charset detection, which uses a number of signals. The primary
> signal is a statistical analysis of the bytes in the document, but the
> charset tagging is taken into account (and can sometimes make a
> difference).
>
> The issue is which of these, on balance, produces better results for
> web pages and other documents. With fairly exhaustive side-by-side
> comparisons across encodings, it is clear that #2 does, overwhelmingly.
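[A minimal sketch of methodology #2 in Python, assuming the chardet
library supplies the statistical byte analysis; chardet.detect() is a
real call, but letting the tag win only below a 0.5 confidence
threshold is an illustrative assumption, not how any particular
detector actually weighs its signals:

    import chardet

    def sniff_charset(data, declared=None):
        # Primary signal: statistical analysis of the bytes.
        guess = chardet.detect(data)
        detected, confidence = guess["encoding"], guess["confidence"]
        # Secondary signal: the charset tag. Here it may override only
        # when the statistical evidence is weak (0.5 is illustrative).
        if declared and (detected is None or confidence < 0.5):
            return declared
        return detected or declared or "windows-1252"

So sniff_charset(body, declared=http_charset) would follow the
statistics for a clearly UTF-8 page even if the header said 8859-1.]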
What about option 1½: use charset detection, assisted by the charset
tagging. That is, if the content is valid UTF-8 or UTF-16, or something
else unambiguous like GB18030, ignore the tagging and trust the
detection algorithm fully. But if the algorithm shows that the content
could reasonably be any of ISO 8859-1, -2, or -15, and it is tagged as
8859-2, trust the tag. Just a thought.
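[A sketch of that option 1½ under the same chardet assumption. Strict
decoding is a real validity test, and chardet.detect_all() really does
return candidates ranked by confidence, but the 0.1 "reasonably close"
gap is an invented tiebreak threshold:

    import chardet

    def hybrid_charset(data, tag=None):
        # Valid UTF-8 is a near-unambiguous signal: trust the
        # detection fully and ignore the tag.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        # Otherwise rank the remaining candidates by confidence.
        candidates = chardet.detect_all(data)
        best = candidates[0]
        # If the tag names one of the statistically close candidates
        # (say, several 8859 variants), trust the tag.
        close = [c["encoding"].lower() for c in candidates
                 if c["encoding"]
                 and best["confidence"] - c["confidence"] < 0.1]
        if tag and tag.lower() in close:
            return tag
        return best["encoding"] or tag or "windows-1252"

Note that only UTF-8 is tested by strict decoding here: GB18030 (and
BOM-less UTF-16) will happily decode almost any byte stream, so an
"unambiguous" test for those needs real heuristics, not a try/except.]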
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ http://is.gd/2kf0s