From: Michael D'Errico (mike-list@pobox.com)
Date: Mon Feb 08 2010 - 23:03:38 CST
Can anyone point me to a reference for converting between GB18030
and Unicode (in English)?
Thanks,
Mike
Doug Ewell wrote:
> Mark Davis ☸ wrote:
>
>> There are really two methodologies in question.
>>
>> 1. Accept the charset tagging without question.
>> 2. Use charset detection, which uses a number of signals. The primary
>> signal is a statistical analysis of the bytes in the document, but the
>> charset tagging is taken into account (and can sometimes make a
>> difference).
>>
>> The issue is which of these, on balance, produces better
>> results for web pages and other documents. And with pretty exhaustive
>> side-by-side comparisons of encodings, it is clear that #2 does,
>> overwhelmingly.
>
> What about option 1½: Use charset detection, assisted by the charset
> tagging. That is, if the content is valid UTF-8 or UTF-16, or something
> else unambiguous like GB18030, ignore the tagging and trust the
> detection algorithm fully. But if the algorithm shows that it could
> reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
> trust the tag. Just a thought.
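
For concreteness, here is a rough Python sketch of the "option 1½" idea
quoted above. It is only an illustration under stated assumptions, not
anyone's actual detector: the function name pick_charset, the
AMBIGUOUS_LATIN set, and the fallback default are all made up for this
example. The "unambiguous" step covers only BOMs and strict UTF-8
validation; GB18030 and any statistical analysis are left out.

```python
import codecs
from typing import Optional

# Hypothetical set of single-byte charsets that byte-level checks
# cannot tell apart; taken from the examples in the quoted message.
AMBIGUOUS_LATIN = {"iso-8859-1", "iso-8859-2", "iso-8859-15"}


def pick_charset(data: bytes, tag: Optional[str] = None,
                 fallback: str = "iso-8859-1") -> str:
    """Return a charset name for data, preferring unambiguous evidence."""
    # Step 1: unambiguous signals override the tag entirely.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Non-ASCII bytes that decode cleanly as UTF-8 are treated as UTF-8.
    if any(b >= 0x80 for b in data):
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
    # Step 2: the bytes could be any of several Latin charsets,
    # so trust the tag if it names one of them.
    if tag is not None:
        name = tag.strip().lower()
        if name in AMBIGUOUS_LATIN:
            return name
    # Step 3: no usable tag; fall back to an assumed default.
    return fallback


# Tagged 8859-2, but the bytes are valid UTF-8: the tag is ignored.
print(pick_charset("é".encode("utf-8"), tag="ISO-8859-2"))
# Bytes are not valid UTF-8 and the tag is plausible: the tag wins.
print(pick_charset("łódź".encode("iso-8859-2"), tag="ISO-8859-2"))
```

A real detector would replace step 1 with something much richer, but the
control flow is the point: the tag only breaks ties the bytes cannot.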