GB18030 (was Re: FYI: Google blog on Unicode)

From: Michael D'Errico (mike-list@pobox.com)
Date: Mon Feb 08 2010 - 23:03:38 CST

Next message: verdy_p: "Re: FYI: Google blog on Unicode"

Previous message: Doug Ewell: "Re: FYI: Google blog on Unicode"
In reply to: Doug Ewell: "Re: FYI: Google blog on Unicode"
Next in thread: Peter Krefting: "Re: GB18030"
Reply: Peter Krefting: "Re: GB18030"

Can anyone point me to an English-language reference for converting
between GB18030 and Unicode?

Thanks,

Mike
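
For readers who only need the conversion itself rather than the mapping tables, here is a minimal sketch that assumes Python 3 and its built-in "gb18030" codec. The helper names (to_gb18030, from_gb18030) are made up for illustration, and this is not a substitute for the reference the question asks about.

```python
# Sketch: round-trip text between Unicode and GB18030 using Python's
# built-in "gb18030" codec. Helper names are hypothetical.

def to_gb18030(text: str) -> bytes:
    # Strict mode raises UnicodeEncodeError on anything the codec cannot encode.
    return text.encode("gb18030", errors="strict")

def from_gb18030(data: bytes) -> str:
    # Strict mode raises UnicodeDecodeError on malformed byte sequences.
    return data.decode("gb18030", errors="strict")

if __name__ == "__main__":
    # A two-byte character, a BMP character outside the two-byte area,
    # and a supplementary-plane character.
    for ch in ("\u4e2d", "\u0080", "\U00010000"):
        encoded = to_gb18030(ch)
        print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
        assert from_gb18030(encoded) == ch
```

GB18030 is a variable-width encoding (one, two, or four bytes per character), so the byte counts printed above differ from character to character.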

Doug Ewell wrote:
> Mark Davis ☸ wrote:
>
>> There are really two methodologies in question.
>>
>> 1. Accept the charset tagging without question.
>> 2. Use charset detection, which uses a number of signals. The primary
>> signal is a statistical analysis of the bytes in the document, but the
>> charset tagging is taken into account (and can sometimes make a
>> difference).
>>
>> The issue is which of these, on balance, produces better results
>> for web pages and other documents. And with pretty exhaustive
>> side-by-side comparisons of encodings, it is clear that #2 does,
>> overwhelmingly.
>
> What about option 1½: Use charset detection, assisted by the charset
> tagging. That is, if the content is valid UTF-8 or UTF-16, or something
> else unambiguous like GB18030, ignore the tagging and trust the
> detection algorithm fully. But if the algorithm shows that it could
> reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
> trust the tag. Just a thought.
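
The "option 1½" rule quoted above can be sketched roughly as follows. This is only an illustration of the decision logic, not anyone's actual detector: the function name pick_charset, the UNAMBIGUOUS set, and the idea that a detector hands back a ranked list of plausible charsets are all assumptions made for the example.

```python
# Sketch of "detection assisted by tagging". All names are hypothetical.

# Encodings treated as unambiguous once the detector settles on them,
# following the examples given in the message (UTF-8, UTF-16, GB18030).
UNAMBIGUOUS = {"utf-8", "utf-16", "gb18030"}

def pick_charset(candidates, tag=None):
    """Choose a charset from detector output and an optional declared tag.

    candidates: charset names the detector considers plausible, best first.
    tag: the charset declared by the document or transport, or None.
    """
    names = [c.lower() for c in candidates]
    tag = tag.lower() if tag else None

    if not names:
        # The detector had no opinion, so the tag is all we have.
        return tag
    if names[0] in UNAMBIGUOUS:
        # Unambiguous content: ignore the tag and trust detection fully.
        return names[0]
    if tag in names:
        # Several encodings are plausible and the tag is one of them.
        return tag
    # The tag is absent or implausible: fall back to the best guess.
    return names[0]

if __name__ == "__main__":
    print(pick_charset(["utf-8"], "iso-8859-1"))
    print(pick_charset(["iso-8859-1", "iso-8859-2", "iso-8859-15"], "iso-8859-2"))
```

In this sketch the tag only ever breaks ties among encodings the detector already considers plausible; it never overrides a detection result from the unambiguous set.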

