Re: GBK, HZ and EUC-TW

From: Tom Emerson (tree@basistech.com)
Date: Sat Jan 06 2001 - 23:19:35 EST


Lars Marius Garshol writes:
> I am currently trying to implement converters for the three encodings
> mentioned in the subject, and am missing some pieces. If anyone has
> a GBK-to-Unicode mapping table or knows of example web pages or
> text documents in any of these encodings, I would be happy to hear
> about them.

GBK is defined in an annex to GB 13000 (essentially the PRC
translation of Unicode 1.1) in order to bring GB 2312:80 in line with
the rest of the ideographs in the I block of Unicode. As others have
mentioned, this is also the character set adopted by Microsoft for
CP936.

Ken Lunde's "CJKV Information Processing" has a good description of
the evolution of the GB standards and the relationships among them.

As far as mapping tables go, the best ones you'll find are the
Microsoft and ICU mapping tables. I personally have not seen an
official mapping table from GB 13000. As others have noted, Microsoft
has extended "pure" GBK with the Euro sign, and perhaps other code
points.

GB 2312:80 is a proper subset of GBK, so you can map EUC-CN encoded
text to Unicode using a GBK mapping table. Be aware, though, that
going the other direction can be problematic: GBK contains code
points that do not exist in GB 2312:80, so you need to check for
characters that cannot be encoded in EUC-CN and decide how to handle
them.
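
To make the asymmetry concrete, here is a minimal sketch in Python. It
assumes Python's built-in "gbk" and "gb2312" codecs as stand-ins for
the mapping tables discussed above; they are not an official GB 13000
table, and the helper names are made up for illustration.

```python
# Sketch only: Python's bundled codecs stand in for real mapping tables.
# The function names below are hypothetical, chosen for this example.

def decode_euc_cn_via_gbk(data: bytes) -> str:
    """Decode EUC-CN (GB 2312) bytes using the GBK table."""
    return data.decode("gbk")


def fits_in_gb2312(text: str) -> bool:
    """Return True if every character can be encoded as EUC-CN."""
    try:
        text.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


# EUC-CN bytes for U+4E2D U+6587 ("Chinese language").
euc_cn_bytes = b"\xd6\xd0\xce\xc4"
text = decode_euc_cn_via_gbk(euc_cn_bytes)

# Forward direction: the GBK table handles EUC-CN input.
print(text == "\u4e2d\u6587")

# Reverse direction: U+570B (a traditional form) is in GBK but not in
# GB 2312, so it must be detected before writing EUC-CN output.
print(fits_in_gb2312("\u4e2d\u6587"))
print(fits_in_gb2312("\u570b"))
```

What to do with a character that fails the check (substitute, escape,
or reject the document) is a policy decision for your converter.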

HTH,

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


