More than you wanted to know about GB2312

From: Tom Emerson (tree@basistech.com)
Date: Thu Nov 02 2000 - 21:25:18 EST


This message provides a brief description of how the GB2312 encoding
(really EUC-CN, GB2312 is properly a character set, not an encoding)
works, including how to convert between row-cell and hex notation, and
what an octet stream looks like when it contains GB2312 code points.

By way of exposition, I'll use the Simplified Characters for
Zhong1guo2 (China), U+4E2D U+56FD.

The GB2312 hex values for these characters are 0x5650 and 0x397A. To
convert these to row-cell, subtract 0x2020 from each and convert each
byte of the result to decimal:

GB2312
Hex Value     0x5650     0x397A
            - 0x2020   - 0x2020
            --------   --------
              0x3630     0x195A
Row-Cell       54-48      25-90

So the row-cell values for these characters are 54-48 and 25-90.
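
For anyone who would rather let the machine do the subtraction, here is a
minimal Python sketch of the same arithmetic. The function name
gb_hex_to_row_cell is my own invention for illustration, not part of any
library:

```python
def gb_hex_to_row_cell(code):
    """Split a 16-bit GB2312 hex value such as 0x5650 into (row, cell)."""
    hi, lo = code >> 8, code & 0xFF
    # Each byte of the hex value is 0x20 above its row or cell number.
    return hi - 0x20, lo - 0x20


for code in (0x5650, 0x397A):
    row, cell = gb_hex_to_row_cell(code)
    print(f"0x{code:04X} -> {row}-{cell}")
# prints:
# 0x5650 -> 54-48
# 0x397A -> 25-90
```

Subtracting 0x20 from each byte separately gives the same result as
subtracting 0x2020 from the whole value, since no borrow crosses the byte
boundary for valid GB2312 codes.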

In a text stream, GB2312 is encoded using an 8-bit encoding,
EUC-CN. Since GB2312 is a 7-bit character set, the high bit of each
byte is set to differentiate the Chinese characters from ASCII, making
them 8-bit. To accomplish this, you add 0x80 to each byte of the hex
value, or 0xA0 to each byte of the row-cell value (which makes sense,
since each row-cell byte is 0x20 less than the corresponding hex byte,
and adding 0x80 to the hex byte creates the EUC-CN byte). So:

GB2312
Hex Value     0x5650     0x397A
            + 0x8080   + 0x8080
            --------   --------
EUC-CN        0xD6D0     0xB9FA

And indeed, if you create a GB2312-encoded (EUC-CN) file containing
Zhong1guo2 and then look at the hex values, this is what you will see:
the four octets D6 D0 B9 FA. RFC 1922 (which defines ISO-2022-CN) calls
this encoding CN-GB.
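
If you want to check this without a hex editor, the following Python sketch
builds the EUC-CN octets both ways (from the hex value and from row-cell) and
compares them against what Python's built-in "gb2312" codec produces. The two
helper names are hypothetical, made up for this example; the codec is the one
that ships with Python's standard library:

```python
def gb_hex_to_euc_cn(code):
    """Set the high bit on both bytes of a GB2312 hex value."""
    hi, lo = code >> 8, code & 0xFF
    return bytes([hi | 0x80, lo | 0x80])


def row_cell_to_euc_cn(row, cell):
    """Add 0xA0 to the row and cell numbers to get the EUC-CN bytes."""
    return bytes([row + 0xA0, cell + 0xA0])


euc = gb_hex_to_euc_cn(0x5650) + gb_hex_to_euc_cn(0x397A)
print(euc.hex())  # d6d0b9fa

# The row-cell route gives the same four octets.
assert euc == row_cell_to_euc_cn(54, 48) + row_cell_to_euc_cn(25, 90)

# Cross-check against Python's own codec, using U+4E2D U+56FD.
assert euc == "\u4e2d\u56fd".encode("gb2312")
```

Using a bitwise OR with 0x80 rather than an addition is just a stylistic
choice here; for valid 7-bit GB2312 bytes the two are equivalent.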

I know this is confusing, but hopefully this has helped a bit.

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


