Re: gb2312

From: Jungshik Shin (jshin@mailaps.org)
Date: Tue Apr 10 2001 - 10:00:29 EDT


On Tue, 10 Apr 2001, Tomas McGuinness wrote:

> Is the character set gb2312 encoded in a two octet scheme? If so does it pad
> out its ascii characters to two octets e.g. the character < is 0x3C in ascii
> so does it become 0x003C in gb2312?

  No !! In EUC-CN (which is a better name for what you're calling GB2312),
each character in US-ASCII/ISO 646 is represented exactly the same
way as it's represented in US-ASCII and ISO-8859-x, so '<' stays the
single octet 0x3C.
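
   A quick way to check this is sketched below in Python. It assumes
Python's built-in "gb2312" codec, which implements the EUC-CN encoding
scheme; the variable names are only illustrative.

```python
# Sketch: ASCII characters keep their single-octet values in EUC-CN.
# Python registers the EUC-CN encoding under the codec name "gb2312".
ascii_bytes = "<".encode("gb2312")    # a US-ASCII character
hanzi_bytes = "中".encode("gb2312")   # a GB 2312-80 character

print(ascii_bytes.hex())   # 3c   (one octet, not 003c)
print(hanzi_bytes.hex())   # d6d0 (two octets, both with the MSB set)
```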

   Your confusion arose from the careless/unfortunate mix-up of CES
(Character Encoding Scheme) and CCS (Coded Character Set) (I'm
following the terminology of IETF RFC 2130). Strictly speaking,
GB 2312-80 is a 94 x 94 (2-byte) coded character set and as such it
does NOT include any character from US-ASCII / ISO 646. That's why the
mapping table for GB2312 in the Unicode ftp archive does not have US-ASCII
characters (all the values in the table have the MSB unset, so you have
to add 0x8080 to get EUC values, as pointed out on this list before).
What is typically referred to as GB2312 (the encoding) should have been
called EUC-CN (to avoid this kind of confusion). EUC-CN is a multibyte
CES, one of a few CESs that use GB 2312-80 as a CCS along with other
CCSs like US-ASCII/ISO 646.
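
   The 0x8080 offset can be sketched as follows. This is only an
illustration: gb2312_to_euc is a hypothetical helper name, and the
check against Python's "gb2312" codec assumes that codec implements
EUC-CN.

```python
def gb2312_to_euc(code):
    """Turn a 2-byte GB 2312-80 code (MSB unset) into EUC-CN octets."""
    hi, lo = code >> 8, code & 0xFF
    # Both bytes of a 94 x 94 set lie in 0x21-0x7E.
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a valid GB 2312-80 code: 0x%04X" % code)
    # Adding 0x8080 sets the MSB of each byte, moving the pair into GR.
    return (code + 0x8080).to_bytes(2, "big")


euc = gb2312_to_euc(0x5650)   # 0x5650 is the GB 2312-80 code of U+4E2D
print(euc.hex())              # d6d0
print(euc.decode("gb2312"))   # 中
```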

   In EUC-CN, US-ASCII/ISO 646 is invoked on GL (that is, the octet range
[0x21-0x7E]) as G0 and GB 2312-80 is invoked on GR ([0xA1-0xFE]). Because
it's a 94x94 2-byte character set, the actual octets used are *pairs* of
bytes, each in [0xA1-0xFE]. Here's a summary of the EUC-CN encoding scheme
(quoted from "CJKV Information Processing" by Ken Lunde. If you have to
deal with CJKV text/encoding, you'd better buy this book):

                          range
  CCS 0 (codeset 0) 0x21-0x7E : US-ASCII/ISO 646
  CCS 1 (codeset 1)
     1st byte: 0xA1-0xFE : GB 2312-80
     2nd byte: 0xA1-0xFE
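
   The ranges above are enough to split an EUC-CN octet stream into
characters. The sketch below is only an illustration: split_euc_cn is
a hypothetical helper, and it assumes that every octet in 0x00-0x7F
(controls and space included, not just 0x21-0x7E) is a single-octet
character.

```python
def split_euc_cn(data):
    """Split EUC-CN octets into per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # Codeset 0: one octet (assumption: all of 0x00-0x7F).
            chars.append(data[i:i + 1])
            i += 1
        elif 0xA1 <= b <= 0xFE:
            # Codeset 1: two octets, both in 0xA1-0xFE.
            if i + 1 >= len(data) or not (0xA1 <= data[i + 1] <= 0xFE):
                raise ValueError("bad second byte at offset %d" % (i + 1))
            chars.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError("octet 0x%02X not valid at offset %d" % (b, i))
    return chars


print(split_euc_cn(b"<\xd6\xd0>"))   # [b'<', b'\xd6\xd0', b'>']
```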

  Jungshik Shin


