RE: gb2312

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Apr 10 2001 - 08:37:26 EDT


Tomas McGuinness wrote:
> Is the character set gb2312 encoded in a two octet scheme?

It is one of the so-called "double byte character sets" (DBCS), but this
name is misleading, because not every character takes two bytes:
"multibyte character set" (MBCS) is a more accurate term.

> If so does it pad out its ascii characters to two octets
> e.g. the character is 0x3C in ascii so does it become
> 0x003C in gb2312?

No. ASCII characters are represented as a single byte. All other characters
are represented as pairs of bytes, called "lead byte" and "trail byte".
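
A quick way to check this is the small Python sketch below. It assumes
Python's built-in "gb2312" codec, which implements the 8-bit (EUC) form of
GB2312; the sample characters are my own choice, not taken from the question.

```python
# Assumption: Python's "gb2312" codec is the EUC (8-bit) form of GB2312.
ascii_char = "<"        # U+003C, the character from the question
hanzi_char = "\u4e2d"   # U+4E2D, a common Chinese character ("middle")

ascii_bytes = ascii_char.encode("gb2312")
hanzi_bytes = hanzi_char.encode("gb2312")

# The ASCII character stays one byte; it is NOT padded to 0x003C.
print(len(ascii_bytes), ascii_bytes.hex())   # 1 3c
# The non-ASCII character becomes a lead byte plus a trail byte.
print(len(hanzi_bytes), hanzi_bytes.hex())   # 2 d6d0
```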

In the 7-bit encodings (such as ISO-2022-CN or HZ), the same byte values are
used for both single-byte and double-byte characters. Escape sequences in
the data stream switch between the two interpretations.
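
As a rough illustration of the 7-bit case, the sketch below uses HZ, one
7-bit scheme for GB2312 that Python ships as the "hz" codec. Picking HZ is
my assumption; other 7-bit schemes use different escape sequences.

```python
# Assumption: HZ is used as the 7-bit example; "~{" and "~}" are its
# sequences for entering and leaving double-byte mode.
text = "A\u4e2d"                 # ASCII "A" followed by U+4E2D

euc = text.encode("gb2312")      # 8-bit form: b'A\xd6\xd0'
hz = text.encode("hz")           # 7-bit form, every byte below 0x80

# Clearing the high bit of the two EUC bytes gives the 7-bit pair,
# which on its own is indistinguishable from the ASCII letters "VP".
seven_bit_pair = bytes(b & 0x7F for b in euc[1:])

print(euc)
print(hz)
print(seven_bit_pair)            # b'VP'
```

Only the surrounding escape sequences tell a decoder whether "VP" means two
ASCII letters or one double-byte character.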

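In the 8-bit encoding, the matter is simpler: all lead bytes are in the
range 0x80 to 0xFF, which is unused by ASCII (ASCII is a 7-bit encoding,
limited to the range 0x00 to 0x7F). In EUC, both the lead byte and the trail
byte of a GB2312 character fall in the range 0xA1 to 0xFE.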
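
Because the high bit alone marks a lead byte, an 8-bit stream can be split
into characters without any state. The function below is a minimal sketch of
that idea; its name is hypothetical, and it does not validate that the bytes
are assigned GB2312 code points.

```python
def split_euc_cn(data):
    """Split 8-bit (EUC) GB2312 bytes into per-character byte strings.

    Hypothetical helper for illustration: a byte below 0x80 is a single
    ASCII character, anything else is a lead byte followed by a trail byte.
    """
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII: one byte, one character
            chars.append(data[i:i + 1])
            i += 1
        else:
            # Lead byte: the next byte is its trail byte
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            chars.append(data[i:i + 2])
            i += 2
    return chars


sample = "a\u4e2d<".encode("gb2312")   # "a", U+4E2D, "<"
print(split_euc_cn(sample))            # [b'a', b'\xd6\xd0', b'<']
```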

See the brief introduction on Roman Czyborra's site:

        http://czyborra.com/charsets/cjk.html

Especially the part about EUC, which is the most popular 8-bit encoding for
Far East character sets (including GB2312):

        http://czyborra.com/utf/#EUC

For more detailed information, read CJK.INF, a famous document by Ken Lunde:

        ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
        http://www.ora.com/people/authors/lunde/cjk_inf.html

For even more detailed information, consider the books by the same author,
both derived from his old CJK.INF:

        http://www.ora.com/catalog/ujip/
        http://www.oreilly.com/catalog/cjkvinfo/

Also have a look at Koichi Yasuoka's page:

        http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html

Especially the "U-" field in the Unicode-to-GB mapping table, which shows
the EUC-encoded GB value for each character:

        http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/Uni2GB.Z

Rgds.
_ Marco


