Tomas McGuinness wrote:
> Is the character set gb2312 encoded in a two octet scheme?
It is one of the so-called "double-byte character sets" (DBCS), but this
name is misleading: "multibyte character set" (MBCS) is a more accurate term.
> If so does it pad out its ascii characters to two octets
> e.g. the character is 0x3C in ascii so does it become
> 0x003C in gb2312?
No. ASCII characters are represented as a single byte. All other characters
are represented as pairs of bytes, called "lead byte" and "trail byte".
In 7-bit encoding, the same byte values serve both as single-byte characters
and as the halves of double-byte characters. Escape sequences in the text
determine which interpretation applies.
In 8-bit encoding, the matter is simpler: all lead bytes are in the range
0x80 to 0xFF, which is unused by ASCII (ASCII is a 7-bit encoding, limited
to the range 0x00 to 0x7F).
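To make the 8-bit rule concrete, here is a minimal Python sketch, assuming EUC-encoded GB2312 as provided by Python's built-in "gb2312" codec. The function name split_euc_gb2312 is made up for this example, and it does not validate trail bytes.

```python
def split_euc_gb2312(data: bytes) -> list:
    """Split EUC-encoded GB2312 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII range: the character is a single byte, never padded.
            chars.append(data[i:i + 1])
            i += 1
        else:
            # High bit set: this is a lead byte, and a trail byte must follow.
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            chars.append(data[i:i + 2])
            i += 2
    return chars


encoded = "<中>".encode("gb2312")
print(encoded.hex())              # 3cd6d03e
print(split_euc_gb2312(encoded))  # three characters: 1 byte, 2 bytes, 1 byte
```

Note that "<" stays the single byte 0x3C; it does not become 0x003C.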
See the brief introduction on Roman Czyborra's site:
http://czyborra.com/charsets/cjk.html
See especially the part about EUC, which is the most popular 8-bit encoding
for Far East character sets (including GB2312).
For more detailed information, read CJK.INF, a famous document by Ken Lunde:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
http://www.ora.com/people/authors/lunde/cjk_inf.html
For even more detailed information, consider the books by the same author,
both derived from his old CJK.INF:
http://www.ora.com/catalog/ujip/
http://www.oreilly.com/catalog/cjkvinfo/
Also have a look at Koichi Yasuoka's page:
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html
See especially the "U-" field in the Unicode-to-GB mapping table, which shows
the EUC-encoded GB values:
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/Uni2GB.Z
Rgds.
_ Marco