Fun with GBK & GB2312

From: Ken Krugler (ken@transpac.com)
Date: Fri Jan 04 2002 - 18:51:27 EST


Hi list,

We're having fun trying to resolve issues with GBK and GB2312
character sets, and are hoping that list members might offer some
input. What we're using for source info is the CP936.txt file from
the Microsoft section of vendor mapping tables on the Unicode web
site, and Ken Lunde's CJKV book.

1. Does anybody know of a table that explicitly describes every
character in the GBK character set? That would go a long way towards
resolving our issues.

2. A contributor to this list (Mike Brown) said that "GBK" is a
Sun-Java character set name, but I don't see it in the IANA registry.
Does anybody know if any software is actually tagging data with "GBK"?

3. In the Simplified Chinese font that we're using, 0xA2E3 (the GBK
code point) is a copyright symbol. This code point isn't in the
CP936.txt mapping table, but a contributor to this list
(kline_s@cup.hp.com) said that 0xA2E3 in GBK is the Euro (a
late-breaking extension?). Note that there is no copyright symbol at
all in the CP936.txt table (nothing maps to U+00A9), which seems odd.
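
Both checks in item 3 (is a given code point listed, and does anything
map to a given Unicode value) are easy to script. What follows is only
a sketch: it assumes each mapping line holds a hex code page value, a
hex Unicode value, and a '#' comment, and that lines without a Unicode
field (lead bytes, undefined bytes) can be skipped. The helper names
parse_mapping and sources_of are made up for illustration.

```python
def parse_mapping(lines):
    """Return {code page value: Unicode value} for lines that carry both fields."""
    table = {}
    for line in lines:
        # Drop the trailing '#' comment, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or an entry with no Unicode value
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def sources_of(table, unicode_value):
    """List every code page value that maps to the given Unicode value."""
    return sorted(code for code, uni in table.items() if uni == unicode_value)


# Tiny illustrative sample in the assumed layout, not the real file.
sample = [
    "# comment line",
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0x81\t\t#DBCS LEAD BYTE",
    "0xA1A1\t0x3000\t#IDEOGRAPHIC SPACE",
]
table = parse_mapping(sample)
print(0xA2E3 in table)            # is the code point listed at all?
print(sources_of(table, 0x00A9))  # does anything map to the copyright sign?
```

To run it against the real table, pass the opened file instead of the
sample list.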

4. The CP936.txt table has (21920 - 129 single-byte) = 21791
double-byte characters. GBK should have (717+6763+6080+8160+166) =
21886 double-byte characters (at least according to Ken Lunde's CJKV
Information Processing). Therefore, there appear to be (21886-21791)
= 95 GBK characters missing from the CP936.txt table.

On the other hand, the 94 half-width ASCII and 32 half-width Pinyin
characters defined by GB 6345.1-86 are probably in CP936, but just not
in CP936.txt (because they don't have corresponding round-trip Unicode
code points). Then there are the Euro and copyright characters, plus
two unmapped Pinyin characters (0xA8BC and 0xA8BF). Based on the
characters which we know are not in CP936.txt, there appear to be
(94+32+2+2-95) = 35 EXTRA characters in CP936.txt which are
unaccounted for (at least given Ken Lunde's GBK counts). These might
be Microsoft extensions, which would make CP936 a superset of GBK.
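
The bookkeeping in item 4 can be redone mechanically. This is a rough
sketch under the same assumed line layout (hex code page value, hex
Unicode value, '#' comment); count_entries is a hypothetical helper,
and the sample lines are illustrative rather than taken from the file.
The tally at the end uses only the figures quoted in the text above.

```python
def count_entries(lines):
    """Count mapped single-byte and double-byte entries in a mapping table."""
    single = double = 0
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # no Unicode value on this line, so it is not a mapping
        if int(fields[0], 16) <= 0xFF:
            single += 1
        else:
            double += 1
    return single, double


# Illustrative lines only; feed the real file to get the real counts.
sample = [
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0x81\t\t#DBCS LEAD BYTE",
    "0x8140\t0x4E02\t#CJK UNIFIED IDEOGRAPH",
    "0xA1A1\t0x3000\t#IDEOGRAPHIC SPACE",
]
print(count_entries(sample))

# Characters believed to be in CP936 but absent from CP936.txt.
known_unlisted = {
    "half-width ASCII": 94,
    "half-width Pinyin": 32,
    "Euro and copyright": 2,
    "unmapped Pinyin": 2,
}
shortfall = 95  # GBK characters apparently missing from CP936.txt
unaccounted = sum(known_unlisted.values()) - shortfall
print(unaccounted)
```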

5. The 94 half-width ASCII characters added to row 10 by GB 6345.1-86
(0xAAA1-0xAAFD) should be part of GBK according to Ken Lunde's
description on p.89, "GB 2312-80 base (with corrections and additions
specified in GB 6345.1-86)", but these characters don't appear in his
Table 3-32 on the same page. It seems they should be part of GBK/1,
since that's where Lunde put the other non-Hanzi from GB 12345-90
(which includes the GB 6345.1-86 additions).

He also gives specific encoding details on p.170, and his Table 4-37
there specifies that 0xAAA1-0xAFFE are user-defined regions within
GBK.

Unfortunately the CP936.txt file can't specify whether they're part
of CP936, as they appear to have no unique Unicode equivalents. The
first 93 of them do appear in the Simplified font we're using. So,
are these 94 half-width ASCII characters user-defined additions, or
are they indeed part of GBK?
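
For item 5, a small sketch of the range arithmetic may help. It
assumes a double-byte range such as 0xAAA1-0xAFFE is read as a
rectangular block (every lead byte from 0xAA to 0xAF combined with
every trail byte from 0xA1 to 0xFE), which is an assumption about how
the tables should be interpreted. The helper names cells and in_block
are made up.

```python
def cells(first, last):
    """Count code points from first to last within one row (same lead byte)."""
    if first >> 8 != last >> 8:
        raise ValueError("both ends must share a lead byte")
    return (last & 0xFF) - (first & 0xFF) + 1


def in_block(code, first, last):
    """True if code lies in the lead-byte x trail-byte rectangle first..last."""
    lead, trail = code >> 8, code & 0xFF
    return ((first >> 8) <= lead <= (last >> 8)
            and (first & 0xFF) <= trail <= (last & 0xFF))


print(cells(0xAAA1, 0xAAFD))  # cells in the range as quoted in item 5
print(cells(0xAAA1, 0xAAFE))  # cells in a full row ending at 0xFE
print(all(in_block(c, 0xAAA1, 0xAFFE) for c in range(0xAAA1, 0xAAFE)))
```

Under that reading, every code point from 0xAAA1 through 0xAAFD falls
inside the block that Table 4-37 marks as user-defined. The range as
quoted spans 93 cells; a full row of 94 would run through 0xAAFE.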

Thanks,

-- Ken

Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


