Re: Fun with GBK & GB2312

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Jan 04 2002 - 19:56:59 EST


We have published mapping data for Windows cp936, generated from the actual Windows 2000 converter API. It is probably more up to date and more complete than what is listed on the unicode.org site.
Of course, these tables also "only" show correspondences with Unicode, but
a) they also show unidirectional mappings, unlike the unicode.org tables
b) many modern systems (Windows, MacOS, Java, etc.) always process text in Unicode, so anything that has no mapping to Unicode does not get processed (and you may not need to worry about it)

The GBK table from that data is available in the Unicode TR 22 XML format at http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/windows-936-2000.xml?content-type=text/plain
and in the ICU-specific .ucm format at http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/ucm/windows-936-2000.ucm?content-type=text/plain
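To look at roundtrip versus unidirectional mappings in the .ucm file, a small script can split the entries by direction. The following is only a sketch: it assumes mapping lines of the form `<UXXXX> \xHH... |N`, and it assumes the meaning of the trailing flag as noted in the comments, so please check both against the ICU documentation. The two sample lines are built from the mappings mentioned in this message; their flag values are assumed for illustration, and the function names are made up.

```python
import re

# One mapping line, e.g.  <U20AC> \x80 |0   (assumed line shape)
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)

# Assumed meaning of the trailing flag (verify against the ICU docs):
#   0 = roundtrip, 1 = Unicode -> codepage only,
#   2 = substitution mapping, 3 = codepage -> Unicode only


def parse_ucm(text):
    """Yield (code_point, codepage_bytes, flag) for each mapping line."""
    for line in text.splitlines():
        m = UCM_LINE.match(line.strip())
        if m:
            code_point = int(m.group(1), 16)
            raw = bytes.fromhex(m.group(2).replace("\\x", ""))
            yield code_point, raw, int(m.group(3))


def build_tables(entries):
    """Split entries into codepage->Unicode and Unicode->codepage dicts."""
    to_unicode, from_unicode = {}, {}
    for code_point, raw, flag in entries:
        if flag in (0, 3):
            to_unicode[raw] = code_point
        if flag in (0, 1):
            from_unicode[code_point] = raw
    return to_unicode, from_unicode


# Illustrative sample only; the flag values here are assumed.
SAMPLE = r"""
<U20AC> \x80 |0
<UE76C> \xA2\xE3 |0
"""

to_unicode, from_unicode = build_tables(parse_ucm(SAMPLE))
for raw, code_point in sorted(to_unicode.items()):
    print(f"0x{raw.hex()} -> U+{code_point:04X}")
```

An entry that appears in only one of the two dictionaries is a unidirectional mapping.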

The main page for our repository is at http://oss.software.ibm.com/icu/charset/

As for your specific questions:

1. You can use the descriptions and properties of the equivalent Unicode characters according to the mapping, except for anything that maps to private-use code points.

2. I don't know about actual tagging. The IANA list is at http://www.iana.org/assignments/character-sets
There is currently no registered name "GBK" in that list.

3. The Windows mappings show the Euro sign U+20AC at GBK 0x80. There is no mapping for the copyright sign U+00A9.
GBK 0xa2e3 is mapped to the private-use code point U+E76C.

Note that GBK is superseded by GB 18030. See the mapping table at http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/gb-18030-2000.xml?content-type=text/plain
There, U+20AC is mapped to GB 18030 0xa2e3.
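If you have Python at hand, the GB 18030 side can be spot-checked in a few lines. This assumes that Python's built-in "gb18030" codec follows the same table as the file above, which is an assumption about that codec and not something the ICU data guarantees. It says nothing about Windows cp936, whose converter may differ from Python's "gbk" codec.

```python
# Spot check: encode the Euro sign with Python's gb18030 codec
# and decode it again. The codec name is Python's, not ICU's.
euro = "\u20ac"

encoded = euro.encode("gb18030")
decoded = encoded.decode("gb18030")

print("U+20AC -> 0x" + encoded.hex())
print("roundtrips:", decoded == euro)
```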

For further questions about what is mapped where, please check the links above.

markus