GB 18030 is a new Chinese
codepage standard that extends GB 2312-1980 and GBK (which itself is an
extension of GB 2312-1980).
It is a multi-byte encoding that uses
1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same
assignments as in GBK.
There are about 1.6 million valid byte sequences.
The lead byte alone does not show whether a multi-byte sequence is 2 or 4 bytes
long; the second byte must be examined as well.
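
The size and length rules above can be sketched in a few lines of Python. This
is a minimal illustration, not a validated decoder: the byte ranges are the ones
commonly documented for GB 18030 and are stated here as assumptions, and the
names `count_valid_sequences` and `sequence_length` are hypothetical helpers
introduced only for this example.

```python
# Byte ranges as commonly documented for GB 18030 (assumptions, not quoted
# from the text above).
SINGLE = range(0x00, 0x80)                              # 1-byte codes
LEAD = range(0x81, 0xFF)                                # 1st (and 3rd) byte of multi-byte codes
TRAIL_2 = [b for b in range(0x40, 0xFF) if b != 0x7F]   # 2nd byte of a 2-byte code
DIGIT = range(0x30, 0x3A)                               # 2nd and 4th byte of a 4-byte code


def count_valid_sequences() -> int:
    """Count all structurally valid 1-, 2-, and 4-byte sequences."""
    one = len(SINGLE)                       # 128
    two = len(LEAD) * len(TRAIL_2)          # 126 * 190
    four = (len(LEAD) * len(DIGIT)) ** 2    # (126 * 10) * (126 * 10)
    return one + two + four


def sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4 for the sequence starting at data[pos].

    Only the first two bytes are inspected; bytes 3 and 4 of a 4-byte
    sequence are not validated here.
    """
    lead = data[pos]
    if lead in SINGLE:
        return 1
    if lead not in LEAD:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence")
    second = data[pos + 1]
    if second in DIGIT:       # a digit byte in second position marks a 4-byte code
        return 4
    if second in TRAIL_2:     # otherwise a valid trail byte marks a 2-byte code
        return 2
    raise ValueError(f"invalid second byte 0x{second:02X}")


print(count_valid_sequences())   # 1611668, i.e. about 1.6 million
```

The key design point is that the 2-byte trail range and the digit range do not
overlap, so the second byte is enough to tell a 2-byte code from a 4-byte code.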
The Chinese government has
mandated that all applications released on or after September 1, 2001 support
GB 18030.
The specification refers directly
to a mapping of GB 18030 codes to and from Unicode to define most character
assignments. Some characters that GBK mapped to the Private Use Area (PUA)
under Unicode 2.1 are assigned in Unicode 3.0, and their GB 18030 mappings use
only the Unicode 3.0 code points.
In addition, GB 18030 defines
roundtrip mappings for all 1.1 million Unicode code points, including unassigned
code points and noncharacters but excluding single surrogates. This makes
GB 18030 functionally very similar to a UTF (Unicode Transformation Format).
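
The roundtrip property can be spot-checked with Python's built-in `gb18030`
codec. This is only a sketch: it assumes that codec is a reasonable stand-in for
the mapping table described here, and the helper names `roundtrips` and
`encoded_length` are hypothetical. It samples a handful of code points rather
than all 1.1 million.

```python
def roundtrips(code_point: int) -> bool:
    """Encode one code point to GB 18030 and check it decodes back unchanged."""
    ch = chr(code_point)
    return ch.encode("gb18030").decode("gb18030") == ch


def encoded_length(code_point: int) -> int:
    """Number of GB 18030 bytes used for one code point."""
    return len(chr(code_point).encode("gb18030"))


# ASCII, a common Han character, a C1 control, a BMP noncharacter,
# and the first and last supplementary code points.
samples = [0x0041, 0x4E2D, 0x0080, 0xFFFF, 0x10000, 0x10FFFF]

for cp in samples:
    print(f"U+{cp:04X}: {encoded_length(cp)} byte(s), roundtrip={roundtrips(cp)}")
```

Code points outside GBK's repertoire, including everything above U+FFFF, come
out as 4-byte sequences, which is what lets the encoding cover the whole Unicode
code space in the way a UTF does.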
A Unicode mapping table for GB
18030 in XML format
is available from the ICU
website (.xml and .zip) and from Mark Davis' website (.zip only).
An article with more details and
with implementation suggestions is available on the developerWorks
site.