GB 18030 is a new Chinese
codepage standard that extends GB 2312-1980 and GBK (which itself is an
extension of GB 2312-1980).
It is a multi-byte encoding using
1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same
assignments as in GBK.
There are about 1.6 million valid byte sequences.
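
The figure of about 1.6 million can be checked with a little arithmetic. The sketch below assumes the byte ranges commonly documented for GB 18030, which this text does not spell out: single bytes 0x00..0x7F; two-byte codes with a lead byte in 0x81..0xFE and a trail byte in 0x40..0x7E or 0x80..0xFE; four-byte codes with bytes 1 and 3 in 0x81..0xFE and bytes 2 and 4 in 0x30..0x39.

```python
# Count the structurally valid GB 18030 byte sequences.
# The byte ranges are an assumption taken from common descriptions of the
# standard, not from this text.
SINGLE = 0x7F - 0x00 + 1                           # 128 single-byte codes
LEAD = 0xFE - 0x81 + 1                             # 126 possible lead bytes
TRAIL_2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)    # 63 + 127 = 190 trail bytes
DIGIT = 0x39 - 0x30 + 1                            # 10 values for bytes 2 and 4

two_byte = LEAD * TRAIL_2                  # 23,940
four_byte = LEAD * DIGIT * LEAD * DIGIT    # 1,587,600
total = SINGLE + two_byte + four_byte      # 1,611,668

print(f"{total:,} valid byte sequences")
```

Under these assumptions the four-byte codes account for almost the entire total.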
It is not possible to tell whether a multi-byte sequence is 2 or 4 bytes long
from the lead byte alone; the second byte must be examined as well.
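
As a minimal sketch of that two-byte lookahead, the hypothetical helper below classifies the sequence that starts at a given offset. It assumes the commonly documented byte ranges (a second byte in 0x30..0x39 signals a four-byte code, while 0x40..0x7E or 0x80..0xFE signals a two-byte code), and it does not validate the third and fourth bytes.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4: the length of the GB 18030 sequence at data[pos].

    Hypothetical helper. The byte ranges are assumptions from common
    descriptions of the standard. Only the first two bytes are inspected.
    """
    lead = data[pos]
    if lead <= 0x7F:
        return 1  # single-byte code; no lookahead needed
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence: second byte missing")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4  # digit-range second byte marks a four-byte code
    if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
        return 2
    raise ValueError(f"invalid second byte 0x{second:02X}")


for sample in (b"A", b"\xd6\xd0", b"\x90\x30\x81\x30"):
    print(sample.hex(), gb18030_sequence_length(sample))
```

The same lead byte (for example 0x90) can start either a two-byte or a four-byte code, which is why the function cannot return before reading the second byte.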
The Chinese Government has
mandated that all applications released on or after 2001-Sep-01 must support GB
18030.
The specification refers directly
to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most
character assignments. Some characters that GBK mapped to the Private Use Area
(PUA) under Unicode 2.1 are assigned regular code points in Unicode 3.0, and
GB 18030 maps them only to those Unicode 3.0 code points.
In addition, GB 18030 defines
roundtrip mappings for all 1.1 million Unicode code points including unassigned
and non-character ones, but excluding single surrogates. This makes GB 18030
functionally very similar to a Unicode Transformation Format (UTF).
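
The roundtrip property can be spot-checked with the `gb18030` codec that ships with Python. This is only a sketch: it tests a handful of code points, and it assumes the Python codec follows the standard's mapping for them. The helper name `roundtrips` is made up for illustration.

```python
def roundtrips(text: str) -> bool:
    """Return True if text survives Unicode -> GB 18030 -> Unicode unchanged."""
    return text.encode("gb18030").decode("gb18030") == text


# ASCII, a common Han character, and the first and last supplementary
# code points (U+10000 and U+10FFFF).
samples = ["A", "\u4e2d", "\U00010000", "\U0010FFFF"]

for ch in samples:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes) "
          f"roundtrip={roundtrips(ch)}")
```

Like UTF-8, the encoding uses a variable number of bytes per code point; the supplementary code points in this sample take four bytes each.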
China has confirmed in
discussions with major IT companies that it is sufficient to be able to
According to current
understanding, this means that processes can use ISO 10646/Unicode internally
if they also provide conversion between GB 18030 and ISO 10646/Unicode. This is
possible because of the definition of GB 18030 with a mapping table to ISO
10646/Unicode.
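
One way to picture this arrangement is a process that converts at its boundaries and works on Unicode text in between. The sketch below uses Python's built-in `gb18030` codec; the function `truncate_gb18030` is a hypothetical example of internal processing, chosen because truncating the raw bytes directly could split a multi-byte sequence.

```python
def truncate_gb18030(payload: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars characters of a GB 18030 byte string.

    Hypothetical example: decode at the input boundary, operate on
    Unicode code points internally, and encode again at the output boundary.
    """
    text = payload.decode("gb18030")          # GB 18030 -> Unicode
    shortened = text[:max_chars]              # slicing counts code points
    return shortened.encode("gb18030")        # Unicode -> GB 18030


original = "\u4e2d\u6587A".encode("gb18030")  # two Han characters plus "A"
print(truncate_gb18030(original, 2).hex())
```

The slice operates on whole characters regardless of whether each one occupies 1, 2, or 4 bytes in the external form.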
A Unicode mapping table for GB 18030 in XML format is available from the ICU
website (as .xml and .zip files).
Both GB 18030 and ISO 10646
define sets of "user" codes. The User-Defined Areas in GB 18030 do not
correspond 1:1 to Private-Use Areas in Unicode.
Some assigned characters are
mapped from 2-byte parts of GBK and GB 18030 to the Private-Use Area in the BMP
(U+E000..U+F8FF). A small number of these mappings changed between GBK
and GB 18030: GB 18030 maps those codes instead to characters that were
introduced in Unicode 3.0.
The User-Defined Areas in the
2-byte parts of GBK and GB 18030 are mapped to other parts of the Private-Use
Area in the BMP. Note that all single-byte and 2-byte codes have
defined mappings — they must be mapped according to the standard table.
Similarly, GB 18030 maps all
remaining Unicode Private-Use code points to four-byte GB 18030 codes.
GB 18030 also provides a
User-Defined Area with 25,200 four-byte codes, without specified mappings.
Normally, they need to be treated as unassigned codes.
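
The size of this area follows from the four-byte structure. The sketch below assumes that the four-byte User-Defined Area spans 0xFD308130..0xFE39FE39, that is, lead bytes 0xFD and 0xFE. That range comes from common descriptions of the standard and is not stated in this text, and the helper name is made up for illustration.

```python
# Assumed range of the four-byte User-Defined Area: lead byte 0xFD or 0xFE,
# with the usual four-byte ranges for the other three bytes.
UDA_LEAD_FIRST, UDA_LEAD_LAST = 0xFD, 0xFE

# 2 lead bytes * 10 second bytes * 126 third bytes * 10 fourth bytes
UDA_SIZE = (UDA_LEAD_LAST - UDA_LEAD_FIRST + 1) * 10 * 126 * 10


def is_four_byte_user_defined(code: bytes) -> bool:
    """Return True if code is a four-byte sequence in the assumed UDA range."""
    if len(code) != 4:
        return False
    b1, b2, b3, b4 = code
    return (UDA_LEAD_FIRST <= b1 <= UDA_LEAD_LAST
            and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE
            and 0x30 <= b4 <= 0x39)


print(UDA_SIZE)
print(is_four_byte_user_defined(b"\xfd\x30\x81\x30"))
```

A converter could use such a check to report these codes as unassigned unless the communicating processes have agreed on a private mapping.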
There are some 460,000 four-byte
codes that are reserved for future use and must be treated as unassigned codes
at this point.
As noted above, all Private-Use
code points are mapped to GB 18030 codes. This means that they can be exchanged
via GB 18030. In addition to the usual agreement about Private-Use characters
between processes exchanging them, one must take the GB 18030 assignments into
account when exchanging text in GB 18030.
GB 18030 assigns characters to
some of the codes corresponding to Private-Use BMP code points. All other such
codes are either User-Defined in GB 18030 or not specified other than through
the mapping correspondence.
An article with more details and implementation suggestions is available on
the developerWorks site.