Disclaimer: this is only my interpretation of GB 18030. Use at your own 
risk.
GB 18030 can be three different things, depending on how you interpret it:
   1. it is a coded character set, defined by the glyph pictures in the
      published standard. That collection does not include characters
      for the so-called minority scripts (e.g. Mongolian).
   2. it is a coded character set made of 1 plus the minority scripts
      (you see that by reading the (a?) document that describes the
      certification testing).
   3. it is roughly a UTF, the most notable deviations being that it can
      represent a bit more than 0x0 - 0x10ffff and allows the surrogate
      code points.
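
To make the "a bit more than 0x0 - 0x10ffff" remark in interpretation 3
concrete, here is a minimal Python sketch of the four-byte arithmetic for
code points above the BMP. It assumes the usual four-byte layout (bytes 1
and 3 in 0x81-0xFE, bytes 2 and 4 in 0x30-0x39) with U+10000 placed at
90 30 81 30. The helper name encode_above_bmp is mine, not something from
the standard, and the sketch ignores the BMP and the surrogates entirely.

```python
# Sketch only: linear four-byte GB 18030 arithmetic for U+10000 and above.
# Assumed layout: byte 1 and byte 3 take 126 values each (0x81..0xFE),
# byte 2 and byte 4 take 10 values each (0x30..0x39).

SUPPLEMENTARY_BASE = 0x10000   # first code point above the BMP
FIRST_LEAD_BYTE = 0x90         # assumed: U+10000 is encoded as 90 30 81 30
LAST_LEAD_BYTE = 0xFE


def encode_above_bmp(code_point: int) -> bytes:
    """Hypothetical helper: four-byte form of a code point >= U+10000."""
    if code_point < SUPPLEMENTARY_BASE:
        raise ValueError("this sketch only covers code points above the BMP")
    offset = code_point - SUPPLEMENTARY_BASE
    # Peel off the four "digits" of the mixed-radix (126, 10, 126, 10) number.
    lead, rest = divmod(offset, 10 * 126 * 10)
    second, rest = divmod(rest, 126 * 10)
    third, fourth = divmod(rest, 10)
    if FIRST_LEAD_BYTE + lead > LAST_LEAD_BYTE:
        raise ValueError("beyond the four-byte space")
    return bytes([FIRST_LEAD_BYTE + lead, 0x30 + second,
                  0x81 + third, 0x30 + fourth])


# Positions available with lead bytes 0x90 through 0xFE.
capacity = (LAST_LEAD_BYTE - FIRST_LEAD_BYTE + 1) * 10 * 126 * 10
# Code points Unicode defines above the BMP (16 planes).
needed = 0x10FFFF - SUPPLEMENTARY_BASE + 1

print(encode_above_bmp(0x10000).hex())    # first code point above the BMP
print(encode_above_bmp(0x10FFFF).hex())   # last Unicode code point
print(capacity - needed)                  # positions left over past U+10FFFF
```

The last number printed is the headroom past U+10FFFF under these
assumptions. Python's built-in gb18030 codec should produce the same bytes
for the two sample code points, which is an easy way to cross-check the
arithmetic.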
Under interpretations 1 and 2, you also get a mapping between those 
collections and Unicode. Except for 25 characters, they are all mapped 
to non-PUA BMP scalar values. The remaining 25 are mapped to PUA BMP 
scalar values. Some of those 25 characters are believed to be in the 
Unicode repertoire (e.g. GB+FE51 is mapped to U+E816, and is believed to 
be U+20087).
The collection/encoding-form duality is, in my opinion, the most painful 
aspect. In particular, it makes publishing a new mapping (e.g. to a 
different version of Unicode, as HKSCS did to take newly encoded Unicode 
characters into account) very problematic.
By the way, here are a couple of things that may be of interest. HK+ 
means HKSCS code point; GB+ means GB 18030 code point:
   1. PUA confusion:
      HK+9571 maps to U+2721B under the 3.2 mapping (and is an ideograph)
      HK+9571 maps to U+E78D under the 3.0 mapping
      GB+A6D9 maps to U+E78D.
      GB+A6D9 is definitely not an ideograph.
   2. PUA differentiation:
      HK+8BFA maps to U+20087 under the 3.2 mapping
      HK+8BFA maps to U+F572 under the 3.0 mapping
      GB+FE51 maps to U+E816
      GB+FE51 is believed to be U+20087
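
The two situations above can be checked mechanically. The following is a
minimal Python sketch, using only the six mappings quoted in this message;
the table layout, the IDENTITY table and the helper names are illustrative
assumptions of mine, and "GB+FE51 is U+20087" is a belief, not a published
mapping.

```python
# Sketch only: data copied from the examples in this message.
from collections import defaultdict

# (source code point, mapping label) -> Unicode code point
MAPPINGS = {
    ("HK+9571", "HKSCS 3.2"): 0x2721B,
    ("HK+9571", "HKSCS 3.0"): 0xE78D,
    ("GB+A6D9", "GB 18030"): 0xE78D,
    ("HK+8BFA", "HKSCS 3.2"): 0x20087,
    ("HK+8BFA", "HKSCS 3.0"): 0xF572,
    ("GB+FE51", "GB 18030"): 0xE816,
}

# Non-PUA identity of each character: HK+8BFA from its 3.2 mapping,
# GB+FE51 from the "believed to be" statement (an assumption).
IDENTITY = {"HK+8BFA": 0x20087, "GB+FE51": 0x20087}


def is_bmp_pua(code_point: int) -> bool:
    """True for the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= code_point <= 0xF8FF


def pua_confusion(mappings):
    """PUA code points that different source characters are mapped to."""
    by_target = defaultdict(set)
    for (source, _label), target in mappings.items():
        if is_bmp_pua(target):
            by_target[target].add(source)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > 1}


def pua_differentiation(mappings, identity):
    """Characters with one identity that are spread over several PUA values."""
    by_identity = defaultdict(set)
    for (source, _label), target in mappings.items():
        if is_bmp_pua(target) and source in identity:
            by_identity[identity[source]].add(target)
    return {i: sorted(t) for i, t in by_identity.items() if len(t) > 1}


for target, sources in pua_confusion(MAPPINGS).items():
    print(f"U+{target:04X} is shared by {', '.join(sources)}")
for ident, targets in pua_differentiation(MAPPINGS, IDENTITY).items():
    pua = ", ".join(f"U+{t:04X}" for t in targets)
    print(f"U+{ident:04X} shows up as {pua}")
```

The first loop reports the confusion case (one PUA value, two unrelated
characters); the second reports the differentiation case (one character,
two PUA values).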
Eric.