Conversion Between HKSCS-2001 and Unicode -- Going Beyond Basic Multilingual Plane
Linus Toshihiro Tanaka - Oracle Corporation
Two written forms of Chinese are widely recognized in the computer industry: Simplified Chinese, used primarily in Mainland China, and Traditional Chinese, used primarily in Taiwan, Hong Kong, and Macau. Until the mid-1990s, Traditional Chinese character sets were mostly those of Taiwan, but Hong Kong needs a number of characters that those character sets do not include. Many of the characters that Hong Kong needs were not included in Unicode until recently (Unicode 3.1 includes almost all of them).

To handle those characters on computers, the Hong Kong government (now the Hong Kong S.A.R. government) defined the Government Common Character Set (GCCS) in 1995, based on the Big-5 encoded character set. GCCS added around 3,000 characters to Big-5. In September 1999, the Hong Kong S.A.R. government defined the Hong Kong Supplementary Character Set (HKSCS) as the successor to GCCS. Unlike GCCS, HKSCS defined precise mappings between HKSCS and Unicode 2.1, and between HKSCS and Unicode 3.0. Oracle implemented HKSCS in 2000 with the Unicode 3.0 mapping. In December 2001, the Hong Kong S.A.R. government defined the Hong Kong Supplementary Character Set - 2001 (HKSCS-2001), with precise mappings between HKSCS-2001 and Unicode 2.1, Unicode 3.0, and Unicode 3.1.

Among the HKSCS-2001 characters, 1,686 are not included in Unicode 3.0, so the Private Use Area (PUA) has to be used for them when using Unicode 3.0. This number drops dramatically, to only 35 characters, if we use Unicode 3.1. This, however, means that we need to go beyond the Basic Multilingual Plane (BMP) of Unicode and handle 1,651 supplementary characters (represented by surrogate pairs in UTF-16). In this paper, I list various issues that arise when implementing HKSCS-2001, explain the current implementation in Oracle, and discuss the future.
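To make concrete what going beyond the BMP involves, here is a minimal Python sketch of how a supplementary code point is split into a UTF-16 surrogate pair. It is an illustration only and is not taken from the Oracle implementation; the function names are hypothetical, and U+20000 (the first code point of Plane 2) is used simply as a sample supplementary code point, not as a claim about any particular HKSCS-2001 mapping.

```python
from typing import Tuple


def is_supplementary(code_point: int) -> bool:
    """True if the code point lies outside the BMP (U+10000..U+10FFFF)."""
    return 0x10000 <= code_point <= 0x10FFFF


def is_bmp_private_use(code_point: int) -> bool:
    """True if the code point lies in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= code_point <= 0xF8FF


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not is_supplementary(code_point):
        raise ValueError("code point U+%04X is not supplementary" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# Sample supplementary code point (illustrative, not a specific HKSCS mapping).
sample = 0x20000
high, low = to_surrogate_pair(sample)
print("U+%05X -> %04X %04X" % (sample, high, low))
```

A converter that targets Unicode 3.0 can stay within 16-bit code units by mapping such characters to the PUA, whereas a Unicode 3.1 mapping has to emit and consume two code units per supplementary character whenever the data is stored as UTF-16.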
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
to info@global-conference.com.
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. 12 December 2002, Webmaster