The Further Pitfalls and Complexities of Chinese to Chinese Conversion
Thomas R. Emerson - Basis Technology Corporation & Jack Halpern - CJK Dictionary Publishing Society
It is well understood that Unicode provides an effective pivot when transcoding between legacy CJK encodings. However, converting between Chinese encodings and character sets (e.g., from GB2312 to Big Five) requires more than mapping code points: because the correspondence between GB2312 and Big Five is one-to-many, a simple mapping table is not sufficient. At IUC 14,[1] Jack Halpern and Jouni Kerman presented an in-depth analysis of the difficulties in accurately converting between Simplified Chinese (SC), used in the People's Republic of China and Singapore, and Traditional Chinese (TC), used in Taiwan, Hong Kong, and Macau. They discussed four progressively more accurate levels of conversion and described the lexical data necessary to achieve each level. This paper presents a new collection of pitfalls and complexities that we have encountered over the last eighteen months, including:
We discuss various approaches to addressing these problems and examine in detail the importance of Chinese to Chinese conversion in effective information retrieval. In so doing, we argue that this problem can be viewed as a machine translation task as well as a transcoding task. We also contrast our approach with that presented by Liu et al. at IUC 7.[2]
References
When the world wants to talk, it speaks Unicode
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
to info@global-conference.com.
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. 18 Jun 2000, Webmaster