RE: GBK Traditional to Simplified mapping table

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Jan 11 2002 - 04:57:08 EST


Doug Ewell wrote:
> [...] Far from being a simple operation like Latin
> case mapping (to which it was compared), TC/SC
> requires potentially complex analysis of the text
> being converted.
>
> This is the opinion of many experts within, as well as
> outside, the Unicode standardization effort, and it is
> the reason you will not find a Unicode TC/SC mapping
> table.

Actually, such a table can easily be extracted from Unicode's Unihan
database (a huge file: <http://www.unicode.org/Public/UNIDATA/Unihan.txt>).

The relevant field for TC->SC is <kSimplifiedVariant>, and the relevant
field for SC->TC is <kTraditionalVariant>.

As each field is on a separate line, the information can be extracted quite
simply, such as with the DOS command:

        find "kSimplifiedVariant" Unihan.txt > kSimplifiedVariant.txt

As Doug explained, though, this 1-to-1 data is NOT suitable for a
full-fledged conversion. Still, the data may be a good starting point for
more complex approaches.
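
For anyone who would rather load the extracted lines into a lookup table than
grep them, here is a minimal Python sketch. It assumes each data line has the
layout "U+XXXX<TAB>fieldname<TAB>U+YYYY ..." (possibly with several values);
check that against the copy of Unihan.txt you actually have. The function names
load_variants and convert_naively are mine, not part of any Unicode tool, and
the conversion is exactly the naive per-character lookup that Doug warned about.

```python
import re

# Assumed layout of a data line: "U+XXXX<TAB>fieldname<TAB>U+YYYY [U+ZZZZ ...]"
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def load_variants(lines, field="kSimplifiedVariant"):
    """Return {character: [variant characters]} for one Unihan field."""
    table = {}
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue
        source = CODEPOINT.fullmatch(parts[0])
        if source is None:
            continue
        # A value may list more than one code point; keep them all, in order.
        targets = [chr(int(h, 16)) for h in CODEPOINT.findall(parts[2])]
        if targets:
            table[chr(int(source.group(1), 16))] = targets
    return table


def convert_naively(text, table):
    """Replace each character by its first listed variant, ignoring context."""
    return "".join(table.get(ch, [ch])[0] for ch in text)
```

Typical use would be something like load_variants(open("Unihan.txt",
encoding="utf-8")) followed by convert_naively(text, table). Because it always
takes the first listed variant, it cannot choose between alternatives based on
the surrounding text, which is where the "more complex approaches" come in.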

It can also prove useful for implementing things such as a user-friendly
search function that would match any variant of the sought characters. In
this respect, Unihan contains two more fields that may be useful:
<kSemanticVariant> and <kSpecializedSemanticVariant>.
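
One possible way to build such a search, sketched in Python: put every
character and its listed variants into one group, fold each character to a
single representative of its group, and compare the folded strings. The line
layout ("U+XXXX<TAB>fieldname<TAB>values") is again an assumption, and the names
build_folding, fold and variant_find are hypothetical helpers of mine. Note
that merging groups transitively may match more characters than a user
expects, so treat this as a starting point only.

```python
import re

CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")
VARIANT_FIELDS = frozenset({
    "kSimplifiedVariant",
    "kTraditionalVariant",
    "kSemanticVariant",
    "kSpecializedSemanticVariant",
})


def build_folding(lines, fields=VARIANT_FIELDS):
    """Return {character: representative} built from the chosen variant fields."""
    parent = {}

    def find(ch):
        # Union-find lookup with path halving.
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]
            ch = parent[ch]
        return ch

    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[1] not in fields:
            continue
        # First code point is the key character, the rest are its variants.
        chars = [chr(int(h, 16))
                 for h in CODEPOINT.findall(parts[0] + " " + parts[2])]
        for other in chars[1:]:
            a, b = find(chars[0]), find(other)
            if a != b:
                # Use the lowest code point as the group's representative.
                parent[max(a, b)] = min(a, b)
    return {ch: find(ch) for ch in list(parent)}


def fold(text, folding):
    """Map every character to its group's representative."""
    return "".join(folding.get(ch, ch) for ch in text)


def variant_find(haystack, needle, folding):
    """Like str.find, but treating all variants of a character as equal."""
    return fold(haystack, folding).find(fold(needle, folding))
```

Since folding replaces characters one for one, the index returned by
variant_find refers to the same position in the original, unfolded text.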

_ Marco


