RE: GBK Traditional to Simplified mapping table

From: Tom Emerson (tree@basistech.com)
Date: Fri Jan 11 2002 - 15:36:31 EST


Marco Cimarosti writes:
> Doug Ewell wrote:
> > This is the opinion of many experts within, as well as
> > outside, the Unicode standardization effort, and it is
> > the reason you will not find a Unicode TC/SC mapping
> > table.
>
> Actually, such a table can easily be extracted from Unicode's UniHan
> database (a huge file: <http://www.unicode.org/Public/UNIDATA/Unihan.txt>).
>
> The relevant information for TC->SC is field <kSimplifiedVariant>, and for
> SC->TC is field <kTraditionalVariant>.

Be careful doing this. The kTraditionalVariant field was created based
on the GB/T 12345-1990 mappings, not Big Five. The 12345 mappings are
sometimes different from those used in Taiwan. For example, U+8C25
(GB0 58-54) maps to U+8B1A in GB/T 12345-1990 but U+8AE1 in Big Five.
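A minimal sketch of the extraction, with room for a locale-specific
override such as the one above. It assumes the tab-separated layout
"U+XXXX<TAB>field<TAB>value"; the sample row is illustrative and built
from the example above rather than copied from the file, and
parse_variants, traditional_for and TAIWAN_OVERRIDES are hypothetical
names.

```python
import re

# Matches a code point written as U+XXXX (4 to 6 hex digits).
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def parse_variants(lines, field):
    """Collect {source char: [variant chars]} for one Unihan field.

    Assumes each data line is 'U+XXXX<TAB>field<TAB>value' and that
    comment lines start with '#'.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue
        source = CODEPOINT.fullmatch(parts[0])
        if source is None:
            continue
        targets = [chr(int(h, 16)) for h in CODEPOINT.findall(parts[2])]
        if targets:
            table[chr(int(source.group(1), 16))] = targets
    return table

# Illustrative row only: the GB/T 12345-1990 mapping mentioned above.
SAMPLE = ["# sample data", "U+8C25\tkTraditionalVariant\tU+8B1A"]
sc_to_tc = parse_variants(SAMPLE, "kTraditionalVariant")

# Hypothetical override table for text destined for Taiwan (Big Five usage).
TAIWAN_OVERRIDES = {"\u8c25": ["\u8ae1"]}

def traditional_for(ch, overrides=None):
    """Return candidate traditional forms, preferring locale overrides."""
    if overrides and ch in overrides:
        return overrides[ch]
    return sc_to_tc.get(ch, [ch])
```

With real data, pass the open file object in place of SAMPLE. Keeping
the overrides in a separate table leaves the extracted data untouched
and makes the locale-dependent choices explicit.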

> It can also prove useful for implementing things such as a user-friendly
> search function, that would match any variant of the sought characters. In
> this respect, UniHan contains two more fields that may be useful:
> <kSemanticVariant>, <kSpecializedSemanticVariant>.

There was a paper presented at the 2nd Chinese Language Processing
Workshop during ACL 2000 that took advantage of this for doing
searches not only within Chinese documents, but also on Japanese
documents.
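One way to sketch variant-insensitive matching is to fold every
character to a representative of its variant class before comparing.
This is only an illustration and not the method of the paper mentioned
above; the pair list would come from the variant fields, and
build_folding, fold and variant_find are hypothetical names.

```python
def build_folding(pairs):
    """Map each character to the lowest code point in its variant class.

    'pairs' is an iterable of (char, variant) tuples. Classes are merged
    transitively with a small union-find structure.
    """
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            parent[hi] = lo  # the smaller code point stays the root
    return {ch: find(ch) for ch in list(parent)}

def fold(text, folding):
    """Replace each character with its class representative."""
    return "".join(folding.get(ch, ch) for ch in text)

def variant_find(text, query, folding):
    """Index of the first variant-insensitive match, or -1 if none."""
    return fold(text, folding).find(fold(query, folding))

# Pairs built from the example earlier in this message: U+8C25 with both
# of its traditional forms, U+8B1A and U+8AE1.
PAIRS = [("\u8c25", "\u8b1a"), ("\u8c25", "\u8ae1")]
FOLDING = build_folding(PAIRS)
```

Folding is one character to one character, so match offsets in the
folded text line up with the original. It also over-matches whenever
two characters are variants in only some of their senses, which is one
of the issues a real search function has to deal with.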

My presentation from IUC 20 on Pan-China search talks about the
multitude of issues in doing this. The paper that Jack Halpern
and I presented at IUC 17 (I think) also describes some of these
issues.

Correctly converting between SC and TC is a non-trivial undertaking,
one that cannot be solved through lookup tables alone.
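A small illustration of why: simplified U+53D1 corresponds to both
U+767C (as in "develop") and U+9AEE ("hair"), so a per-character table
can only offer candidates. The sketch below adds a phrase table that is
tried first, longest match first. Both tables are tiny hypothetical
samples, and a real converter needs far more than this (segmentation,
regional vocabulary, and so on).

```python
# Per-character candidates; the first entry is used as the default guess.
CHAR_TABLE = {
    "\u53d1": ["\u767c", "\u9aee"],  # U+53D1 -> U+767C or U+9AEE
    "\u5934": ["\u982d"],            # U+5934 -> U+982D
}

# Phrase-level entries that settle ambiguous characters by context.
PHRASES = {
    "\u5934\u53d1": "\u982d\u9aee",  # "hair": U+53D1 must become U+9AEE here
}

def convert(text):
    """Convert SC to TC: phrase table first, then per-character default."""
    out = []
    i = 0
    max_len = max(map(len, PHRASES), default=1)
    while i < len(text):
        # Try the longest phrase that fits, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched: fall back to the first candidate.
            ch = text[i]
            out.append(CHAR_TABLE.get(ch, [ch])[0])
            i += 1
    return "".join(out)
```

Without the phrase entry, the word for "hair" would come out with
U+767C, which is wrong; the table alone has no way to know.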

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


