Doug Ewell wrote:
> [...] Far from being a simple operation like Latin
> case mapping (to which it was compared), TC/SC
> requires potentially complex analysis of the text
> being converted.
>
> This is the opinion of many experts within, as well as
> outside, the Unicode standardization effort, and it is
> the reason you will not find a Unicode TC/SC mapping
> table.
Actually, such a table can easily be extracted from Unicode's Unihan
database (a huge file: <http://www.unicode.org/Public/UNIDATA/Unihan.txt>).
The relevant information for TC->SC is in the <kSimplifiedVariant> field, and
for SC->TC in the <kTraditionalVariant> field.
As each field is on a separate line, the information can be extracted quite
simply, such as with the DOS command:
find "kSimplifiedVariant" Unihan.txt > kSimplifiedVariant.txt
However, as Doug explained, this 1-to-1 data is NOT suitable for a
full-fledged conversion; it may still be a good starting point for more
complex approaches.
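
To make that concrete, here is a minimal Python sketch of such a starting
point. It assumes the tab-separated line format that Unihan.txt uses
(<codepoint><TAB><field><TAB><value>), and it simply skips entries that list
more than one variant, since those are precisely the cases that need
context-dependent handling:

# Build a naive one-to-one TC->SC table from Unihan.txt.
# Assumes lines of the form: U+XXXX<TAB>kSimplifiedVariant<TAB>U+YYYY [U+ZZZZ ...]
def load_mapping(path, field="kSimplifiedVariant"):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue                     # skip comment lines
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3 or parts[1] != field:
                continue
            source, _, value = parts
            targets = value.split()
            if len(targets) == 1:            # keep only unambiguous pairs
                mapping[chr(int(source[2:], 16))] = chr(int(targets[0][2:], 16))
    return mapping

tc_to_sc = load_mapping("Unihan.txt")
# naive per-character substitution, only as a demo of the raw table
print("".join(tc_to_sc.get(c, c) for c in "漢語"))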
It can also prove useful for implementing things such as a user-friendly
search function that matches any variant of the characters sought. In this
respect, Unihan contains two more fields that may be useful:
<kSemanticVariant> and <kSpecializedSemanticVariant>.
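
For example (only a sketch, and assuming the variant classes have already
been collected from the fields above into a simple dictionary), a search
could canonicalize both the query and the text before matching:

# variant_classes: dict mapping a character to the set of its variants
# (built elsewhere from kSimplifiedVariant/kTraditionalVariant and the
# two semantic-variant fields; not shown here).
def canonicalize(text, variant_classes):
    out = []
    for ch in text:
        group = variant_classes.get(ch, set()) | {ch}
        out.append(min(group))               # pick one stable representative
    return "".join(out)

def variant_search(query, text, variant_classes):
    return canonicalize(query, variant_classes) in canonicalize(text, variant_classes)

# toy data: treat U+6F22 and U+6C49 as one class
classes = {"漢": {"漢", "汉"}, "汉": {"漢", "汉"}}
print(variant_search("汉字", "漢字文化", classes))   # True with this toy data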
_ Marco