From: Doug Ewell (dewell@adelphia.net)
Date: Mon Oct 14 2002 - 01:39:57 EDT
<violet@time.net.my> wrote:
> I am trying to write an application that can read input in Traditional
> Chinese but output (printed on paper) in Simplified Chinese, without
> any 3rd party software (e.g. ChineseStar, TwinBridge).
>
> How can I implement Unicode in the coding? The programming language
> I'm using is MS Visual Basic 6 Professional Edition.
It depends on how much of the problem you want to solve. Mapping
between Traditional Chinese (TC) and Simplified Chinese (SC) is *not*
generally 1-to-1, despite what many people believe. It could be
1-to-many, many-to-1, or even many-to-many, depending on which
character(s) are involved.
Some TC characters have different SC "equivalents" depending on which
meaning of the word is intended. And not every TC character ever
invented has an SC equivalent. There is even at least one character A
that is both the traditional form of some character B *and* the
simplified form of another character C!
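
To make the non-1-to-1 point concrete, here is a minimal sketch (in Python rather than VB6, purely for illustration). The table TC_TO_SC and both function names are hypothetical, and the four entries are hand-picked examples rather than an authoritative list. The idea is simply that each TC character maps to a *list* of candidate SC characters, so the ambiguous cases can be detected instead of being converted silently.

```python
# Hypothetical, deliberately tiny TC -> SC table. A real table would need
# thousands of entries taken from an authoritative source.
TC_TO_SC = {
    "發": ["发"],        # "to emit"
    "髮": ["发"],        # "hair"; many-to-1 together with the entry above
    "頭": ["头"],
    "乾": ["干", "乾"],  # "dry" vs. the reading used in 乾坤; 1-to-many
}

def simplify_candidates(text):
    """Return, for each character, the list of possible SC forms."""
    # Characters missing from the table are passed through unchanged.
    return [TC_TO_SC.get(ch, [ch]) for ch in text]

def simplify(text):
    """Pick the first candidate for every character and also return the
    positions where more than one candidate exists."""
    out, ambiguous = [], []
    for index, options in enumerate(simplify_candidates(text)):
        out.append(options[0])
        if len(options) > 1:
            ambiguous.append((index, options))
    return "".join(out), ambiguous

converted, needs_review = simplify("頭髮")
```

With this table, "頭髮" converts without any ambiguity, while a string containing "乾" comes back with an entry in the second return value, which is where a dictionary lookup or a human decision would be needed.
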
TC/SC equivalence in the general case is a linguistic problem. The
Unicode Standard is a character encoding standard, not a linguistic
standard, so it does not attempt to provide definitive TC/SC mapping
tables. The official Unicode Han database:
http://www.unicode.org/Public/UNIDATA/Unihan.txt
does include fields called "kSimplifiedVariant" and
"kTraditionalVariant," which may be of some assistance. But as you will
see, only 2629 "simplified variants" and 2554 "traditional variants" are
listed, for tens of thousands of Han characters.
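
A rough sketch of pulling those two fields out of the file might look like the following. It assumes the tab-separated layout Unihan.txt uses (code point, field name, value, with "#" comment lines), and that a value may list several "U+XXXX" code points. The function name load_variants and the SAMPLE lines are made up for illustration; the sample lines are not copied from the real file, which you would read with open() instead.

```python
import re
from collections import defaultdict

# Matches code points written as U+XXXX (4 to 6 hex digits).
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def load_variants(lines, field):
    """Build {character: [variant characters]} for one Unihan field."""
    table = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue                      # some other field
        source = chr(int(parts[0][2:], 16))
        for match in CODEPOINT.finditer(parts[2]):
            table[source].append(chr(int(match.group(1), 16)))
    return dict(table)

# Illustrative sample in the same layout (assumption, not real file content).
SAMPLE = [
    "# comment line",
    "U+842C\tkSimplifiedVariant\tU+4E07",
    "U+53D1\tkTraditionalVariant\tU+767C U+9AEE",
]

to_simplified = load_variants(SAMPLE, "kSimplifiedVariant")
to_traditional = load_variants(SAMPLE, "kTraditionalVariant")
```

Keeping the values as lists preserves the 1-to-many cases, so the caller still has to decide what to do when a list has more than one entry or when a character has no entry at all.
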
A group of mainland Chinese and Taiwanese industry specialists have
tried (unsuccessfully) to establish a TC/SC conversion layer within the
forthcoming internationalized domain name (IDN) architecture. Their
document includes a list of about 2000 1-to-1 TC/SC pairs taken from
official Chinese and Taiwanese references. It explicitly does not
propose a solution for the non-1-to-1 conversion cases, but dismisses
these cases as uncommon. The document (draft-ietf-idn-tsconv-02.txt)
has expired as an Internet-Draft and is no longer available from the
IETF, but I can supply a copy if you are still interested.
Of course, if you already have the TC/SC conversion module and just need
to convert between a DBCS encoding (e.g. GB 2312) and Unicode in order to
"implement Unicode in the coding," the Unihan.txt file does include these
mappings.
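
For the pure encoding step, a short sketch (again Python, for illustration only) shows the usual pattern of going through Unicode as the pivot. It uses Python's built-in "big5" and "gb2312" codecs instead of tables built from Unihan.txt, and the function name transcode is hypothetical. Note that this changes only the byte encoding, not the characters: it does no TC/SC conversion, and a character that has no code in the target encoding will typically make the encode step raise UnicodeEncodeError.

```python
def transcode(data: bytes, source: str = "big5", target: str = "gb2312") -> bytes:
    """Re-encode DBCS bytes by decoding to Unicode and encoding again."""
    text = data.decode(source)    # DBCS bytes -> Unicode string
    return text.encode(target)    # Unicode string -> other DBCS bytes

# "中文" uses characters that exist in both Big5 and GB 2312.
big5_bytes = "中文".encode("big5")
gb_bytes = transcode(big5_bytes)
```

The same text ends up as different byte sequences in the two encodings, which is why the Unicode string in the middle is the natural place to apply any TC/SC mapping.
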
-Doug Ewell
Fullerton, California