On Thu, 10 Jan 2002, Ken Krugler wrote:
> I've got GBK-encoded text that contains a number of Traditional Hanzi
> characters. I'd like to convert all of these to their Simplified
> equivalents. So does anybody know of a GBK table that maps each
> Traditional form to its Simplified form?
If converting to "simplified equivalents" means reducing the text so that
it is representable in GB2312, then I'd recommend:
1) If the GBK character is in GB2312, keep it as-is.
2) Otherwise, convert to Big5 using Unicode as an intermediary. Take
the characters that converted to Big5 successfully and run them through
one of the many Big5->GB2312 converters suggested by Frank Tang,
which will perform the traditional->simplified conversion.
3) If there are any characters that weren't handled by step #2 (e.g.,
traditional Chinese characters not in Big5[1]; traditional Chinese
characters in Big5 but not treated by most Big5->GB2312 converters[2];
non-Chinese characters used in Japanese[3]/Korean, since the source text
*is* GBK), then turning them and the surrounding context
over to a human with access to a number of good dictionaries would
probably be the best way to (hopefully) find a "best fit" under
the circumstances (e.g., if it happens to be a variant of a
character that is in GB2312[4]). If even that fails, perhaps the
character in question can be described graphically a la "A+B"[5] or
the text in question rewritten[6].
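
The three steps above can be sketched roughly as a per-character triage.
This is only an illustration, assuming Python 3 and its bundled "gbk",
"gb2312" and "big5" codecs (whose tables may differ slightly from any
particular vendor's); the function names and the "keep" / "via_big5" /
"manual" labels are made up for this example. It only sorts characters
into the three cases and does not itself perform the Big5->GB2312
traditional->simplified conversion.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the single character can be encoded in the codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def triage(ch: str) -> str:
    """Classify one character (hypothetical labels, for illustration only)."""
    if encodable(ch, "gb2312"):
        return "keep"       # step 1: already representable in GB2312
    if encodable(ch, "big5"):
        return "via_big5"   # step 2: hand to a Big5->GB2312 converter
    return "manual"         # step 3: needs a human and good dictionaries


def triage_text(gbk_bytes: bytes) -> list:
    """Decode GBK bytes via Unicode and triage every character."""
    text = gbk_bytes.decode("gbk")
    return [(ch, triage(ch)) for ch in text]


if __name__ == "__main__":
    sample = "\u56fd\u570b\u5700".encode("gbk")
    for ch, action in triage_text(sample):
        print("U+%04X" % ord(ch), action)
```

Anything that comes back as "manual" is what step 3 would turn over to
a person, along with its surrounding context.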
[1] e.g., U+5700 (GBK 0x87F3) is a variant form of guo2 'country' that
is not in Big5, but one can substitute U+56FD (GB2312 0xB9FA),
the form of guo2 'country' used in simplified Chinese.
[2] e.g., U+5187 (GBK 0x83D3) is in Big5, used primarily to write mou
'not' in Cantonese (but other meanings also exist), but I haven't seen
a converter to GB2312 yet that'll substitute U+65E0 (GB2312 0xCEDE),
a near-synonym and etymologically-related character.
[3] e.g., U+7A93 (GBK 0xB799) is a Japanese form of chuang1 'window', but
one can substitute U+7A97 (GB2312 0xB4B0).
[4] See [1], [2], [3].
[5] i.e., as the combination of its components.
[6] e.g., U+72C6 (GBK 0xA0F0) occurs in Big5; in most Chinese texts
where it is encountered, it means 'Japanese spaniel dog; Japanese Chin'
(and is not a pejorative ethnonym), so the text will have to be rewritten
using whatever name that dog breed goes under in GB2312 simplified
Chinese texts.
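
For anyone who wants to check the code points cited in these footnotes
against their own tables, a small helper along the following lines may
be handy. It is a sketch assuming Python 3's bundled codecs; the name
code_in is invented for this example, and a result of None simply means
that that codec's table has no mapping for the character.

```python
def code_in(ch: str, codec: str):
    """Return the character's bytes in the codec as upper-case hex, or None."""
    try:
        return ch.encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Code points taken from footnotes [1]-[3] and [6] above.
    for cp in (0x5700, 0x56FD, 0x5187, 0x65E0, 0x7A93, 0x7A97, 0x72C6):
        ch = chr(cp)
        print("U+%04X" % cp,
              "GBK:", code_in(ch, "gbk"),
              "GB2312:", code_in(ch, "gb2312"),
              "Big5:", code_in(ch, "big5"))
```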
Thomas Chan
tc31@cornell.edu