From: Ken Krugler (ken@transpac.com)
Date: Tue Apr 05 2005 - 16:09:49 CST
I'm trying to generate a fairly complete mapping between GBK and
Big-5+HKSCS, two legacy encodings, where fuzzy equivalence is OK (and
preferable to no mapping).
I've been using various .ucm files from ICU, as well as the
Unihan.txt file (for Simplified & Traditional variants).
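As a rough illustration of the variant-data step, here is a minimal Python sketch that builds a variant table from Unihan-style lines. It assumes the usual tab-separated layout (code point, field name, value) and only looks at the kSimplifiedVariant and kTraditionalVariant fields; the function name parse_variants is hypothetical, not part of any existing tool.

```python
def parse_variants(lines, fields=("kSimplifiedVariant", "kTraditionalVariant")):
    """Build {char: [variant chars]} from Unihan-style tab-separated lines.

    Assumed line layout: "U+XXXX<TAB>fieldName<TAB>U+YYYY [U+ZZZZ ...]".
    """
    table = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in fields:
            continue
        src = chr(int(parts[0][2:], 16))
        for token in parts[2].split():
            # Drop any "<source" annotation that may follow the code point.
            cp = token.split("<")[0]
            if cp.startswith("U+"):
                table.setdefault(src, []).append(chr(int(cp[2:], 16)))
    return table
```

The resulting dictionary can serve as the fallback table when a character has no direct mapping in the target encoding.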
This has worked reasonably well for GBK->Big-5+HKSCS, as expected.
Out of the 7601 characters in GBK that I've got glyph data for, only
268 can't be mapped. I could whittle this down a bit by using
mappings suggested by the cross reference data found in
NamesList.txt, though each would have to be hand-verified.
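The approach described above (pivot through Unicode, then fall back to a Simplified/Traditional variant when there is no direct mapping) can be sketched roughly as follows. Note that this uses Python's built-in "gbk" and "big5hkscs" codecs rather than the ICU .ucm tables, so the exact coverage will differ from the counts given here. The VARIANTS table and the map_char name are hypothetical; a real table would come from the Unihan variant fields.

```python
# Hypothetical variant table for illustration only; in practice it would
# be derived from the Unihan Simplified/Traditional variant data.
VARIANTS = {
    "\u56fd": ["\u570b"],  # U+56FD (simplified) -> U+570B (traditional)
}

def map_char(src_bytes, src_enc="gbk", dst_enc="big5hkscs", variants=VARIANTS):
    """Map one encoded character to dst_enc, or return None if unmapped.

    Tries the exact Unicode character first, then each listed variant
    (the "fuzzy" equivalents) in order.
    """
    ch = src_bytes.decode(src_enc)
    for candidate in [ch] + variants.get(ch, []):
        try:
            return candidate.encode(dst_enc)
        except UnicodeEncodeError:
            continue
    return None
```

Swapping src_enc and dst_enc (and supplying a Traditional-to-Simplified table) gives the reverse direction. Anything that comes back as None is a candidate for hand verification.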
For Big-5+HKSCS->GBK, the situation isn't so great. Out of the 18275
characters in Big-5+HKSCS that I've got glyph data for, 2162 can't be
mapped. Most of these (1598) are HKSCS characters that map to U+2xxxx
code points.
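To see how much of the unmapped set falls into the U+2xxxx range, one could split the leftover characters by Unicode plane. GBK only covers code points in the Basic Multilingual Plane, so anything in plane 2 has no direct target there. A minimal sketch, with split_by_plane2 as a hypothetical helper name:

```python
def split_by_plane2(chars):
    """Split characters into (plane-2 chars, everything else).

    Plane 2 is the range U+20000..U+2FFFF, i.e. the "U+2xxxx" code points.
    """
    plane2, other = [], []
    for ch in chars:
        if ord(ch) >> 16 == 2:
            plane2.append(ch)
        else:
            other.append(ch)
    return plane2, other
```

The plane-2 list is the part that needs a fuzzy (variant-based) mapping if it is to be mapped at all; the rest may still be resolvable with ordinary cross-reference data.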
So does anybody know of an existing mapping table, or have a
suggestion for how to fuzzily resolve a significant number of the
remaining unmapped HKSCS characters? I'm pretty sure somebody else has
wrestled with this same problem.
And yes, I realize this is a bit like trying to park a Cadillac in a closet :)
Thanks,
-- Ken
-- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200