Converting Big5 to Unicode

From: Tom Emerson (tree@basistech.com)
Date: Tue Nov 07 2000 - 11:06:47 EST


Viswanathan S writes:

> When i compare the output of my conversion ( Big5 to Unicode ) with
> the conversion used by IE 5.0 the result appears to be different .

Microsoft uses CP950, which is an extension to Big 5. You can find
mapping tables for CP950 in various places:

http://www.microsoft.com/globaldev/reference/dbcs/950.htm
http://oss.software.ibm.com/icu/charset/CharMaps-XML/windows-950-2000.xml
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

Of these, the ICU table is probably the best, particularly because it
is in XML and is easier to parse than the Microsoft tables. It was
generated programmatically from the tables that actually ship in
Windows 2000. In the past there were discrepancies between the
published CP950 table and what Windows really did, though this has
probably been fixed.

The Unicode Consortium's mapping table does not support round-tripping
for seven code points. The Microsoft table does map these, and I
believe Microsoft supports round-tripping them via the Private Use
Area (PUA), but I haven't tried it myself.
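
If you want to load one of the plain-text tables yourself, here is a
minimal Python sketch. It assumes the usual layout of the Consortium's
text tables: one entry per line, the legacy code and the Unicode code
point both written as 0x-prefixed hex, followed by a # comment. The
function name parse_mapping_table is mine, not part of any library, and
lines with no Unicode value (lead bytes, undefined codes) are simply
skipped.

```python
import re

# Assumed entry layout: "0xA440<whitespace>0x4E00<whitespace># comment".
_ENTRY = re.compile(r"^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)")


def parse_mapping_table(text):
    """Return {legacy code: Unicode code point} for every mapped entry."""
    table = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # whole-line comment
        match = _ENTRY.match(line)
        if match:  # lines without a Unicode column do not match
            table[int(match.group(1), 16)] = int(match.group(2), 16)
    return table
```

Read the file with open(path, encoding="ascii", errors="replace") and
pass its contents to the function; the result is an ordinary dict you
can compare against another table.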

> My doubt is : Is the Big5 to Unicode mapping table updated after
> 1994 . If yes where can i find the latest mapping between Big5 to
> Unicode ?

I don't know the history of the Consortium's mapping table, but Big 5
itself has not changed since it was released. Extensions have been
made (such as ETen, GCCS, HKSCS, HKUST EUDC, CP950, and Big 5+) that
are not reflected in the Unicode Consortium's tables, but that isn't
the issue here.

You can download the Big 5+ mapping tables from

http://www.cmex.org.tw/big-5.html

You can then extract the Big 5 ideographic mappings from there (i.e.,
all code points in the range 0xA440 to 0xC67E) and compare them with
Microsoft's and the Consortium's, but that is a fair amount of work.
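
Once two tables are loaded as dictionaries, the comparison itself is
short. This is only a sketch: diff_tables is a hypothetical helper name,
and each table is assumed to be a dict from Big 5 code to Unicode code
point. The default range is the one mentioned above.

```python
def diff_tables(table_a, table_b, low=0xA440, high=0xC67E):
    """Return {code: (value in a, value in b)} where the tables disagree.

    A value of None means that table has no mapping for the code.
    """
    diffs = {}
    for code in sorted(set(table_a) | set(table_b)):
        if not low <= code <= high:
            continue  # outside the range being compared
        a = table_a.get(code)
        b = table_b.get(code)
        if a != b:
            diffs[code] = (a, b)
    return diffs
```

Printing each entry with format(code, "04X") gives a list that is easy
to check by hand against the published tables.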

What I would suggest is this: convert the same documents with both IE
and your code, find out which code points differ, and then
cross-reference the Consortium's table and Microsoft's table to
figure out exactly what the differences are.
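
The same idea can be tried without IE by comparing two converters
directly. The sketch below uses Python's built-in "big5" and "cp950"
codecs as stand-ins; they are not IE's converter or your code, so treat
the result as a starting point only. It assumes lead bytes 0xA1-0xF9 and
trail bytes 0x40-0x7E and 0xA1-0xFE, and the function names are mine.

```python
def decode_pair(codec, lead, trail):
    """Decode one two-byte sequence; return None if the codec rejects it."""
    try:
        return bytes([lead, trail]).decode(codec)
    except UnicodeDecodeError:
        return None


def codec_differences(codec_a="big5", codec_b="cp950"):
    """Return {code: (result a, result b)} where the two codecs disagree."""
    trails = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    diffs = {}
    for lead in range(0xA1, 0xFA):
        for trail in trails:
            a = decode_pair(codec_a, lead, trail)
            b = decode_pair(codec_b, lead, trail)
            if a != b:
                diffs[(lead << 8) | trail] = (a, b)
    return diffs
```

Swapping either codec name for a function that wraps your own converter
would give the list of code points to look up in the two tables.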

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


