Sun's Java encodings vs IANA's character set registry

From: Mike Brown (mbrown@webb.net)
Date: Wed Apr 11 2001 - 20:59:48 EDT


I am trying to determine how well software that relies on the native
character encoding support of Sun's J2EE platform can handle the character
sets that might be used on the Internet. To that end, I am making a table
that correlates the names and aliases in the IANA's registry of character
sets [1] with the canonical names of the character encodings supported by
Sun's international implementation of J2SE 1.3 [2].

 [1] http://www.isi.edu/in-notes/iana/assignments/character-sets
 [2] http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html

There are a number of encodings in the Sun implementation that I am fairly
certain do not have corresponding listings in the IANA registry. These are
listed below:

Cp737
Cp838
Cp856
Cp875
Cp921
Cp922
Cp930
Cp933
Cp935
Cp937
Cp939
Cp942
Cp942C
Cp943
Cp943C
Cp948
Cp949
Cp949C
Cp950
Cp964
Cp970
Cp1006
Cp1025
Cp1046
Cp1097
Cp1098
Cp1112
Cp1123
Cp1124
Cp1140
Cp1141
Cp1142
Cp1143
Cp1144
Cp1145
Cp1146
Cp1147
Cp1148
Cp1149
Cp1381
Cp1383
Johab
MS874
MS932
MS936
MS949
MS950
MacArabic
MacCentralEurope
MacCroatian
MacCyrillic
MacDingbat
MacGreek
MacHebrew
MacIceland
MacRomania
MacSymbol
MacThai
MacTurkish
MacUkraine
UnicodeBigUnmarked
UnicodeLittleUnmarked

If you see anything in the above list that looks funny, let me know.

I am including UnicodeBigUnmarked and UnicodeLittleUnmarked because,
although either one could in some cases be covered by the IANA's "UTF-16",
neither always is. That is, there is no guarantee of success when decoding
a "UTF-16" (in IANA terminology) byte stream using one of these Java
encodings, since each assumes one fixed byte order. Therefore, I don't
consider them equivalent to "UTF-16" for the purposes of my table.
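
A minimal sketch of that mismatch, assuming a later Java runtime that still
accepts the J2SE names as aliases. The class and method names are made up
for illustration, and nothing here was checked against J2SE 1.3 itself. It
decodes the letter "A", serialized with a byte-order mark in each byte
order, using "UTF-16" and the two unmarked encodings:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; only the encoding names come from the lists above.
final class UnmarkedUtf16Demo {

    // Decodes bytes with a named encoding, converting the checked
    // exception into an unchecked one so callers stay short.
    static String decode(byte[] bytes, String encoding) {
        try {
            return new String(bytes, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + encoding, e);
        }
    }

    // Renders each char as U+XXXX so that invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A" preceded by a byte-order mark, in each byte order.
        byte[] bigWithBom = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        byte[] littleWithBom = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        String[] encodings = { "UTF-16", "UnicodeBigUnmarked", "UnicodeLittleUnmarked" };
        for (String enc : encodings) {
            System.out.println(enc
                    + ": BE+BOM -> " + codePoints(decode(bigWithBom, enc))
                    + ", LE+BOM -> " + codePoints(decode(littleWithBom, enc)));
        }
    }
}
```

Later JDK documentation describes the unmarked decoders as treating a
leading byte-order mark as an ordinary ZERO WIDTH NO-BREAK SPACE character,
whereas the "UTF-16" decoder consumes it to choose the byte order. Only
"UTF-16" should therefore yield a bare U+0041 for both inputs.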

I believe the following (IANA name on the left, Java encoding on the right):

UTF-16 = UTF-16
UTF-16BE = UnicodeBig
UTF-16LE = UnicodeLittle

...although it would not surprise me to find out that this is incorrect.
Opinions appreciated.
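
One way to test these guesses is to look at what each encoder actually
writes, in particular whether it emits a byte-order mark. The following is
only a sketch: the class and method names are hypothetical, and it assumes
the runtime recognizes the encoding names, reporting "unsupported" for any
that it does not.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical probe class; only the encoding names come from this message.
final class Utf16EncoderProbe {

    // Encodes text with a named encoding and returns the bytes as hex,
    // or "unsupported" if this runtime does not know the name.
    static String hex(String text, String encoding) {
        byte[] bytes;
        try {
            bytes = text.getBytes(encoding);
        } catch (UnsupportedEncodingException e) {
            return "unsupported";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] encodings = {
            "UTF-16", "UnicodeBig", "UnicodeBigUnmarked",
            "UnicodeLittle", "UnicodeLittleUnmarked"
        };
        // A leading FE FF or FF FE in the output is a byte-order mark.
        for (String enc : encodings) {
            System.out.println(enc + " -> " + hex("A", enc));
        }
    }
}
```

Comparing the rows for UnicodeBig and UnicodeBigUnmarked (and likewise for
the little-endian pair) shows which of the two writes a mark, which is the
detail that decides which IANA label fits.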

There are a number of encodings in the Sun implementation that I think may
have corresponding listings in the IANA registry. These are listed below:

Cp874 (="IBM-Thai"?)
Cp33722
EUC_CN
EUC_JP
EUC_KR
EUC_TW
GBK
ISO2022CN
ISO2022CN_CNS
ISO2022CN_GB
ISO2022JP
ISO2022KR
MacRoman (="macintosh"?)
JIS0201
JIS0208
JIS0212
JISAutoDetect
SJIS

I cannot determine, based on the information in the referenced documents,
whether these are equivalent to anything in the IANA registry. I suspect
that some of them are. If anyone can shed light on this for me, I'd
appreciate it immensely.
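
On a runtime that provides java.nio.charset (which arrived after J2SE 1.3,
so this is an assumption about the environment), candidate IANA names can
be probed directly. The sketch below uses hypothetical class and method
names. "IBM-Thai" and "macintosh" are the guesses from the list above; the
other labels are IANA registry names picked only for illustration. A hit
shows that the runtime treats the label as an alias, which is a hint of
equivalence rather than proof.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of any table or tool mentioned in this message.
final class IanaNameProbe {

    // Returns the canonical name the runtime reports for a label,
    // or null if the label is unsupported or not a legal charset name.
    static String canonicalNameFor(String label) {
        try {
            return Charset.isSupported(label) ? Charset.forName(label).name() : null;
        } catch (IllegalCharsetNameException e) {
            return null;
        }
    }

    // Returns the runtime's aliases for a label in sorted order,
    // or an empty set if the label is not supported.
    static Set<String> aliasesFor(String label) {
        String canonical = canonicalNameFor(label);
        if (canonical == null) {
            return new TreeSet<String>();
        }
        return new TreeSet<String>(Charset.forName(canonical).aliases());
    }

    public static void main(String[] args) {
        String[] candidates = {
            "IBM-Thai", "macintosh", "Shift_JIS", "EUC-JP",
            "EUC-KR", "ISO-2022-JP", "ISO-2022-KR"
        };
        for (String label : candidates) {
            String canonical = canonicalNameFor(label);
            if (canonical == null) {
                System.out.println(label + " -> not supported by this runtime");
            } else {
                System.out.println(label + " -> " + canonical + " " + aliasesFor(label));
            }
        }
    }
}
```

Which labels resolve depends on the JDK build and on whether its extended
charsets are installed, so the output is a starting point for the table
rather than an authoritative mapping.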

The table will of course be publicly available when it is done.
A rough draft (136K, saved from MS Excel) is at
http://skew.org/xml/charsets/
But don't bookmark that URL.

Thanks, and please reply privately unless discussing the Unicode encodings.

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://skew.org/xml/


