RE: japanese xml

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Aug 30 2001 - 04:16:21 EDT


Viranga Ratnaike wrote:
> Is it ok for Unicode code points to be
> encoded/serialized using EUC?
> I'm not planning on doing this; just wondering what (?if any?)
> restrictions, there are on choice of transformation format.

EUC simply isn't big enough for Unicode.

Each EUC-encoded character is either a single byte or a sequence of *two*
bytes (EUC-JP also has rarer two- and three-byte forms introduced by the SS2
and SS3 control bytes, but they don't change the arithmetic below). Each byte
in a double-byte character is a non-ASCII code (range 128..255). So, even
generously assuming that the whole 128..255 range is usable in both
positions, EUC allows a maximum of only 16,384 double-byte characters (128 x
128). But Unicode has 1,114,112 code points...
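The arithmetic above can be checked directly:

```python
# Generous upper bound on a one-or-two-byte EUC-style scheme, assuming the
# entire 128..255 range were usable in both positions of a double-byte pair.
double_byte_max = 128 * 128          # 16,384
unicode_code_points = 0x110000       # 1,114,112 (U+0000..U+10FFFF)

# Unicode's code space is exactly 68 times larger than that bound.
print(double_byte_max, unicode_code_points,
      unicode_code_points // double_byte_max)
```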

On the other hand, Unicode's UTF-8 and UTF-16 can each represent the whole
1,114,112-code-point range. So, in *theory*, either of them could also be
used to serialize legacy CJK character sets, which normally contain fewer
than 10,000 characters. But, in *practice*, I have never seen such a thing.

> Is the conversion from euc-jp to utf-8/utf-16 simple; are there
> algorithms and/or converters, out there, that I can access?

Such a conversion requires three steps:

1) decode EUC byte sequences into JIS code points (i.e. get one integer for
each character);
2) convert JIS code points to Unicode code points;
3) encode Unicode code points into UTF-8 byte sequences (or UTF-16 word
sequences).

Steps 1 and 3 are very simple and totally algorithmic. Step 2 is more
complex, and requires looking up some sort of "dictionary" or conversion
table.
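The three steps above can be sketched as follows. This is a minimal
illustration, not a full converter: the JIS-to-Unicode table here holds only
two sample entries, where a real implementation needs the complete JIS X 0208
mapping (thousands of entries), plus handling for the SS2/SS3 sequences.

```python
# Step 2's "dictionary": a tiny illustrative excerpt of the JIS X 0208 ->
# Unicode mapping table, NOT the full table.
JIS_TO_UNICODE = {
    0x2422: 0x3042,  # HIRAGANA LETTER A
    0x3021: 0x4E9C,  # CJK UNIFIED IDEOGRAPH-4E9C
}

def euc_jp_to_utf8(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Step 1: a single-byte (ASCII) character maps to itself.
            cp = b
            i += 1
        else:
            # Step 1: a double-byte character; clearing the high bit of each
            # byte recovers the JIS code point.
            jis = ((data[i] & 0x7F) << 8) | (data[i + 1] & 0x7F)
            # Step 2: table lookup from JIS to Unicode.
            cp = JIS_TO_UNICODE[jis]
            i += 2
        # Step 3: encode the Unicode code point as a UTF-8 byte sequence.
        out += chr(cp).encode('utf-8')
    return bytes(out)
```

For example, the EUC-JP bytes A4 A2 strip down to JIS code point 0x2422,
which maps to U+3042, which UTF-8 encodes as E3 81 82.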

There are many free implementations of such converters available on the web.
One of the first places to look for such things is the open-source ICU
library (go to the IBM site and search for "ICU" or "Unicode").
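If you just need to convert files rather than write code, the iconv utility
shipped with most Unix systems also handles this conversion directly
(assuming your iconv build includes the EUC-JP tables):

```shell
# Convert an EUC-JP encoded file to UTF-8.
iconv -f EUC-JP -t UTF-8 input.euc > output.utf8

# Quick check: the EUC-JP bytes A4 A2 are HIRAGANA LETTER A (U+3042).
printf '\xa4\xa2' | iconv -f EUC-JP -t UTF-8 | xxd -p
```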

_ Marco



This archive was generated by hypermail 2.1.2 : Thu Aug 30 2001 - 05:45:54 EDT