Re: Java and Unicode

From: Jungshik Shin (jshin@pantheon.yale.edu)
Date: Wed Nov 15 2000 - 15:52:57 EST


On Wed, 15 Nov 2000, Thomas Chan wrote:

> On Wed, 15 Nov 2000, Jungshik Shin wrote:
>
> > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> > >
> > > Many people try to compare this to DBCS, but it really is not the same
> > > thing.... understanding lead bytes and trail bytes in DBCS is *astoundingly*
> > > more complicated than handling surrogate pairs.
> >
> > Well, it depends on what multibyte encoding you're talking about. In case
> > of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
> > SJIS (Windows-932), Windows-949 (UHC), Windows-950, Windows-1361 (JOHAB),
> > ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN, it's not that hard (about
> > the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR)
>
> I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
> KS X 1001 in it) to the "complicated" group because of the shifting bytes
> required to get to different planes/character sets.

Well, EUC-KR has never used character sets other than US-ASCII (or
its Korean variant, KS X 1003) and KS X 1001, although the theoretical
possibility is there. A more realistic complication for EUC-KR (although
it is very rarely used; there are only two known implementations: Hanterm,
a Korean xterm, and Mozilla) arises not from a third character set
(KS X 1002) but from the 8-byte sequence representation of the 8822
(11172 - 2350) Hangul syllables not covered by the repertoire of KS X 1001.
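
For illustration only, here is a minimal Python sketch of how such an
8-byte sequence can be put together. It assumes the layout usually
described for KS X 1001 Annex 3: the Hangul filler (0xA4D4) followed by
three two-byte compatibility jamo from row 4, with the filler standing in
for a missing final consonant. The function name and the tables are
hypothetical, written for this example and not taken from Hanterm or
Mozilla.

```python
# Sketch only: the names and tables below are illustrative assumptions.
HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable in Unicode
ROW4 = 0xA4            # EUC-KR lead byte of KS X 1001 row 4 (compatibility jamo)
FILLER = 0xD4          # trail byte of HANGUL FILLER, i.e. 0xA4D4 in EUC-KR

# Row-4 trail bytes of the 19 initial consonants, in Unicode syllable order.
CHOSEONG = bytes.fromhex("a1a2a4a7a8a9b1b2b3b5b6b7b8b9babbbcbdbe")
# Row-4 trail bytes of the final consonants; index 0 (no final) is the filler.
JONGSEONG = bytes.fromhex(
    "d4a1a2a3a4a5a6a7a9aaabacadaeafb0b1b2b4b5b6b7b8babbbcbdbe"
)


def eight_byte_sequence(syllable: str) -> bytes:
    """Build the 8-byte representation of one precomposed Hangul syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, 21 * 28)   # 21 vowels x 28 finals per initial
    vowel, trail = divmod(rest, 28)
    return bytes([
        ROW4, FILLER,             # marks the start of the sequence
        ROW4, CHOSEONG[lead],     # initial consonant
        ROW4, 0xBF + vowel,       # the 21 vowels are contiguous from 0xA4BF
        ROW4, JONGSEONG[trail],   # final consonant, or filler if absent
    ])


print(11172 - 2350)                          # 8822 syllables need this form
print(eight_byte_sequence("\ub620").hex())   # a4d4a4a8a4c7a4b1
```

U+B620 is used here as an example of a syllable outside the 2350 in
KS X 1001. A decoder has to recognize the filler and consume all eight
bytes as one character, which is where the extra complication comes from.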

As for EUC-JP (which uses JIS X 0201/US-ASCII, JIS X 0208, and JIS X 0212)
and EUC-TW, I know what you're saying. That's exactly why I added
'especially in case of EUC-CN and EUC-KR' at the end of my previous
message :-) I probably should have written that, among multibyte
encodings, at least EUC-CN and EUC-KR are as easy to handle as UTF-16.
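
To make the comparison concrete, the following is a rough Python sketch
of forward character splitting for both cases. It assumes 'pure' two-byte
EUC data (US-ASCII plus one 94x94 set, as in EUC-CN and EUC-KR), so the
single-shift bytes used by EUC-JP and EUC-TW are rejected, and the rare
8-byte Hangul sequences are not treated specially. The function names are
hypothetical.

```python
def split_euc(data: bytes) -> list:
    """Split 'pure' two-byte EUC text (EUC-KR or EUC-CN) into characters."""
    chars, i = [], 0
    while i < len(data):
        if data[i] < 0x80:
            width = 1                     # US-ASCII: a single byte
        elif (0xA1 <= data[i] <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            width = 2                     # lead byte followed by trail byte
        else:
            raise ValueError(f"invalid byte at offset {i}")
        chars.append(data[i:i + width])
        i += width
    return chars


def split_utf16(units: list) -> list:
    """Split a list of UTF-16 code units into characters."""
    chars, i = [], 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            width = 2
        elif 0xDC00 <= units[i] <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            width = 1                     # BMP character: one code unit
        chars.append(units[i:i + width])
        i += width
    return chars


print(split_euc(b"\xc7\xd1\xb1\xdb A"))
print(split_utf16([0x0041, 0xD800, 0xDC00]))
```

In both loops a single range check on the current unit decides whether
one or two units make up the character, which is the sense in which the
two are about equally easy to handle.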

Jungshik Shin


