RE: Java's version of UTF-8

From: Addison Phillips (AddisonP@simultrans.com)
Date: Wed Nov 18 1998 - 12:52:06 EST

Next message: Lori Brownell: "RE: converters"
Previous message: Alain LaBonté: "Re: A Search for Exemplary Sentences"
Maybe in reply to: Doug Ewell: "Java's version of UTF-8"
Next in thread: Rick McGowan: "Re: Java's version of UTF-8"

Aw, c'mon Steve, it's not all that bad ;-)). The main problem that poor
old Java has is that under all that nifty object orientation lives what
amounts to a C string with a NULL at the end. But Java is using this
encoding INTERNALLY, as I think someone else pointed out in this thread
somewhere. External representation still has to go to the code
page/character set of the device... even UTF-8 and Unicode text has to
be normalized or it'll be (potentially) "moji-bake" somewhere along the
way.

Thankfully, the smart folks at Sun have provided plenty of solid I18N
structure to help automate this, and good Java coding practices lead
through calls to locale-aware methods (cf. John O'Connor's recent
article in, I think it was, Multilingual).

Addison

PS> Next time we really ought to go have that beer!

-----Original Message-----
From: stephen_holmes@lionbridge.com
[mailto:stephen_holmes@lionbridge.com]
Sent: Wednesday, November 18, 1998 2:48 AM
To: Unicode List
Subject: RE: Java's version of UTF-8

I don't know what the general consensus on this is at present, but
doesn't it begin to undermine Unicode as a credible standard?

As I understand it, this could cause any number of conversion issues,
particularly for clients with, say, client/server systems using both
Win32 and Java clients, each expecting a UTF-8 stream with their
version of "correctness".
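
To make the mismatch concrete, here is a rough sketch (not taken from any
message in this thread) of how a receiver could test whether a byte stream
is standard UTF-8. It assumes a present-day JDK and uses java.nio's strict
UTF-8 decoder; the class name StrictUtf8Check and its helper are
hypothetical names chosen for illustration. The C0 80 and 3-bytes-per-
surrogate forms discussed in this thread are rejected by that decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true only if the bytes decode as standard UTF-8.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isStandardUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] javaStyleNul = {(byte) 0xC0, (byte) 0x80};
        byte[] javaStylePair = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                                (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        byte[] standardForm = {(byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80};

        System.out.println("C0 80             -> " + isStandardUtf8(javaStyleNul));
        System.out.println("ED A0 80 ED B0 80 -> " + isStandardUtf8(javaStylePair));
        System.out.println("F0 90 80 80       -> " + isStandardUtf8(standardForm));
    }
}
```

A check like this only detects the problem; it does not tell the receiver
which variant the sender intended.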

Will we be in a position where we'll need something like a special set
of Unicode control characters to determine whether it's one of a set of
UTF-8 encodings or another?

Just a thought...

Steve.

-----Original Message-----
From: <unicode@unicode.org>
Sent: 18 November 1998 01:32
To: Unicode List <unicode@unicode.org>
Subject: Re: Java's version of UTF-8

As I understand it, Java's UTF-8 also differs from standard UTF-8 in
that surrogate pairs are not encoded using 4 bytes, but rather that
they are encoded using 6 bytes (one group of 3 bytes for each of the
pair), i.e. Java UTF-8 treats each of the two elements of a surrogate
pair just as it treats any other character whose code is greater than
U+07FF.
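
The difference can be seen by encoding one supplementary character both
ways. The following is a minimal sketch, assuming a present-day JDK in
which DataOutputStream.writeUTF produces the Java-specific form (preceded
by a 2-byte length, which the helper strips) and String.getBytes with the
UTF-8 charset produces the standard form. The class name
SurrogateEncodingDemo and its helpers are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SurrogateEncodingDemo {
    // Bytes produced by writeUTF, without its leading 2-byte length field.
    static byte[] javaUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8 as produced by the platform's charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String u10000 = "\uD800\uDC00"; // U+10000 written as a surrogate pair
        System.out.println("standard UTF-8: " + hex(standardUtf8(u10000)));
        // prints: standard UTF-8: F0 90 80 80
        System.out.println("Java writeUTF:  " + hex(javaUtf8(u10000)));
        // prints: Java writeUTF:  ED A0 80 ED B0 80
    }
}
```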

David Batchelor

______________________________ Reply Separator _________________________________
Subject: Java's version of UTF-8
Author: <unicode@unicode.org> at symb-internet
Date: 17/11/98 22:52

I would like to know if any Java experts on the list can

(1) confirm for me that Java's version of UTF-8 differs only in
encoding U+0000 as { C0 80 } rather than { 00 }, and

(2) explain why it was necessary for Java to break the standard
to ensure that every character, EVEN THE NULL CHARACTER, be
encoded without the use of the null character.
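
For reference, point (1) can be checked with a short sketch like the one
below. It assumes a present-day JDK and uses DataOutputStream.writeUTF as
the source of Java's variant (its 2-byte length prefix is stripped by the
helper); the class name NulEncodingDemo and its helpers are hypothetical.
The practical effect is that the Java-encoded bytes never contain a 00
byte, so code that treats 00 as a C-style terminator sees the whole string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NulEncodingDemo {
    // Bytes produced by writeUTF, without its leading 2-byte length field.
    static byte[] javaUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // True if any byte in the array is 00.
    static boolean containsZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) return true;
        }
        return false;
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\0B"; // 'A', U+0000, 'B'
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        byte[] java = javaUtf8(s);
        System.out.println("standard UTF-8: " + hex(standard)); // 41 00 42
        System.out.println("Java writeUTF:  " + hex(java));     // 41 C0 80 42
        System.out.println("zero byte in standard form? " + containsZeroByte(standard)); // true
        System.out.println("zero byte in Java form?     " + containsZeroByte(java));     // false
    }
}
```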

Thanks in advance,

-Doug

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:43 EDT