RE: Java's version of UTF-8

From: Peter Westlake (peter@harlequin.co.uk)
Date: Wed Nov 18 1998 - 06:54:57 EST


At 02:47 1998-11-18 -0800, stephen_holmes@lionbridge.com wrote:
>
>
>I don't know what the general consensus on this is at present, but doesn't it
>begin to undermine Unicode as a credible standard?
>
>As I understand it, this could cause any number of conversion issues,
>particularly for clients with, say, client/server systems using both Win32 and
>Java clients, each expecting a UTF-8 stream with their version of "correctness".
>
>Will we be in a position where we'll need something like a special set of
>Unicode control characters to determine whether it's one of a set of UTF-8
>encodings or another?

I think someone has answered this before: Java uses its odd UTF-8
internally, and as a binary format for serializing objects. It does
not use it for I/O to the rest of the world. The internal format
doesn't matter, and the serialization format is only read by other
Java programs (because it is used to make Java objects persistent,
and they only have meaning to the Java VM). You can read and write
proper UTF-8 using the InputStreamReader and OutputStreamWriter
classes, by giving an encoding name as a parameter:

InputStreamReader isr =
    new InputStreamReader(someInputStream, "UTF8");
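
As a fuller illustration of the line above, here is a minimal sketch of a
round trip through standard UTF-8. The class name Utf8RoundTrip and its
helper methods are made up for this example; only InputStreamReader,
OutputStreamWriter and the "UTF8" encoding name come from the Java library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class, for illustration only.
class Utf8RoundTrip {
    // Encode a string as standard UTF-8 through an OutputStreamWriter.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, "UTF8")) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode standard UTF-8 bytes through an InputStreamReader.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), "UTF8")) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u0000\u00e9";
        byte[] bytes = encode(original);
        // 'A' takes 1 byte, U+0000 takes 1 byte (00), U+00E9 takes 2 bytes.
        System.out.println(bytes.length);                    // prints 4
        System.out.println(decode(bytes).equals(original));  // prints true
    }
}
```

Note that U+0000 comes out as a single 00 byte here, as standard UTF-8
requires.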

Other conversion methods include "Unicode", which looks for a BOM,
and explicitly big- and little-endian Unicode.
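
A rough sketch of the BOM behaviour follows. It assumes the standard
charsets UTF-16, UTF-16BE and UTF-16LE as stand-ins for the conversion
names mentioned above; that mapping is an assumption of this example, and
the class name BomDemo is invented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class BomDemo {
    // Decode bytes through an InputStreamReader using the given charset.
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bigWithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] bigNoBom = {0x00, 0x41};
        byte[] littleNoBom = {0x41, 0x00};

        // UTF-16 reads the BOM to choose the byte order, then drops it.
        System.out.println(decode(bigWithBom, StandardCharsets.UTF_16));     // prints A
        System.out.println(decode(littleWithBom, StandardCharsets.UTF_16));  // prints A

        // With an explicit byte order no BOM is needed.
        System.out.println(decode(bigNoBom, StandardCharsets.UTF_16BE));     // prints A
        System.out.println(decode(littleNoBom, StandardCharsets.UTF_16LE));  // prints A
    }
}
```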

So I don't think Java has broken the standard at all.
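
For anyone who wants to see the two differences raised in this thread
(U+0000 as C0 80, and surrogate pairs as 6 bytes), here is a small sketch
that puts the two encodings side by side. It uses DataOutputStream.writeUTF
as one place where the Java-specific format shows up; writeUTF puts a
two-byte length in front of the data, which the sketch strips off. The
class name ModifiedUtf8Demo and its methods are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class ModifiedUtf8Demo {
    // Standard UTF-8, as used for I/O with the rest of the world.
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's variant as written by DataOutputStream.writeUTF,
    // with the leading two-byte length prefix removed.
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String pair = "\uD83D\uDE00";  // U+1F600 as a surrogate pair

        System.out.println(hex(standard(nul)));   // prints 00
        System.out.println(hex(modified(nul)));   // prints C0 80
        System.out.println(hex(standard(pair)));  // prints F0 9F 98 80
        System.out.println(hex(modified(pair)));  // prints ED A0 BD ED B8 80
    }
}
```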

Peter.

>-----Original Message-----
>From: <unicode@unicode.org>
>Sent: 18 November 1998 01:32
>To: Unicode List <unicode@unicode.org>
>Subject: Re: Java's version of UTF-8
>
>
>As I understand it, Java's UTF-8 also differs from standard UTF-8 in that
>surrogate-pairs are not encoded using 4 bytes, but rather that they are
>encoded using 6 bytes (one group of 3 bytes for each of the pair), i.e.
>Java UTF-8 treats each of the two elements of surrogate pairs just as it
>treats any other character whose code is greater than U+07ff.
>
>David Batchelor
>
>
>______________________________ Reply Separator _________________________________
>Subject: Java's version of UTF-8
>Author: <unicode@unicode.org> at symb-internet
>Date: 17/11/98 22:52
>
>
>I would like to know if any Java experts on the list can
>
>(1) confirm for me that Java's version of UTF-8 differs only in
> encoding U+0000 as { C0 80 } rather than { 00 }, and
>
>(2) explain why it was necessary for Java to break the standard
> to ensure that every character, EVEN THE NULL CHARACTER, be
> encoded without the use of the null character.
>
>Thanks in advance,
>
>-Doug
>
>
>
>


