RE: surrogate at java's property file

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Thu Oct 04 2001 - 23:37:16 EDT


Carl,

Well I'm not too concerned about it. I know (heck, *you* know) the guys over
there. They've done good work to date. I don't doubt they have a solution up
their collective sleeves.

In fact, the problem is basically that no matter which path they pick
(UTF-16 or UTF-32), the Character and String class methods will have to be
changed to deal with it. At present, many of the Character class methods
(and, of course, many aspects of J2SE based on these methods) rely on a
one-to-one relationship between a character and a char, that is, a single
"code unit" in a String (a UCS-2 code unit). A "simple" solution might be to
redefine Character as UTF-32 (scalar value based) and keep UTF-16 for
Strings... but then various access methods in String would have to be
deprecated or replaced. Yuck.
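
To make the code unit versus scalar value distinction concrete, here is a
minimal sketch of the UTF-16 arithmetic that combines a surrogate pair into
one scalar value. The class name SurrogateSketch and the method toScalar are
hypothetical names chosen for illustration; they are not part of any JDK API
discussed in this thread.

```java
// Illustrative sketch only; SurrogateSketch and toScalar are hypothetical names.
final class SurrogateSketch {

    // Combine a high surrogate (U+D800..U+DBFF) and a low surrogate
    // (U+DC00..U+DFFF) into the scalar value the pair encodes in UTF-16.
    static int toScalar(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00";

        // The String holds two 16-bit code units, so length() reports 2.
        System.out.println(s.length()); // prints 2

        // Taken together, the two code units encode one scalar value.
        int scalar = toScalar(s.charAt(0), s.charAt(1));
        System.out.println(Integer.toHexString(scalar)); // prints 10000
    }
}
```

The point of the sketch is only that length() and charAt() count code units,
so any API that equates one char with one character cannot see U+10000 as a
single unit.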

In fact, I suspect (based on some evidence such as presentations at IUC17)
that the basic plumbing (data tables) is in place in 1.4. It'll be
interesting to see how they solve the problem. It's not for lack of asking
that I don't know the answer ;-)

The flip side of all this is the compatibility issue. For example,
properties files are tied to UTF-16. Interoperability with JDK 1.x products
will depend on (as far as I can tell) a UTF-16 implementation. I'm not sure
they *can* change to UTF-32 at this point. Anyhow, all this curiosity is
academic for now. The earliest we'll see Unicode 3.1 support in Java
appears to be the next release beyond 1.4, or about a year from now. We'll
know by then what solution has been adopted. It should be interesting.
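
For anyone who wants to check the properties-file behaviour described
earlier in this thread, the following is a rough sketch. It loads a
one-line properties file containing the escaped pair, then encodes the
resulting String as UTF-8. The class name PropertiesSurrogateSketch, the
helper names, and the key "greeting" are all made up for illustration. It
also assumes a JDK recent enough to provide
java.nio.charset.StandardCharsets, which is much newer than the releases
discussed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Illustrative sketch only; all names in this class are hypothetical.
final class PropertiesSurrogateSketch {

    // Parse the given properties-file text and return the value for key.
    // Properties.load(InputStream) reads the stream as ISO 8859-1.
    static String loadValue(String fileText, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(
                    fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Encode s as UTF-8 and format the bytes as lowercase hex joined by '-'.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The file text contains two six-character escapes, one per code unit.
        String value = loadValue("greeting=\\ud800\\udc00\n", "greeting");

        System.out.println(value.length()); // prints 2
        System.out.println(utf8Hex(value)); // prints f0-90-80-80
    }
}
```

The loaded value is two chars long, yet the UTF-8 encoder treats the pair as
the single scalar value U+10000, which matches the four-byte sequence quoted
below.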

Best Regards,

Addison

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Carl W. Brown
Sent: Thursday, October 04, 2001 5:37 PM
Cc: unicode@unicode.org
Subject: RE: surrogate at java's property file

Addison,

It might be easier to convert the JVM from UCS-2 to UTF-32 so that you do
not have to worry about surrogates. This would more closely match most Unix
implementations (except Sun) where Java is widely used.

Carl

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Addison Phillips [wM]
> Sent: Wednesday, October 03, 2001 4:31 PM
> To: Yung-Fong Tang
> Cc: unicode@unicode.org; bcbeck@eng.sun.com
> Subject: RE: surrogate at java's property file
>
>
> No fair! You forgot to quote my disclaimer in the next email for my big
> boo-boo regarding what an int is in Java. An int is fine, darnit! It's
> char that was originally (at least externally) limited to 16 bits. Of
> course, many APIs use ints, which don't present a problem. But
> java.lang.Character and java.lang.String would have to change internal
> representation or add methods or something to allow surrogate pairs to be
> evaluated.
>
> Addison
>
> -----Original Message-----
> From: Yung-Fong Tang [mailto:ftang@netscape.com]
> Sent: Wednesday, October 03, 2001 4:17 PM
> To: Addison Phillips [wM]
> Cc: unicode@unicode.org; bcbeck@eng.sun.com
> Subject: Re: surrogate at java's property file
>
>
> Brian Beck:
> What do you think?
>
> "Addison Phillips [wM]" wrote:
>
> > Java doesn't define any characters beyond Unicode 2.1.8 at the moment.
> > It's stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither
> > of these versions defines characters in the supplemental planes.
> >
> > In Java, a java.lang.Character object is closely tied to the definition
> > of an "int", the 16-bit numeric type. Many classes and objects make no
> > distinction (or worse, conflate a character with an int---many methods
> > are defined to take and return ints for "Characters"). As a result, the
> > Java character model appears to be tied to UCS-2 (and I don't mean
> > UTF-16). A surrogate character *is* recognized to be a surrogate, but a
> > high-low pair is not recognized as representing a character, nor can you
> > retrieve the character properties of the matched pair.
> >
> > So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> > represented by the sequence "\ud800\udc00". This sequence does NOT
> > represent U+10000. It represents TWO Characters, which happen to be
> > surrogates that form a valid pair. I should point out that Java is
> > slightly clever. For example, the UTF-8 converter knows that U+D800
> > U+DC00 represents the scalar value U+10000 and encodes it as a valid
> > four byte sequence: f0-90-80-80 (and vice versa, of course).
> >
> > However, it is unclear how Unicode 3.1 support is going to make it into
> > JDK 1.4++. The APIs are going to have to change to support the
> > supplemental planes and the ripple effects on various APIs seem like an
> > interesting problem. Perhaps they'll redefine an int to be a 32-bit
> > value and switch Java to UTF-32 (yeah, sure.....)
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Globalization Architect / Manager, Globalization Engineering
> > webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
> > +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture. It is not a feature.
> >
> > -----Original Message-----
> > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> > Behalf Of Yung-Fong Tang
> > Sent: Monday, October 01, 2001 5:10 PM
> > To: unicode@unicode.org
> > Subject: surrogate at java's property file
> >
> > Does anyone know how Java handles surrogate pairs in property files?
> >
> > Java's property files use the \u encoding for non-ASCII characters, so
> > U+00A5 is written \u00A5. How are surrogate pairs handled?
> >
> > Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or as "\ud800\udc00"?
> > (I think it should be \u10000.) Or can they not be handled at all?


