Re: surrogate at java's property file

From: Yung-Fong Tang (ftang@netscape.com)
Date: Wed Oct 03 2001 - 19:16:49 EDT


Brian Beck:
What do you think?

"Addison Phillips [wM]" wrote:

> Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> versions defines any characters in the supplemental planes.
>
> In Java, a java.lang.Character object is closely tied to the definition of
> a "char", the 16-bit numeric type. Many classes and objects make no
> distinction (or worse, conflate a character with an int---many methods are
> defined to take and return ints for "Characters"). As a result, the Java
> character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> surrogate character *is* recognized to be a surrogate, but a high-low pair
> is not recognized as representing a character, nor can you retrieve the
> character properties of the matched pair.
>
> So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> represented by the sequence "\ud800\udc00". This sequence does NOT represent
> U+10000. It represents TWO Characters, which happen to be surrogates that
> form a valid pair. I should point out that Java is slightly clever. For
> example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> value U+10000 and encodes it as a valid four-byte sequence: f0-90-80-80 (and
> vice versa, of course).
>
> However, it is unclear how Unicode 3.1 support is going to make it into JDK
> 1.4++. The APIs are going to have to change to support the supplemental
> planes and the ripple effects on various APIs seem like an interesting
> problem. Perhaps they'll redefine a char to be a 32-bit value and switch
> Java to UTF-32 (yeah, sure.....)
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Globalization Architect / Manager, Globalization Engineering
> webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
> +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture. It is not a feature.
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Yung-Fong Tang
> Sent: Monday, October 01, 2001 5:10 PM
> To: unicode@unicode.org
> Subject: surrogate at java's property file
>
> Does anyone know how Java handles surrogate pairs in property files?
>
> Java's property files use the \u escape for non-ASCII characters, so
> U+00A5 is written \u00A5. I wonder whether anyone knows how it handles
> a surrogate pair.
>
> Is U+10000 (0xD800 0xDC00) encoded as "\u10000" or as "\ud800\udc00"? (I
> think it should be \u10000.) Or can they not be handled at all?
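
The property-file behavior described in the quoted reply can be checked
with a small sketch. It assumes Java 7 or later, not the JDK versions
discussed in this thread: Properties.load(Reader), String.codePointAt and
StandardCharsets were all added after JDK 1.4. The class name
SurrogateProps, the helper names load and hex, and the key "k" are made
up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class SurrogateProps {
    // Parses a one-entry properties text and returns the value stored under "k".
    static String load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("k");
    }

    // Formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two escapes, one per surrogate: the value holds two 16-bit chars.
        String pair = load("k=\\ud800\\udc00");
        System.out.println(pair.length());                          // 2
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 10000
        // The UTF-8 encoder combines the pair into one four-byte sequence.
        System.out.println(hex(pair.getBytes(StandardCharsets.UTF_8))); // f0 90 80 80

        // A five-digit escape is not understood: only four hex digits are
        // consumed, so the value is U+1000 followed by the digit '0'.
        String five = load("k=\\u10000");
        System.out.println(Integer.toHexString(five.charAt(0)));    // 1000
        System.out.println(five.charAt(1));                         // 0
    }
}
```

So, at least under these assumptions, U+10000 has to be written as the
surrogate pair; the five-digit form silently yields a different string.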


