RE: accessing extended ranges

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Tue Mar 26 2002 - 11:38:57 EST


Hi Ben,

The short answer is: you don't.

Java doesn't support characters outside the BMP (Basic Multilingual Plane) just yet. JDK 1.4 adds full support for Unicode 3.0, which includes a few more CJK characters, but not the 40,000 or so that Unicode 3.1 added beyond U+FFFF.

That said, you can represent the characters as UTF-16 surrogate pairs (Java's internal representation is UTF-16). And some of the character converters will work properly (notably the UTF-8 converter). But as far as Java's concerned each surrogate code point is a separate character.
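
To make the surrogate arithmetic concrete, here is a minimal sketch for the U+2A6A5 character from the question. The class and method names (SurrogateSketch, toSurrogatePair, utf8Hex) are made up for illustration and are not part of the JDK or ICU4J, and the UTF-8 result assumes a converter that handles surrogate pairs:

```java
// Illustrative sketch only; SurrogateSketch is a hypothetical helper class.
class SurrogateSketch {

    /** Splits a supplementary code point (U+10000..U+10FFFF) into two UTF-16 chars. */
    static String toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new String(new char[] { high, low });
    }

    /** Encodes the string as UTF-8 and returns the bytes as lowercase hex. */
    static String utf8Hex(String s) {
        try {
            byte[] bytes = s.getBytes("UTF-8");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < bytes.length; i++) {
                int b = bytes[i] & 0xFF;
                if (b < 0x10) {
                    sb.append('0');
                }
                sb.append(Integer.toHexString(b));
            }
            return sb.toString();
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is a required encoding
        }
    }

    public static void main(String[] args) {
        String s = toSurrogatePair(0x2A6A5);
        System.out.println(s.length());                        // 2 chars for one character
        System.out.println(Integer.toHexString(s.charAt(0)));  // d869
        System.out.println(Integer.toHexString(s.charAt(1)));  // dea5
        System.out.println(utf8Hex(s));                        // f0aa9aa5 if the converter is surrogate-aware
    }
}
```

In a string literal the same character can be written as the two escapes for D869 and DEA5, one after the other.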

Vexingly, the folks at Javasoft haven't said how they'll implement support for Unicode 3.1 and later.

ICU4J, the IBM open-source project, provides some UTF-16 support capabilities that suggest a possible solution, but there are seemingly intractable problems with the Character class and char data type (luckily most APIs in Java take int arguments for characters instead of char). And it is pretty easy to build classes for processing these characters as surrogate pairs using the Unicode character database.
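
As a rough idea of what such a hand-built class might look like, the sketch below walks a String by code point instead of by char. CodePointWalker and its methods are hypothetical names for this example and are not ICU4J's API; character properties would still have to come from the Unicode character database:

```java
// Illustrative sketch only; CodePointWalker is a hypothetical helper class.
class CodePointWalker {

    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    /** Returns the code point starting at index i, combining a valid surrogate pair. */
    static int codePointAt(String s, int i) {
        char c = s.charAt(i);
        if (isHighSurrogate(c) && i + 1 < s.length() && isLowSurrogate(s.charAt(i + 1))) {
            return 0x10000 + ((c - 0xD800) << 10) + (s.charAt(i + 1) - 0xDC00);
        }
        return c; // BMP character, or an unpaired surrogate passed through as-is
    }

    /** Counts code points, treating each valid surrogate pair as one character. */
    static int countCodePoints(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            i += (codePointAt(s, i) >= 0x10000) ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "\u963F\uD869\uDEA5"; // U+963F followed by U+2A6A5 as a surrogate pair
        System.out.println(s.length());                              // 3 chars
        System.out.println(countCodePoints(s));                      // 2 characters
        System.out.println(Integer.toHexString(codePointAt(s, 1)));  // 2a6a5
    }
}
```

Returning int rather than char is the design choice that matters here, since a char cannot hold anything above U+FFFF.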

The downside is that the GUI toolkits, Swing and AWT, don't recognize surrogates properly. Paste U+D800 U+DC00 into a Swing control and you'll see TWO hollow boxes, not one: the JDK is rendering the two surrogates separately. (NB: I haven't tried this test with 1.4, so there may be more support there for surrogates.)

So, using ICU you can probably do some of the processing you're interested in. But GUI apps are going to be very problematic until Swing or AWT are fixed.

Hope that helps.

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone) +1 408.210.3659 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Ben Monroe
> Sent: March 26, 2002 0:17
> To: Unicode list
> Subject: accessing extended ranges
>
>
> I would like to access some of the characters from "CJK Unified Ideographs
> Extension B." These are all in the range of 20000-2A6DF. (direct link:
> http://www.unicode.org/charts/PDF/U20000.pdf )
>
> "Basic Latin" appears in 0000-007F range. The original "CJK Unified
> Ideographs" all appear within the 4E00–9FAF range. These are all easy to
> access with U+xxxx (4 x's). In Java, the format \uxxxx works just fine (and
> also the same for http://www.macchiato.com/unicode/ ). However, how do you
> access the characters in the larger ranges (i.e., U+xxxxx or \uxxxxx)?
>
> Directly using the 5-digit format \uxxxxx produces a Unicode character
> followed by the 5th x. Here is a quick example:
>
> public class UniStringTest {
>     public static void main(String[] args) {
>         String s1 = "\u963F"; // displays fine; standard 4-digit escape (4 x's)
>         System.out.println(s1);
>         String s2 = "\u9FA0"; // also displays fine; standard 4-digit escape (4 x's)
>         System.out.println(s2);
>         // biggest character that I know (5 x's) but doesn't process
>         String s3 = "\u2A6A5";
>         System.out.println(s3);
>     }
> }
>
> I understand this isn't a programming ML, but I just used the Java program
> as an example.
> I'd appreciate some input.
> Thanks,
>
> Ben Monroe
>
>
>


