Re: Java and Unicode

From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Thu Nov 16 2000 - 09:11:22 EST


At 4:44 PM -0800 11/15/00, Markus Scherer wrote:

>In the case of Java, the equivalent course of action would be to
>stick with a 16-bit char as the base type for strings. The int type
>could be used in _additional_ APIs for single Unicode code points,
>deprecating the old APIs with char.
>

It's not quite that simple. Many of the key APIs in Java already use
ints instead of chars where chars are expected. In particular, the
Reader and Writer classes in java.io do this.
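
For instance, Reader's basic read() method returns an int rather
than a char, partly so it has room to return -1 at end of stream.
A quick demonstration (the class name here is my own invention):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Demo class; ReadChars is just a name I made up for this example
public class ReadChars {
    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("caf\u00e9");
        int c; // an int, even though we're reading characters
        while ((c = in.read()) != -1) {
            System.out.println((char) c);
        }
    }
}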

I do agree that it makes sense to use strings rather than characters.
I'm just wondering how bad the transition is going to be. Could we
get away with eliminating (or at least deprecating) the char data
type completely and all methods that use it? And can we do that
without breaking all existing code and redesigning the language?

For example, consider the charAt() method in java.lang.String:

public char charAt(int index)
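
which supports the familiar loop (s here is any String; the body
is just a placeholder):

for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // ... do something with c; a non-BMP character would
    // show up here as two separate surrogate chars ...
}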

As the loop shows, this method is used to walk strings, looking at
each character in turn, a useful thing to do. Clearly it would be
possible to replace it with a method that has a String return type,
like this:

public String charAt(int index)

The returned string would contain a single character (which might be
composed of two surrogate chars). However, we can't simply add that
method because Java can't overload on return type. So we have to give
that method a new name like:

public String characterAt(int index)
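
Assuming the old char-based charAt() sticks around, a first cut at
characterAt() inside java.lang.String might look something like
this. (This is just a sketch of mine, not anything Sun has
proposed; the surrogate ranges come from the Unicode standard.)

// Hypothetical sketch: return the complete character at index as
// a String, joining a high surrogate to the low surrogate that
// follows it.
public String characterAt(int index) {
    char c = charAt(index);
    // A high surrogate (U+D800-U+DBFF) followed by a low
    // surrogate (U+DC00-U+DFFF) encodes one non-BMP character.
    if (c >= '\uD800' && c <= '\uDBFF' && index + 1 < length()) {
        char d = charAt(index + 1);
        if (d >= '\uDC00' && d <= '\uDFFF') {
            return substring(index, index + 2);
        }
    }
    return String.valueOf(c);
}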

OK. That one's not too bad, maybe even more intelligible than what
we're replacing. But we have to do this in hundreds of places in the
API! Some will be much worse than this. Is it really going to be
possible to make this sort of change everywhere? Or is it time to
bite the bullet and break backwards compatibility? Or should we
simply admit that non-BMP characters aren't that important and stick
with the current API? Or perhaps provide special classes that handle
non-BMP characters as an ugly bolt-on to the language, one that will
be used by a few Unicode aficionados but ignored by most programmers,
just as wchar_t is ignored in C to this day?
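
To make that last option concrete, the bolt-on might be no more
than a wrapper class along these lines. (UnicodeString is a name I
just invented; nothing like it exists in the current API.)

// Purely hypothetical sketch of the bolt-on approach: a wrapper
// that counts characters (code points) rather than UTF-16 chars.
public final class UnicodeString {

    private final String data;

    public UnicodeString(String data) {
        this.data = data;
    }

    // Number of characters, counting each surrogate pair as one.
    // Assumes the underlying string is well-formed UTF-16.
    public int length() {
        int count = 0;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c >= '\uD800' && c <= '\uDBFF'
                && i + 1 < data.length()) {
                i++; // skip the trailing low surrogate
            }
            count++;
        }
        return count;
    }
}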

None of these solutions is attractive. It may take the next
post-Java language to solve the problem properly.

-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java news: http://metalab.unc.edu/javafaq/  |
|  Read Cafe con Leche for XML news: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+


