Re: Java char and Unicode 3.0+ (was:Canonical equivalence in rendering: mandatory or recommended?)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 15 2003 - 14:50:25 CST


From: "Nelson H. F. Beebe" <beebe@math.utah.edu>
To: "Philippe Verdy" <verdy_p@wanadoo.fr>
Cc: <beebe@math.utah.edu>; "Jill Ramonsky" <Jill.Ramonsky@Aculab.com>
Sent: Wednesday, October 15, 2003 5:34 PM
Subject: Re: Canonical equivalence in rendering: mandatory or recommended?

> [This is off the unicode list.]
>
> Philippe Verdy wrote on the unicode list on Wed, 15 Oct 2003 15:44:44
> +0200:
>
> >> ...
> >> Looking at the Java VM machine specification, there does not
> >> seem to be something implying that a Java "char" is necessarily a
> >> 16-bit entity. So I think that there will be sometime a conforming
> >> Java VM that will return UTF-32 codepoints in a single char, or
> >> some derived representation using 24-bit storage units.
> >> ...
>
> I disagree: look at p. 62 of
>
> @String{pub-AW = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
> @String{pub-AW:adr = "Reading, MA, USA"}
>
> @Book{Lindholm:1999:JVM,
> author = "Tim Lindholm and Frank Yellin",
> title = "The {Java} Virtual Machine Specification",
> publisher = pub-AW,
> address = pub-AW:adr,
> edition = "Second",
> pages = "xv + 473",
> year = "1999",
> ISBN = "0-201-43294-3",
> LCCN = "QA76.73.J38L56 1999",
> bibdate = "Tue May 11 07:30:11 1999",
> price = "US\$42.95",
> acknowledgement = ack-nhfb,
> }
>
> where it states:
>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >> characters (section 2.1)

Personally, I read this reference, the second edition of the Java VM spec:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html

Notably:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.htm
which states clearly that "char" is not an integer type, but a numeric type
without a constraint on the number of bits it contains (yes, there have
existed JVM implementations with 9-bit chars!)

It states:
[quote]
    2.4.1 Primitive Types and Values

    A primitive type is a type that is predefined by the Java programming
language and named by a reserved keyword. Primitive values do not share
state with other primitive values. A variable whose type is a primitive type
always holds a primitive value of that type.(*2)
    [...]
    The integral types are byte, short, int, and long, whose values are
8-bit, 16-bit, 32-bit, and 64-bit signed two's-complement integers,
respectively, and char, whose values are 16-bit unsigned integers
representing Unicode characters (section 2.1).
    *2: Note that a local variable is not initialized on its creation and is
considered to hold a value only once it is assigned (section 2.5.1).
[/quote]

Then it defines the important rules for char conversions or promotions,
for arithmetic operations, assignments or method invocation:

[quote]
2.6.2 Widening Primitive Conversions [...]
* char to int, long, float, or double [...]
Widening conversions do not lose information about the sign or order of
magnitude of a numeric value. Conversions widening from an integral type to
another integral type do not lose any information at all; the numeric value
is preserved exactly. [...]
According to this rule, a widening conversion of a signed integer value to
an integral type simply sign-extends the two's-complement representation of
the integer value to fill the wider format. A widening conversion of a value
of type char to an integral type zero-extends the representation of the
character value to fill the wider format.
Despite the fact that loss of precision may occur, widening conversions
among primitive types never result in a runtime exception (section 2.16).
[/quote]

So a 'char' MUST have AT MOST the same number of bits as an 'int', i.e. it
cannot have more than 32 bits. If char were defined to have 32 bits, no
zero-extension would occur, but the above rule would still be valid.

and:

[quote]
2.6.3 Narrowing Primitive Conversions [...]
* char to byte or short [...]
Narrowing conversions may lose information about the sign or order of
magnitude, or both, of a numeric value (for example, narrowing an int value
32763 to type byte produces the value -5). Narrowing conversions may also
lose precision.[...]
Despite the fact that overflow, underflow, or loss of precision may occur,
narrowing conversions among primitive types never result in a runtime
exception.
[/quote]

Yes, this paragraph says that char to short may lose information. This
currently does not occur with unsigned 16-bit chars, but this could happen
safely with 32-bit chars without violating the rule.

However, as the sign must be kept when converting char to int, this means
that the new 32-bit char would have to be signed. Yes this means that there
would exist now negative values for chars but none of them are currently
used with 16-bit chars which are in the range [\u0000..\uFFFF] and promoted
as if they were integers in range [0..65535]. That's why a narrowing
conversion occurs when converting to short (the sign may change, even if no
bits are lost)...
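
To make the two quoted rules concrete, here is a minimal sketch of how
today's 16-bit char behaves under widening and narrowing. The class and
method names are only illustrative, and it says nothing about what a
hypothetical 32-bit char would do:

```java
public class CharConversions {
    // Widening char -> int zero-extends: the result is never negative.
    static int widenChar(char c) {
        return c;
    }

    // Widening short -> int sign-extends the two's-complement value.
    static int widenShort(short s) {
        return s;
    }

    // Narrowing char -> short keeps all 16 bits, but the top bit is
    // now read as a sign bit, so the numeric value may change.
    static short narrowToShort(char c) {
        return (short) c;
    }

    // Narrowing int -> byte keeps only the low 8 bits.
    static byte narrowToByte(int i) {
        return (byte) i;
    }

    public static void main(String[] args) {
        char c = '\uFFFF';
        System.out.println(widenChar(c));            // 65535
        System.out.println(narrowToShort(c));        // -1
        System.out.println(widenShort((short) -1));  // -1
        System.out.println(narrowToByte(32763));     // -5, as in the spec
    }
}
```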

The VM also allows assignment conversion to use narrowing conversions to
chars without causing a compile-time error or an exception at run-time
(section 2.6.7).

Note that the 16-bit indication in section 2.1 relates to the fact that it
was referring to the support of Unicode 2.1, which did not allocate any
character outside the BMP, and just reserved a range of codepoints for
surrogates. Until the UTF-16 specification was clarified by Unicode in a
later specification, this spec was valid, and it is still so now...

The Java language still lacks a way to specify a literal for a character
outside the BMP. Of course one could write '\uD800\uDC00', but this does
not compile with the current _compilers_, which expect only one char in the
literal. In a String literal, "\uD800\uDC00" becomes the 4-byte UTF-8
sequence for _one_ Unicode codepoint in the compiled class.
It is the class loader that decodes the UTF-8 sequence and builds the
String object containing an unspecified number of chars!

So on a 16-bit-char VM the string "\uD800\uDC00" would be of length 2,
containing separate surrogates.
On a 32-bit-char VM it would be of length 1, containing only one char
'\U10000;'. If you extract the first char to store it in a short in your
application, you'll be assigning 0 by a narrowing conversion (and this is in
accordance with the VM spec and the language, which stipulate explicitly
that you may lose information).
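
As a rough illustration of the 16-bit case, the sketch below checks the
string length and recombines the two surrogates by hand using the standard
UTF-16 formula. The class and method names are invented for this example;
the last line only simulates the hypothetical 32-bit char by narrowing an
int, since no such VM is assumed to exist:

```java
public class SurrogateDemo {
    // Combine a high and a low surrogate into one code point
    // (standard UTF-16 decoding arithmetic, done in int).
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Simulates storing a wider character value in a short:
    // only the low 16 bits survive the narrowing conversion.
    static short narrow(int codePoint) {
        return (short) codePoint;
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00";
        System.out.println(s.length());  // 2 on a 16-bit-char VM
        int cp = combine(s.charAt(0), s.charAt(1));
        System.out.println(Integer.toHexString(cp));  // 10000
        System.out.println(narrow(cp));               // 0
    }
}
```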

Good programming practice is to extract 'char' values into 'int'
variables if one needs to perform arithmetic, i.e. to use widening
conversions, which are the default for all arithmetic operations.
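
A small sketch of that practice, with illustrative names only: keeping the
result in an int preserves the value, while casting back to char silently
truncates to 16 bits on current VMs.

```java
public class CharArithmetic {
    // The operand is promoted to int, so the sum cannot wrap at 16 bits.
    static int nextAsInt(char c) {
        return c + 1;
    }

    // Casting back to char narrows the int result to 16 bits.
    static char nextAsChar(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        char last = '\uFFFF';
        System.out.println(nextAsInt(last));        // 65536
        System.out.println((int) nextAsChar(last)); // 0
    }
}
```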

I have reread the VM spec many times in the past and still today, looking
for whether there would be a violation if char were now implemented as
signed and 32-bit wide, instead of unsigned and 16-bit wide, and I see
nothing against it.

Also, the VM spec (as well as more recent specs such as the Java Debugging
Interface) does not specify how String instances encode internally their
backing store: there's no requirement that this private field even uses
arrays of chars, and a valid VM could as well recode them with UTF-8,
as in compiled class files, or less probably even SCSU. The private
fields of String or StringBuffer instances are not documented and have
already changed in the past, so this is not even a problem for debuggers:
existing applications that use reflection to change the visibility of
private fields in undocumented classes are not guaranteed to be portable
across existing VMs.

In the language itself, portability could be ensured by using a
compile flag to specify the behavior of classes using char arithmetic or
conversions where 16-bit truncation or a positive sign was expected. This
compile flag could set a currently unused flag in the .class file format, so
that classes could be compiled to run with either the old compatibility-mode
semantics or the new ones.

(I bet that most correctly written code does not assume 16-bit truncation of
native chars nor their positive sign, but if not sure, there could be a
compiler flag to avoid using the extended semantics on a new VM that now
would have 32-bit chars)

Notes:

1. Maybe there should be a standard way to specify codepoints of larger
sizes in the language (not in the JVM). I used '\Uxxx;' with an explicit
ending semicolon instead of forcing all users to type in the full 8 hex
digits for each occurrence. But some other languages or conventions use
'\U010000', where the uppercase \U requires 6 hex digits to enter all
Unicode characters in the standard range [\U000000..\U10FFFF].

2. The initial spec of UTF-32 and UTF-8 by ISO allowed many more planes with
31-bit codepoints, and maybe there will be an agreement sometime in the
future between ISO and Unicode to define new codepoints outside the current
standard first 17 planes that can be safely converted with UTF-16, or a
mechanism will be specified to allow mapping more planes to UTF-16. But this
is currently not a priority as long as there remains unallocated space in
the BMP to define new types and ranges of surrogates for "hyperplanes",
something that is still possible near the Hangul block, just before the
existing low and high surrogates.
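
To show why the existing surrogates stop at 17 planes (note 2), here is a
minimal encoder sketch with hypothetical names: a surrogate pair carries
20 bits of offset, i.e. 16 supplementary planes on top of the BMP, so the
highest reachable codepoint is 0x10FFFF.

```java
public class Utf16Encoder {
    // Highest codepoint reachable with one high and one low surrogate.
    static final int MAX_CODE_POINT = 0x10FFFF;

    // Split a supplementary codepoint into its surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > MAX_CODE_POINT) {
            throw new IllegalArgumentException(
                "not a supplementary codepoint: " + codePoint);
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // low 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(MAX_CODE_POINT);
        System.out.println(Integer.toHexString(pair[0]));  // dbff
        System.out.println(Integer.toHexString(pair[1]));  // dfff
    }
}
```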

If someone from Sun can answer me on this list, I'd like to know their
opinion. Maybe I forgot to read a spec, but I think that Java should
support Unicode better than it does now with Unicode 2.1. Full support
for Unicode 3.0+ in Java is a high priority for all those who
want to support Unicode (for now this support exists in... MS.Net, C#,
"JavaScript", VB...)

Will there be a third edition of the JVM spec that clarifies the allowed
size and semantics of char native types in the VM?


