RE: UTF-8 and string manipulations in Java

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jan 11 2009 - 12:41:48 CST

Next message: David Starner: "Re: Emoji: emoticons vs. literacy"

Previous message: Leo Broukhis: "Re: Emoji: emoticons vs. literacy"
In reply to: Phillips, Addison: "RE: UTF-8 and string manipulations in Java"
Next in thread: Phillips, Addison: "RE: UTF-8 and string manipulations in Java"
Reply: Phillips, Addison: "RE: UTF-8 and string manipulations in Java"

> [mailto:unicode-bounce@unicode.org] On behalf of Phillips, Addison
> Sent: Wednesday, January 7, 2009 21:36
> To: ktadenev@ups.com; unicode@unicode.org
> Subject: RE: UTF-8 and string manipulations in Java
>
>
> Hi Konstantin,
>
> > 1. java.lang.String expects UTF-8 data and any data manipulations
> > appear to a Java programmer as being performed in UTF-8
>
> This is not correct. java.lang.String is a Unicode string
> type--an array of UTF-16 code units. That is, the internal
> encoding of String is UTF-16. Some methods exist (post 1.5)
> for manipulating Unicode code points (i.e. UTF-16 surrogate
> pairs are treated as a single character).
>
> All external data consists of bytes. To create a String, a
> character encoding must be used to convert the bytes to
> String's internal encoding (which, as mentioned, is UTF-16).
> Depending on how you access the data, various character
> encodings may be the default value. Usually it is best to
> specify the encoding, as with InputStreamReader, the String ctor, etc.
>
> Since you are a database architect, you may mean that data in
> JDBC is UTF-8. The encoding used actually depends on the
> database driver vendor's implementation, although many
> drivers (such as Oracle's) do use UTF-8 on the wire. With
> JDBC, the conversion between the database's internal (native)
> encoding and String's internal UTF-16 encoding is invisible,
> and, in fact, not under programmatic control. Accessing a
> varchar in the database via JDBC is basically transparent:
> you read it as a String object from the ResultSet.
>
> Finally, Java has a "UTF-8-like" serialization for String
> objects that is based on UTF-8, but this is internal to Java
> and should not be confused with either the encoding used by
> String or with a valid access method for strings.

It is NOT invisible to Java programmers: this specific encoding is exposed
in the format of compiled Java classes (see the JVM specification), and
also through a subset of the mandatory API that every JVM implementation
must provide to support JNI. This modified UTF-8 is used by some of these
APIs to allow easier integration; however, this part of the API should be
deprecated, because it involves data conversion and memory allocation that
are performed by the JVM itself and not by the JNI extension used by the
client (which is written in a language other than Java, most often C or
C++). The recommended (and faster) part of the JNI API uses UTF-16, with no
data conversion and less memory allocation.

Even in 100% pure Java code, you will be exposed to this encoding when
handling custom class loaders and dynamic Java code generation. But it is
true that both are advanced programming techniques that most Java
programmers do not need or use (or use only through utility libraries
like BCEL).
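
As a rough illustration of where pure Java code meets this encoding: each
CONSTANT_Utf8 entry in a class file's constant pool is a two-byte length
followed by modified UTF-8 bytes, which is the same layout that
DataInputStream.readUTF() reads. The sketch below lists those entries for
java.lang.String. The names ConstantPoolStrings and utf8Constants are made
up for this example, and it assumes the class file can be fetched as a
system resource.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class ConstantPoolStrings {

    /** Returns the CONSTANT_Utf8 entries of a class file's constant pool. */
    static List<String> utf8Constants(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        in.readUnsignedShort();             // minor_version
        in.readUnsignedShort();             // major_version
        int count = in.readUnsignedShort(); // constant_pool_count
        byte[] skipped = new byte[8];
        List<String> result = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1:  // Utf8: u2 length + modified UTF-8 bytes
                    result.add(in.readUTF());
                    break;
                case 3: case 4:                     // Integer, Float
                case 9: case 10: case 11: case 12:  // member refs, NameAndType
                case 17: case 18:                   // Dynamic, InvokeDynamic
                    in.readFully(skipped, 0, 4);
                    break;
                case 5: case 6:                     // Long, Double
                    in.readFully(skipped, 0, 8);
                    i++;                            // these occupy two pool slots
                    break;
                case 7: case 8: case 16: case 19: case 20:
                    in.readFully(skipped, 0, 2);    // Class, String, MethodType, Module, Package
                    break;
                case 15:                            // MethodHandle
                    in.readFully(skipped, 0, 3);
                    break;
                default:
                    throw new IOException("unknown constant pool tag " + tag);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in =
                 ClassLoader.getSystemResourceAsStream("java/lang/String.class")) {
            if (in == null) {
                System.out.println("class file not available as a resource");
                return;
            }
            List<String> names = utf8Constants(in);
            System.out.println(names.size() + " Utf8 entries, e.g. " + names.get(0));
        }
    }
}
```

Libraries that generate or rewrite bytecode do this kind of parsing
internally, which is why most programmers never see the encoding directly.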

The modified UTF-8 is just used there as a serialization of the UTF-16
internal storage (exposed in the String methods) onto a stream of bytes,
for use strictly within Java. It is not meant for interchange, except for
the transport of precompiled Java classes in the usual Java class format.
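
To make the difference from standard UTF-8 concrete, here is a minimal
sketch comparing the two encodings through DataOutputStream.writeUTF() and
String.getBytes(). The class and helper names (ModifiedUtf8Demo,
modifiedUtf8, hex) are invented for this example. The modified form encodes
U+0000 as two bytes and encodes each UTF-16 surrogate separately, so a
supplementary character takes six bytes instead of four.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    /** Encodes s as modified UTF-8, dropping the 2-byte length that writeUTF prepends. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0";             // U+0000
        String clef = "\uD834\uDD1E";  // U+1D11E, one surrogate pair in UTF-16

        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));                      // C0 80
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));  // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(clef)));                     // ED A0 B4 ED B4 9E
    }
}
```

Strings made only of characters in the range U+0001 to U+FFFF, excluding
surrogates, come out byte-for-byte identical in both encodings, which is
why the difference often goes unnoticed.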

But other class loaders also exist that use their own format for
compiling/loading or interchanging compiled classes: the JVM spec only
describes the format used by the default built-in class loader that every
JVM must support (as well as the format for JAR archives, and how
collections of compiled classes can be compressed and searched within the
class paths used by the default class loader). There are several compiled
class formats that do not use this modified UTF-8 serialization, but use
UTF-16BE or UTF-16LE directly: it is not mandatory in Java to *store* the
compiled classes in the format described, as long as they work with the
standard API for class loaders and you write a custom ClassLoader that
can handle your format (and can also support the JVM native debug
interface on the loaded classes).
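
A minimal sketch of the idea, using only the standard ClassLoader API. The
"custom storage format" here is simply gzip compression, chosen for
brevity; it is an assumption of this example, not one of the formats
mentioned above. All names (CompressedClassLoaderDemo,
CompressedClassLoader, demo.generated.Stored) are hypothetical. Note that
ClassLoader.defineClass() still expects bytes in the standard class file
layout, so a loader for any other storage format has to convert to that
layout in memory before defining the class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedClassLoaderDemo {

    /** Builds the class file bytes of an empty public class extending Object. */
    static byte[] emptyClassFile(String internalName) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(0xCAFEBABE);
        out.writeShort(0);                                   // minor_version
        out.writeShort(52);                                  // major_version (Java 8)
        out.writeShort(5);                                   // constant_pool_count
        out.writeByte(1); out.writeUTF(internalName);        // #1 Utf8: class name
        out.writeByte(7); out.writeShort(1);                 // #2 Class -> #1
        out.writeByte(1); out.writeUTF("java/lang/Object");  // #3 Utf8: superclass name
        out.writeByte(7); out.writeShort(3);                 // #4 Class -> #3
        out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
        out.writeShort(2);                                   // this_class
        out.writeShort(4);                                   // super_class
        out.writeShort(0);                                   // interfaces_count
        out.writeShort(0);                                   // fields_count
        out.writeShort(0);                                   // methods_count
        out.writeShort(0);                                   // attributes_count
        out.flush();
        return buffer.toByteArray();
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    /** Loads classes that are stored gzip-compressed, keyed by binary name. */
    static final class CompressedClassLoader extends ClassLoader {
        private final Map<String, byte[]> store = new HashMap<>();

        CompressedClassLoader(ClassLoader parent) {
            super(parent);
        }

        void put(String binaryName, byte[] compressed) {
            store.put(binaryName, compressed);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            byte[] compressed = store.get(name);
            if (compressed == null) {
                throw new ClassNotFoundException(name);
            }
            try {
                // Convert from the storage format back to the standard layout.
                byte[] classFile = gunzip(compressed);
                return defineClass(name, classFile, 0, classFile.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Stores one generated class in compressed form and loads it back. */
    static Class<?> loadStored(String binaryName) throws Exception {
        CompressedClassLoader loader =
            new CompressedClassLoader(ClassLoader.getSystemClassLoader());
        loader.put(binaryName, gzip(emptyClassFile(binaryName.replace('.', '/'))));
        return loader.loadClass(binaryName);
    }

    public static void main(String[] args) throws Exception {
        Class<?> stored = loadStored("demo.generated.Stored");
        System.out.println(stored.getName() + " loaded by " + stored.getClassLoader());
    }
}
```

A format that kept its strings as UTF-16BE or UTF-16LE would follow the
same pattern: findClass() would rebuild the constant pool entries as
modified UTF-8 in memory before calling defineClass().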

This technique with custom class loaders and storage formats was used
extensively in Java 1.5, just before the introduction of standard
annotation processing in Java 6, and it is still used for new advanced (or
experimental) features of the language or within some projects (and it
does not require writing any C/C++ code integrated with JNI, as it is
possible using the existing Java API). So there do exist class formats
that store large amounts of text more compactly, in custom storage
formats, without using this Java-specific modified UTF-8 basic
serialization scheme.

This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 13:44:10 CST