RE: UTF-8 and string manipulations in Java

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jan 11 2009 - 12:41:48 CST


    > [mailto:unicode-bounce@unicode.org] On behalf of Phillips, Addison
    > Sent: Wednesday, January 7, 2009 21:36
    > To: ktadenev@ups.com; unicode@unicode.org
    > Subject: RE: UTF-8 and string manipulations in Java
    >
    >
    > Hi Konstantin,
    >
    > > 1. java.lang.String expects UTF-8 data and any data manipulations
    > > appear to a Java programmer as being performed in UTF-8
    >
    > This is not correct. java.lang.String is a Unicode string
    > type--an array of UTF-16 code units. That is, the internal
    > encoding of String is UTF-16. Some methods exist (since 1.5)
    > for manipulating Unicode code points (i.e. UTF-16 surrogate
    > pairs are treated as a single character).
    >
    > All external data consists of bytes. To create a String, a
    > character encoding must be used to convert the bytes to
    > String's internal encoding (which, as mentioned, is UTF-16).
    > Depending on how you access the data, various character
    > encodings may be the default value. Usually it is best to
    > specify the encoding, as with InputStreamReader, the String ctor, etc.
    >
    > Since you are a database architect, you may mean that data in
    > JDBC is UTF-8. The encoding used actually depends on the
    > database driver vendor's implementation, although many
    > drivers (such as Oracle's) do use UTF-8 on the wire. With
    > JDBC, the conversion between the database's internal (native)
    > encoding and String's internal UTF-16 encoding is invisible,
    > and, in fact, not under programmatic control. Accessing a
    > varchar in the database via JDBC is basically transparent:
    > you read it as a String object from the ResultSet.
    >
    > Finally, Java has a "UTF-8-like" serialization for String
    > objects that is based on UTF-8, but this is internal to Java
    > and should not be confused with either the encoding used by
    > String or with a valid access method for strings.

    It is NOT invisible to Java programmers: this specific encoding is exposed
    in the format of compiled Java classes (see the JVM specification), and
    also through a subset of the mandatory API that every JVM implementation
    must provide to support JNI. This modified UTF-8 is used by some JNI
    functions to allow easier integration. However, that part of the API
    should be deprecated, because it involves internal data conversion and
    space allocation performed by the JVM itself, not by the JNI extension
    that the client uses and that is written in a language other than Java
    (most often C or C++). The recommended (and faster) part of the JNI API
    uses UTF-16, with no data conversion and less data allocation.

    Even in 100% pure Java code, you are exposed to this encoding when you
    write custom class loaders or generate Java code dynamically. It is true,
    though, that both are advanced techniques that most Java programmers do
    not need, or use only through utility libraries such as BCEL.

    The modified UTF-8 is used there only as a serialization of the UTF-16
    internal storage (exposed by the String methods) onto a stream of bytes,
    for use strictly within Java. It is not meant for interchange, except for
    transporting precompiled Java classes in the usual Java class format.
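
    As a rough illustration of how this serialization differs from standard
    UTF-8, the sketch below uses java.io.DataOutputStream.writeUTF, which
    writes modified UTF-8 after a 2-byte length prefix, and compares it with
    String.getBytes(StandardCharsets.UTF_8). The class name ModifiedUtf8Demo
    and its helper methods are hypothetical, chosen only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the name and helpers are illustrative only.
final class ModifiedUtf8Demo {

    // Encode with DataOutputStream.writeUTF (modified UTF-8) and drop the
    // 2-byte length prefix that writeUTF puts in front of the payload.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\0"; // U+0000
        // U+1F600: one code point, stored as two UTF-16 code units.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(hex(modifiedUtf8(nul)));           // C0 80
        System.out.println(hex(standardUtf8(nul)));           // 00
        System.out.println(hex(modifiedUtf8(supplementary))); // ED A0 BD ED B8 80
        System.out.println(hex(standardUtf8(supplementary))); // F0 9F 98 80
    }
}
```

    The two differences visible here are that U+0000 becomes the two bytes
    C0 80, and that each UTF-16 surrogate is encoded separately as three
    bytes, so a supplementary character takes six bytes instead of four.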

    But other class loaders exist that use their own formats for compiling,
    loading, or interchanging compiled classes. The JVM spec only describes
    the format used by the default built-in class loader that every JVM must
    support, the format of JAR archives, and how collections of compiled
    classes can be compressed and searched within the class paths used by the
    default class loader. Several compiled class formats do not use this
    modified UTF-8 serialization but use UTF-16BE or UTF-16LE directly: Java
    does not require you to *store* compiled classes in the described format,
    as long as they work with the standard API for class loaders and you
    write a custom ClassLoader that can handle your format (and that can also
    support the JVM native debug interface on the loaded classes).
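
    To make the idea of storing strings as UTF-16BE directly a little more
    concrete, here is a minimal sketch of a string record that writes the
    UTF-16 code units as they are, with no transcoding. The record layout (a
    4-byte count followed by big-endian code units) and the class name
    Utf16StringStore are assumptions made for this example; they do not
    describe any existing class format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record format: the layout below is an assumption for
// illustration, not part of any real class file format.
final class Utf16StringStore {

    // Layout: 4-byte big-endian count of UTF-16 code units, followed by
    // each code unit as two big-endian bytes (that is, UTF-16BE).
    static byte[] write(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(s.length());
            out.writeChars(s); // writes every char high byte first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read the record back, code unit by code unit, without transcoding.
    static String read(byte[] record) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(record))) {
            int count = in.readInt();
            char[] units = new char[count];
            for (int i = 0; i < count; i++) {
                units[i] = in.readChar();
            }
            return new String(units);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "A\0" + new String(Character.toChars(0x1F600));
        byte[] record = write(original);
        System.out.println(record.length);                 // 4 + 2 * original.length()
        System.out.println(read(record).equals(original)); // true
    }
}
```

    Because the code units are copied as they are, the round trip preserves
    U+0000 and even unpaired surrogates, at the cost of two bytes per code
    unit for ASCII text.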

    This technique of custom class loaders and storage formats was used
    extensively around Java 1.5, when annotations were being introduced
    (annotations arrived in Java 5; Java 6 added the standard
    annotation-processing API), and it is still used for new advanced (or
    experimental) features of the language and within some projects. It does
    not require writing any C/C++ code integrated through JNI, as it is
    possible with the existing Java API. So there really are class formats
    that store large amounts of text in more compact custom storage formats,
    without using this Java-specific modified UTF-8 basic serialization
    scheme.
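
    A compressed text store of this kind could look roughly like the sketch
    below, which packs a table of strings as standard UTF-8 inside a DEFLATE
    stream using only java.util.zip. The class name CompressedStringTable
    and the table layout are hypothetical and are not taken from any real
    class format; the sketch also assumes well-formed strings, since
    standard UTF-8 cannot represent unpaired surrogates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical string table; the layout is an assumption for illustration.
final class CompressedStringTable {

    // Layout inside the DEFLATE stream: entry count, then for each entry
    // a 4-byte length followed by the standard UTF-8 bytes.
    static byte[] pack(List<String> strings) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out =
                new DataOutputStream(new DeflaterOutputStream(buf))) {
            out.writeInt(strings.size());
            for (String s : strings) {
                byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
                out.writeInt(utf8.length);
                out.write(utf8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Inflate the table and decode each entry back into a String.
    static List<String> unpack(byte[] packed) {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(packed)))) {
            int count = in.readInt();
            List<String> result = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                byte[] utf8 = new byte[in.readInt()];
                in.readFully(utf8);
                result.add(new String(utf8, StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Repetitive names, as found in large constant tables.
        List<String> names = Collections.nCopies(200, "java/lang/String");
        byte[] packed = pack(names);
        System.out.println("packed bytes: " + packed.length);
        System.out.println(unpack(packed).equals(names)); // true
    }
}
```

    A custom ClassLoader could read such a table before defining its
    classes; how much space is saved depends entirely on how repetitive the
    stored text is.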



    This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 13:44:10 CST