Re: UTF-8 and string manipulations in Java

From: Ed Trager (ed.trager@gmail.com)
Date: Wed Jan 07 2009 - 14:13:33 CST


    Hi, Konstantin,

    On Wed, Jan 7, 2009 at 10:42 AM, <ktadenev@ups.com> wrote:
    > Hello,
    > I have a question on Java internal data manipulations as they pertain to
    > UTF-8 strings.
    >
    > Are these statements correct?
    >
    >
    > 1. java.lang.String expects UTF-8 data and any data manipulations appear to a
    > Java programmer as being performed in UTF-8
    >
    > 2. Internally, when a string manipulation method is invoked (e.g., length(),
    > charAt(int), etc.), Java converts the string content to UTF-16, performs the
    > requested manipulation and converts the content back to UTF-8. None of this
    > is visible to the Java developer
    >

    I don't personally know how Java implements its String class.

    However, it is certainly possible to implement a length() method on a
    UTF-8 string class *without* first converting the UTF-8-encoded
    string to another transformation format (UTF-16 or otherwise). The
    same is true for a charAt(int) method. One can simply use bit masks
    on the most significant bits of each byte of the UTF-8 string: a
    byte of the form 10xxxxxx is a continuation byte, and every other
    byte starts a new character, so counting the non-continuation bytes
    gives the number of Unicode characters in the string.
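
    To make the bit-mask idea concrete, here is a minimal sketch in
    Java. The class and method names (Utf8Sketch, countCodePoints,
    codePointAt) are made up for illustration and are not part of any
    Java library, and the code assumes the input is well-formed UTF-8
    (it does no validation).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Sketch {

    // Counts Unicode code points by counting every byte that is NOT a
    // continuation byte (continuation bytes look like 10xxxxxx).
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Returns the code point at the given code-point index by walking the
    // lead bytes and decoding only the requested sequence.
    // Assumes well-formed UTF-8.
    static int codePointAt(byte[] utf8, int index) {
        int i = 0;
        int seen = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xFF;
            // Sequence length is determined by the high bits of the lead byte.
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            if (seen == index) {
                if (len == 1) {
                    return lead;
                }
                // Keep the payload bits of the lead byte, then append the
                // low 6 bits of each continuation byte.
                int cp = lead & (0xFF >> (len + 1));
                for (int k = 1; k < len; k++) {
                    cp = (cp << 6) | (utf8[i + k] & 0x3F);
                }
                return cp;
            }
            i += len;
            seen++;
        }
        throw new IndexOutOfBoundsException("index: " + index);
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
        byte[] data = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);
        System.out.println(countCodePoints(data));
        System.out.println(Integer.toHexString(codePointAt(data, 3)));
    }
}
```

    Note that an index lookup done this way has to scan from the start
    of the byte array, so it takes time proportional to the index
    rather than constant time.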

    Another (and in fact very likely) possibility is for a string class to
    use UTF-16 internally, in which case there would be no need to convert
    *from* an internal UTF-8, but only *to* UTF-8 when sending output to a
    console or file or other device.
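
    A short sketch of what that second possibility looks like from the
    programmer's side, using only standard java.lang.String methods.
    Java's documented String API counts in UTF-16 code units (char
    values), whatever the internal storage happens to be; the class
    name Utf16Sketch and the helper toUtf8 are invented for this
    example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf16Sketch {

    // Encodes to UTF-8 only at the boundary, e.g. just before writing
    // to a file, socket, or console.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+20AC (one UTF-16 code unit) followed by U+1F600
        // (a surrogate pair, i.e. two UTF-16 code units).
        String s = "\u20ac\uD83D\uDE00";

        // length() counts UTF-16 code units, not bytes or code points.
        System.out.println(s.length());

        // codePointCount() counts Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));

        // The UTF-8 form exists only after an explicit encoding step.
        System.out.println(toUtf8(s).length);
    }
}
```

    Here the three numbers differ (code units, code points, UTF-8
    bytes), which is a quick way to see that length() and charAt(int)
    are not operating on UTF-8 data.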


