Re: UTF-8 and string manipulations in Java

From: Ed Trager (ed.trager@gmail.com)
Date: Wed Jan 07 2009 - 14:13:33 CST

Next message: Phillips, Addison: "RE: UTF-8 and string manipulations in Java"

Previous message: Mark Davis: "Re: Emoji: chart updated with font glyph images"
In reply to: ktadenev@ups.com: "UTF-8 and string manipulations in Java"
Next in thread: Phillips, Addison: "RE: UTF-8 and string manipulations in Java"

Hi, Konstantin,

On Wed, Jan 7, 2009 at 10:42 AM, <ktadenev@ups.com> wrote:
> Hello,
> I have a question on Java internal data manipulations as they pertain to
> UTF-8 strings.
>
> Are these statements correct?
>
>
> 1. java.lang.String expects UTF-8 data and any data manipulations appear to a
> Java programmer as being performed in UTF-8
>
> 2. Internally, when a string manipulation method is invoked (e.g., length(),
> charAt(int), etc.), Java converts the string content to UTF-16, performs the
> requested manipulation and converts the content back to UTF-8. None of this
> is visible to the Java developer
>

I don't personally know how Java implements its String class.

However, it is certainly possible to implement a length() method on a
UTF-8 string class *without* first converting the UTF-8-encoded
string to another transformation format (UTF-16 or otherwise). The
same is true for a charAt(int) method, although it would have to scan
from the start of the string rather than index directly. One can
simply use bit masks to look at the most significant bits of each
byte in a UTF-8 string: every byte that is not a continuation byte
(bit pattern 10xxxxxx) starts a new character, so counting those
bytes gives the number of Unicode characters in the string.
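
To make the bit-mask idea concrete, here is a minimal sketch in Java.
The class name Utf8Scan and its methods are made up for illustration
and are not part of any Java library, and the code assumes the input
is already well-formed UTF-8 (it does no validation).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a JDK class.
public class Utf8Scan {
    // A continuation byte has the bit pattern 10xxxxxx.
    public static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Counts Unicode characters (code points) by counting every byte
    // that is not a continuation byte. Assumes well-formed UTF-8.
    public static int length(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (!isContinuation(b)) {
                count++;
            }
        }
        return count;
    }

    // Returns the byte offset where the character at the given
    // zero-based index starts. This is a linear scan, not O(1).
    public static int offsetOf(byte[] utf8, int index) {
        int seen = -1;
        for (int i = 0; i < utf8.length; i++) {
            if (!isContinuation(utf8[i]) && ++seen == index) {
                return i;
            }
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
        byte[] data = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);       // 6
        System.out.println(length(data));      // 3
        System.out.println(offsetOf(data, 2)); // 3
    }
}
```

A charAt-style lookup built on offsetOf would then decode the one to
four bytes starting at the returned offset.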

Another (and in fact very likely) possibility is for a string class to
use UTF-16 internally, in which case there would be no need to convert
*from* an internal UTF-8, but only *to* UTF-8 when sending output to a
console or file or other device.
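
As a rough illustration of that second possibility: the documented
java.lang.String API counts UTF-16 code units, so the only explicit
conversions are at the boundary. The class Utf16Demo and its helper
methods below are made up for this sketch; the String and
StandardCharsets calls are the standard ones.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
public class Utf16Demo {
    // Convert *to* UTF-8 only when bytes are needed for output.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Convert *from* UTF-8 when reading bytes back in.
    public static String fromUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', U+00E9, and U+1F600 (outside the BMP, so it takes
        // two UTF-16 code units, written here as a surrogate pair).
        String s = "a\u00e9\uD83D\uDE00";

        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3

        byte[] utf8 = toUtf8(s);
        System.out.println(utf8.length);                     // 7
        System.out.println(fromUtf8(utf8).equals(s));        // true
    }
}
```

Note that length() reports 4 here, not 3, because it counts UTF-16
code units rather than Unicode characters.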


This archive was generated by hypermail 2.1.5 : Wed Jan 07 2009 - 14:15:50 CST