From: Ed Trager (ed.trager@gmail.com)
Date: Wed Jan 07 2009 - 14:13:33 CST
Hi, Konstantin,
On Wed, Jan 7, 2009 at 10:42 AM, <ktadenev@ups.com> wrote:
> Hello,
> I have a question on Java internal data manipulations as they pertain to
> UTF-8 strings.
>
> Are these statements correct?
>
>
> 1. java.lang.String expects UTF-8 data and any data manipulations appear to a
> Java programmer as being performed in UTF-8
>
> 2. Internally, when a string manipulation method is invoked (e.g., length(),
> charAt(int), etc.), Java converts the string content to UTF-16, performs the
> requested manipulation and converts the content back to UTF-8. None of this
> is visible to the Java developer
>
I don't personally know how Java implements its String class.
However, it is certainly possible to implement a length() method on a
UTF-8 string class *without* having to first convert the UTF-8-encoded
string to another transformation format (UTF-16 or otherwise). The
same would be true for implementing a charAt(int) method. One can
simply use bit masks to look at the most significant bits of the
serialized bytes in a UTF-8 string (continuation bytes always have the
form 10xxxxxx) to count the number of Unicode characters in the string.
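As a rough illustration of the bit-mask idea, here is a minimal sketch in Java. The class name Utf8Count and both method names are made up for this example and are not part of any Java library. It assumes the input is well-formed UTF-8 and does no validation. Every byte that is not a continuation byte starts a new character, so counting those bytes gives the length, and a charAt-style lookup becomes a linear scan for the Nth such byte.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Count {
    // Counts Unicode code points in well-formed UTF-8 by skipping
    // continuation bytes, which always match the bit pattern 10xxxxxx.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Returns the byte offset where the code point with the given
    // zero-based index starts. This scans from the beginning, so it
    // takes time proportional to the length of the data.
    static int byteOffsetOf(byte[] utf8, int index) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                if (seen == index) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
        byte[] data = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);            // prints 6
        System.out.println(countCodePoints(data));  // prints 3
        System.out.println(byteOffsetOf(data, 2));  // prints 3
    }
}
```

Note that the lookup is not constant-time the way indexing into a fixed-width array would be, which is one reason a string class might prefer a different internal format.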
Another (and in fact very likely) possibility is for a string class to
use UTF-16 internally, in which case there would be no need to convert
*from* an internal UTF-8, but only *to* UTF-8 when sending output to a
console or file or other device.
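For what it's worth, the documented behaviour of java.lang.String fits this second picture: its API is defined in terms of UTF-16 code units (char values), whatever the storage looks like underneath. The small sketch below uses only standard String methods. The class name Utf16Demo and the helper utf8Length are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which lies outside the Basic
        // Multilingual Plane and needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";

        System.out.println(s.length());                      // prints 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 2 (Unicode code points)
        System.out.println(utf8Length(s));                   // prints 5 (1 + 4 UTF-8 bytes)

        // UTF-8 only appears at the boundary: encode on the way out,
        // decode on the way back in.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s));                  // prints true
    }
}
```

So length() and charAt(int) count and index char values, not UTF-8 bytes, and the conversion to or from UTF-8 is something the programmer requests explicitly.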