Re: UTF-8 and string manipulations in Java

From: Johannes Rössel (joey@muhkuhsaft.de)
Date: Wed Jan 07 2009 - 11:10:57 CST

Next message: Daniel Ehrenberg: "Word break tests"

Previous message: James Kass: "Re: Emoji: emoticons vs. literacy"
In reply to: ktadenev@ups.com: "UTF-8 and string manipulations in Java"
Next in thread: Ed Trager: "Re: UTF-8 and string manipulations in Java"

Hello,

First, this question probably belongs on a Java mailing list rather than
a Unicode one, as it deals with Java specifics, not Unicode per se.

> 1. java.lang.String expects UTF-8 data and any data manipulations
> appear to a Java programmer as being performed in UTF-8
>

To cite the Java Language Specification, Third Edition (p. 48): “The
Java programming language represents text in sequences of 16-bit code
units, using the UTF-16 encoding. A few APIs, primarily in the Character
class, use 32-bit integers to represent code points as individual
entities. The Java platform provides methods to convert between the two
representations.”
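
As a rough illustration of the conversion methods the quoted passage
alludes to, here is a minimal sketch using Character.toChars and
Character.toCodePoint. The class name CodePointDemo and the choice of
U+1D11E as the sample character are my own assumptions, picked only
for illustration.

```java
class CodePointDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
    // Plane, so it needs two UTF-16 code units. Sample value chosen for
    // illustration only.
    static final int G_CLEF = 0x1D11E;

    /** Converts one code point (a 32-bit int) into its UTF-16 code units. */
    static char[] toCodeUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Combines a surrogate pair back into the code point it encodes. */
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = toCodeUnits(G_CLEF);
        System.out.println(units.length);                         // 2
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
        System.out.println(
            Integer.toHexString(fromSurrogates(units[0], units[1]))); // 1d11e
    }
}
```

A code point in the Basic Multilingual Plane, such as 'A', comes back
from toCodeUnits as a single char.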

No reference to UTF-8 is made anywhere within the specification.

The prevalent encoding for source code files that use Unicode directly is
probably UTF-8, but in that case the compiler converts the string
literals into UTF-16.

I am not sure what exactly you mean by “appear to a Java programmer as
being performed in UTF-8”. String processing is always done on the
string, or on substrings, in terms of 16-bit char values. None of it is
based on the bytes of an encoded form of the string, if that's what you
mean. Within strings you may have to deal with high or low surrogate code
units (U+D800–U+DFFF), though (not sure, since I never tried).
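
To make the distinction concrete, here is a small sketch comparing
what length() and charAt(int) report with the number of code points
and with the size of an explicitly requested UTF-8 encoding. The class
name StringUnitsDemo and the sample string are assumptions of mine,
not anything from the original question.

```java
import java.nio.charset.StandardCharsets;

class StringUnitsDemo {
    // "a", U+00E9 (e with acute) and U+1D11E (G clef), written with
    // escapes so the encoding of the source file does not matter.
    // Hypothetical sample, chosen for illustration.
    static final String SAMPLE = "a\u00E9\uD834\uDD1E";

    /** Number of UTF-16 code units, which is what length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Size in bytes only once the string is explicitly encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // 4
        System.out.println(codePoints(SAMPLE));  // 3
        System.out.println(utf8Bytes(SAMPLE));   // 7
        // charAt indexes code units, so index 2 is a lone high surrogate.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(2))); // true
    }
}
```

UTF-8 bytes show up here only because getBytes is asked for them; the
char-based methods never see them.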

> 2. Internally, when a string manipulation method is invoked
> (e.g., length(), charAt(int), etc.), Java converts the string content
> to UTF-16, performs the requested manipulation and converts the
> content back to UTF-8. None of this is visible to the Java developer
>

Not that I know of. A Java implementation which behaves like this
probably violates the specification in the quoted section above.

Regards,
Johannes


This archive was generated by hypermail 2.1.5 : Wed Jan 07 2009 - 11:14:09 CST