Re: UTF-8

From: addison@inter-locale.com
Date: Wed Sep 20 2000 - 10:53:05 EDT


Dear Stephen,

What exactly happens is that the page contents are converted from the
internal representation (UTF-16) to the target character set (UTF-8).

You appear to be treating the UTF-8 as if it were ISO 8859-1 (fooling
the JVM), which is what produces the results that you see. You need to
set up your database connection to store and retrieve UTF-8, rather than
storing and retrieving the UTF-8 as if it were Latin-1. Doing the latter
defeats the purpose of using a Unicode encoding in your database (you
could just use Latin-1 and save a lot of storage).

So what should happen is:

UTF-16 (database) -> UTF-16 (JVM) -> UTF-8 (page directive)
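
As a minimal sketch of that last hop, assuming the page writer behaves
like an OutputStreamWriter set to UTF-8 (the class and method names below
are made up for illustration, they are not part of JSP):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class PageEncodingSketch {
    // Hypothetical helper: encodes a (UTF-16) String the way a UTF-8
    // page writer would, and returns the bytes that go on the wire.
    static byte[] renderAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String fromDatabase = "\u65E5"; // one real Unicode character
        byte[] page = renderAsUtf8(fromDatabase);
        System.out.println(fromDatabase.length() + " char -> " + page.length + " bytes");
    }
}
```

The String holds one character the whole way through; the three UTF-8
bytes only come into existence at the output boundary.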

It sounds like you're doing this:

UTF-16 (database) is being used to store UTF-8 bytes ->
UTF-16 (JVM) -> ISO 8859-1 (restoring the UTF-8 bytes to "UTF-8-ness")
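
A rough sketch of what that round trip looks like in plain Java, assuming
the stored value is UTF-8 bytes that were decoded as ISO 8859-1 (the class
and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTripSketch {
    // Decodes UTF-8 bytes as if they were ISO 8859-1: one char per byte.
    static String misreadAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the misreading: Latin-1 encoding gives back the original
    // bytes, which are then decoded as the UTF-8 they really are.
    static String repair(String misread) {
        byte[] originalBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String real = "\u65E5\u672C"; // two Japanese characters
        String stored = misreadAsLatin1(real.getBytes(StandardCharsets.UTF_8));
        System.out.println(real.length());               // 2
        System.out.println(stored.length());             // 6
        System.out.println(repair(stored).equals(real)); // true
    }
}
```

The text survives only because every layer agrees to pretend it is
Latin-1; what the database actually holds is six unrelated characters.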

This is a bad idea because:

1. The database doesn't know that it is storing UTF-8. It will not sort
or search properly on REAL Unicode queries. You have to handle the whole
UTF-8 translation for it. You will have serious problems if/when you
start to store combining sequences or surrogate pairs.

2. You are relying on always using Latin-1 for your JVM output
encoding. If you generate a date for Japan using a DateFormat (a longer
one than SHORT), you will get "?" instead of the day/month/year
characters (since the JVM will convert CJK characters in Unicode to
Latin-1, where they don't exist).

3. The HTTP header on your page will have the page encoding. It is a good
idea for the HTTP header and META tag to match. Furthermore, if you ever
use XSL to provide tag libraries, you'll be out of luck: the XML parser
generally writes the META tag for you, based on the page directive :).
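
Points 1 and 2 can be illustrated with a small sketch. The class and
helper names are invented for the example, and the exact text a
DateFormat produces depends on the JDK's locale data, so the last line
is printed without a claimed result:

```java
import java.nio.charset.StandardCharsets;
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

class Latin1OutputSketch {
    // Hypothetical helper: pushes text through an ISO 8859-1 encoder and
    // back, which is roughly what a Latin-1 page writer does to each char.
    static String throughLatin1(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Point 1: a search for the real character misses the disguised bytes.
        String real = "\u65E5";
        String disguised = new String(real.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(disguised.contains(real)); // false

        // Point 2: characters with no Latin-1 equivalent become "?".
        System.out.println(throughLatin1("2000\u5E749\u670820\u65E5")); // 2000?9?20?

        // The same thing happens to a date formatted for Japan.
        String date = DateFormat.getDateInstance(DateFormat.FULL, Locale.JAPAN)
                .format(new Date());
        System.out.println(throughLatin1(date));
    }
}
```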

Hope this helps,

Addison

On Wed, 20 Sep 2000, Stephen Toner wrote:

> What exactly happens when I use the <%@ page contentType="text/html; utf-8"
> %> directive? When I include this, for example, letters in the database which
> were stored correctly are rendered incorrectly. They come in exactly the
> same form as from the form, but when I try to output them to a page with this
> directive it doesn't combine the UTF-8 bytes to form the character, and
> instead treats the bytes as separate characters. Without it I just use the
> meta tag to interpret the bytes as UTF-8.
> ----- Original Message -----
> From: <addison@inter-locale.com>
> To: "Stephen Toner" <toners5@hotmail.com>
> Cc: "Unicode List" <unicode@unicode.org>
> Sent: Tuesday, September 19, 2000 4:36 PM
> Subject: Re: UTF-8
>
>
> Hi Stephen,
>
> Java's internal encoding is UTF-16. Every String is encoded as
> UTF-16. Since no web pages are generated in that encoding, JSP provides a
> basic mechanism for setting up a character set converter (essentially an
> InputStreamReader and an OutputStreamWriter).
>
> The default page encoding for JSP is ISO-8859-1. The processing page will
> hand you UTF-8 instead of 8859-1 if you use the <%@ page
> contentType="text/html; charset=utf-8" %> directive in your page.
>
> If you wish to receive a UTF-8 "POST" or "GET" in an 8859-1 page, you will
> need to set up the InputStreamReader to convert the characters yourself. I
> know I'm being sketchy here, but I'm running late this morning. Let me
> know if the contentType directive doesn't fix your problem.
>
>
> Thanks,
>
> Addison
>
> On Tue, 19 Sep 2000, Stephen Toner wrote:
>
> > Hi,
> > I am still having trouble with inputted UTF-8 from a browser. The problem
> > is that my database can't store UTF-8 but only UTF-16. I have tried to
> > convert between the two with little success. The trouble is that the
> > inputted string is obtained from the request object using
> > String temp=request.getParameter("TheText");
> > This leaves me with a string which I think (please correct me if I'm wrong)
> > is correctly encoded in UTF-8 (for example, a Japanese character was
> > converted to a 3-byte sequence). However, the String API only allows me
> > to convert a byte array containing non-Unicode text to Unicode, or to
> > convert a String object into a byte array of non-Unicode characters. But
> > what I have is a string of non-Unicode characters which I must convert to
> > Unicode characters. I tried converting it to bytes, which without
> > specifying the encoding left 2 question marks in, and with specifying the
> > encoding as UTF-8 just converted each character to UTF-16 giving 6 bytes
> > instead of the 2 bytes that I wanted. If I was able to somehow get the byte
> > values for each character I would be flying, but unfortunately a load of
> > different characters get converted to 3F, the code for a question mark.
> > Does anyone know of any way of converting directly in Java?
> > Also, when I submit a form page with the encoding specified as UTF-8, what
> > actually does the converting from what is in the form to UTF-8?
> > Thanks for any help,
> > Stephen
> >
>
> ===========================================================
> Addison P. Phillips Principal Consultant
> Inter-Locale LLC http://www.inter-locale.com
> Los Gatos, CA, USA mailto:addison@inter-locale.com
>
> +1 408.210.3569 (mobile) +1 408.904.4762 (fax)
> ===========================================================
> Globalization Engineering & Consulting Services
>
>
>
