Re: UTF-8

From: Stephen Toner (toners5@hotmail.com)
Date: Wed Sep 20 2000 - 07:16:57 EDT


What exactly happens when I use the <%@ page contentType="text/html; utf-8"
%> directive? When I include it, letters from the database which were
stored correctly, for example, are rendered incorrectly. They arrive in
exactly the same form as they do from the form, but when I try to output
them to a page with this directive it doesn't combine the UTF-8 bytes to
form the character, and instead treats the bytes as separate characters.
Without it, I just use the meta tag to interpret the bytes as UTF-8.
----- Original Message -----
From: <addison@inter-locale.com>
To: "Stephen Toner" <toners5@hotmail.com>
Cc: "Unicode List" <unicode@unicode.org>
Sent: Tuesday, September 19, 2000 4:36 PM
Subject: Re: UTF-8

Hi Stephen,

Java's internal encoding is UTF-16. Every String is encoded as
UTF-16. Since no web pages are generated in that encoding, JSP provides a
basic mechanism for setting up a character set converter (essentially an
InputStreamReader and an OutputStreamWriter).
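
As a rough illustration of the converter pair mentioned above, here is a
minimal sketch using java.io.InputStreamReader and
java.io.OutputStreamWriter (the writer is the output-side class). The class
name ConverterSketch and its helper methods are made up for this example,
and it runs outside any JSP container, so it only shows the conversion
step, not what a particular container does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JSP or the servlet API.
public class ConverterSketch {

    // Encode a Java String (UTF-16 internally) to UTF-8 bytes via a Writer.
    public static byte[] encodeUtf8(String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8);
            out.write(text);
            out.close(); // closing flushes the encoder into the byte stream
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode UTF-8 bytes back into a Java String via a Reader.
    public static String decodeUtf8(byte[] bytes) {
        try {
            Reader in = new InputStreamReader(
                    new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            in.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65E5"; // one Japanese character, one UTF-16 code unit
        byte[] utf8 = encodeUtf8(text);
        System.out.println(utf8.length);                   // 3
        System.out.println(decodeUtf8(utf8).equals(text)); // true
    }
}
```

The point of the sketch is that the String itself never "is" UTF-8; UTF-8
only exists on the byte side of the reader or writer.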

The default page encoding for JSP is ISO-8859-1. The processing page will
hand you UTF-8 instead of 8859-1 if you use the <%@ page
contentType="text/html; charset=utf-8" %> directive in your page.

If you wish to receive a UTF-8 "POST" or "GET" in an 8859-1 page, you will
need to set up the InputStreamReader to convert the characters yourself. I
know I'm being sketchy here, but I'm running late this morning. Let me
know if the contentType directive doesn't fix your problem.
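
One way to do the manual conversion, sketched under the assumption that the
container decoded the submitted UTF-8 bytes as ISO-8859-1 (one char per
byte): turn the String back into those bytes and decode them again as
UTF-8. This uses the String byte-array constructor in place of an explicit
InputStreamReader; the class and method names are invented for the example,
and the input string stands in for what request.getParameter("TheText")
might return in that situation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the servlet API.
public class RedecodeSketch {

    // Recover text whose UTF-8 bytes were each decoded as one
    // ISO-8859-1 character.
    public static String latin1ToUtf8(String misdecoded) {
        // ISO-8859-1 maps chars 0x00-0xFF one-to-one onto bytes,
        // so this recovers the original byte sequence.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed input: the three UTF-8 bytes E6 97 A5 seen as three chars.
        String misdecoded = "\u00E6\u0097\u00A5";
        String fixed = latin1ToUtf8(misdecoded);
        System.out.println(fixed.length());         // 1
        System.out.println(fixed.equals("\u65E5")); // true
    }
}
```

This only works if the assumption holds. If the parameter was decoded with
some other encoding, or characters were already replaced by question marks,
the original bytes cannot be recovered this way.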

Thanks,

Addison

On Tue, 19 Sep 2000, Stephen Toner wrote:

> Hi,
> I am still having trouble with UTF-8 input from a browser. The problem
> is that my database can't store UTF-8, only UTF-16. I have tried to
> convert between the two with little success. The trouble is that the
> input string is obtained from the request object using
> String temp=request.getParameter("TheText");
> This leaves me with a string which I think (please correct me if I'm
> wrong) is correctly encoded in UTF-8 (for example, a Japanese character
> was converted to a 3-byte sequence). However, the String API only allows
> me to convert a byte array containing non-Unicode text to Unicode, or to
> convert a String object into a byte array of non-Unicode characters. But
> what I have is a string of non-Unicode characters which I must convert
> to Unicode characters. I tried converting it to bytes, which without
> specifying the encoding left 2 question marks in, and with the encoding
> specified as UTF-8 just converted each character to UTF-16, giving 6
> bytes instead of the 2 bytes that I wanted. If I were able to somehow
> get the byte values for each character I would be flying, but
> unfortunately a load of different characters get converted to 3F, the
> code for a question mark.
> Does anyone know of any way of converting directly in Java?
> Also, when I submit a form page with the encoding specified as UTF-8,
> what actually does the converting from what is in the form to UTF-8?
> Thanks for any help,
> Stephen
>

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services


