Re: URLEncode international characters

From: addison@inter-locale.com
Date: Mon Oct 02 2000 - 18:19:26 EDT


Hi Raghu,

Your problem is probably this:

Your servlet is running in the default locale of your server and thus
assumes Latin-1 (ISO-8859-1) as the character set of the transaction. It
therefore converts each individual byte of your EUC-JP stream to the Java
internal representation of the matching Latin-1 character (Java strings
use the UCS-2 encoding of Unicode). It then stores them in your EUC-JP
database by converting those Latin-1 characters to their EUC-JP
equivalents. Thus:

0xA1 0xA1 --> U+00A1 U+00A1 --> 0x8F + character 2014 in this case, etc...

When you retrieve the data, the underlying "multibyte trash" comes back as
U+00A1 U+00A1, which shows up as the ACTUAL Latin-1 characters on your
display, because converting the String to EUC-JP produces the same 0x8F
trash you stored in the database. PASTING the multibyte trash from your
display "makes the characters back into" EUC, as far as your web browser
is concerned.
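
To make the first step of that chain concrete, here is a minimal Java sketch of what happens to the two bytes 0xA1 0xA1 when they are decoded as Latin-1 instead of EUC-JP. The class and method names are made up for illustration, and the sketch assumes your JDK ships the EUC-JP charset. The repair() method shows a commonly used workaround (re-encode as Latin-1 to get the original bytes back, then decode as EUC-JP); it is not something your servlet does today, and fixing the decoding at the source is still the better cure.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any servlet API.
class MojibakeDemo {
    // Assumption: the running JDK includes the EUC-JP charset.
    static final Charset EUC_JP = Charset.forName("EUC-JP");

    /** Decodes bytes the way a Latin-1-defaulting servlet would: one char per byte. */
    static String misdecodeAsLatin1(byte[] eucBytes) {
        return new String(eucBytes, StandardCharsets.ISO_8859_1);
    }

    /** Latin-1 maps every byte to exactly one char, so the original bytes can be recovered. */
    static String repair(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, EUC_JP);
    }

    /** Formats each UTF-16 code unit as U+XXXX for inspection. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xA1, (byte) 0xA1}; // a single EUC-JP character on the wire
        String wrong = misdecodeAsLatin1(wire);
        System.out.println("decoded as Latin-1: " + codeUnits(wrong));
        System.out.println("after repair:       " + codeUnits(repair(wrong)));
    }
}
```

The Latin-1 decoding yields two characters where EUC-JP sees only one, which is why the damage is invisible until the String is converted again on the way to the database or the browser.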

This misconversion can happen in one of a number of places:

1. You are using a multipart MIME interpreter, such as the one on the
O'Reilly website, which is not character set aware. You need to add
support for EUC to the interpreter.
2. You are using an InputStreamReader and need to supply a charset
argument to the constructor.
3. You are using JSP and need to supply a <%@ page
contentType='text/html;charset=euc-jp'%> directive
to signal what character set to expect.
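
For case 2, a rough sketch of the difference the charset argument makes is below. The class and helper names are invented for the example, the byte array stands in for whatever stream your servlet actually reads, and it again assumes the JDK provides EUC-JP. It uses the InputStreamReader(InputStream, Charset) constructor; the older constructor that takes the encoding name as a String works the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only.
class CharsetAwareReader {
    /** Reads the whole stream, decoding bytes with the charset named by the caller. */
    static String readAll(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        // The second constructor argument is what keeps the reader from
        // falling back to the platform default encoding.
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Convenience wrapper so the example can run on an in-memory byte array. */
    static String decode(byte[] bytes, String charsetName) {
        return readAll(new ByteArrayInputStream(bytes), charsetName);
    }

    public static void main(String[] args) {
        byte[] form = {(byte) 0xA4, (byte) 0xA2}; // HIRAGANA LETTER A in EUC-JP
        System.out.println(decode(form, "EUC-JP").length());     // one character
        System.out.println(decode(form, "ISO-8859-1").length()); // two characters
    }
}
```

The same two bytes come out as one Japanese character or as two Latin-1 characters depending only on the charset handed to the reader, so this is the first place to check.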

Hope this helps,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 2 Oct 2000, Raghu Kolluru wrote:

>
> I have a simple servlet which gets the form fields and stores them in a
> SQL server db. Now I am trying to store and retrieve international
> characters (charset EUC-JP).
>
> The problem I am having here is:
> The first time I send the characters, Java gets them as ASCII and
> returns some junk to the browser (IE 5.5). Now here is the interesting
> thing: I append the same characters to the junk and submit it, and the
> later text appears fine in the browser.
>
> Question:
> I am thinking that the first time the browser encodes the text in ASCII,
> and later it encodes it properly. Is there any way that I can solve this?
> Any help is greatly appreciated.
>
> Raghs
>
> > -----Original Message-----
> > From: Raghu Kolluru
> > Sent: Monday, October 02, 2000 10:52 AM
> > To: 'addison@inter-locale.com'
> > Subject: RE: Major site in unicode?
> >
> >
> > Great! Thanks.
> >
> > > -----Original Message-----
> > > From: addison@inter-locale.com [mailto:addison@inter-locale.com]
> > > Sent: Monday, October 02, 2000 10:24 AM
> > > To: Unicode List
> > > Cc: Unicode List
> > > Subject: RE: Major site in unicode?
> > >
> > >
> > > It knows because:
> > >
> > > 1. You sent the page in that character set, or;
> > > 2. You embedded a token in the page to tell the CGI program what the
> > > character set was, or;
> > > 3. You used the (IE only) hack to get the browser to embed it in a
> > > hidden field, or;
> > > 4. You guessed it based on a heuristic (or from the user's session
> > > information, maintained in the URL or in a cookie).
> > >
> > > This sounds complex, but it isn't all that bad. Very few users will be
> > > foolish enough to change their display encoding to something that
> > > displays the page incorrectly...
> > >
> > > Actually, all this talk of "setting browser to Unicode" and "setting
> > > the browser to code page" is based on a poor assumption or set of
> > > assumptions. What's getting set is the character encoding of the HTML
> > > page itself. If done correctly, the browser will read it from the HTTP
> > > header and(or) the META tag.
> > >
> > > The current best practice for creating multilingual capable web sites
> > > (even if they happen to be mono-lingual at any one URL) is to use
> > > Unicode (either UTF-8 or UTF-16, depending on your operating
> > > environment) internally at the server. A decision can be made to
> > > deliver either UTF-8 or a non-Unicode legacy encoding at page delivery
> > > time. At this point in time, most pages are NOT delivered as UTF-8,
> > > even though the server-side systems are entirely Unicode, because of
> > > the problems cited earlier with older Netscape and IE browsers and
> > > their still relatively large market share.
> > >
> > > Choosing this architecture allows you to construct single-source code
> > > systems, access databases and data warehouses, and build applications
> > > in a locale independent way. This vastly simplifies maintenance,
> > > testing, and deployment compared to legacy charset systems.
> > >
> > > ... many programmers, of course, would like to eliminate the
> > > complexity of the charset conversion at delivery time, and this day is
> > > coming. I suggest that you parse UserAgent strings at the start of a
> > > session with a user and determine if UTF-8 can be sent to the browser
> > > (it can in the majority of cases and the vast majority of Western and
> > > Eastern European cases: Asian locales are the big hangup here), and
> > > set the result into the session (see #4 above).
> > >
> > > Hope this helps.
> > >
> > > Addison
> > >
> > > On Mon, 2 Oct 2000, Raghu Kolluru wrote:
> > >
> > > > > > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > > > > > >> possibly 8859-2 as well. Unicode is an ISO standard too (ISO/IEC
> > > > > > >> 10646-1).
> > > > > > >
> > > > > > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > > > > > Central European or Western European site is only in Unicode,
> > > > > > > then all will show up correctly?
> > > > > >
> > > > > > If your browser is old enough that it can only be "set to" a single
> > > > > > character set, and this setting cannot be overridden by a "charset=X"
> > > > > > tag in the HTML page, then no, it will not be displayed correctly.
> > > > > > But this sort of rigidity is not present in modern browsers.
> > > >
> > > > How does the CGI program know that the data submitted is of
> > > > "charset=EUC-JP"?
> > > >
> > > > Raghu Kolluru, Software Engg.
> > > > GO.com | Walt Disney Internet Group
> > > > 206-664-4267 | raghu.kolluru@dig.com
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Doug Ewell [mailto:dewell@compuserve.com]
> > > > > Sent: Sunday, October 01, 2000 11:48 PM
> > > > > To: Unicode List
> > > > > Subject: Re: Major site in unicode?
> > > > >
> > > > >
> > > > > > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > > > > > >> possibly 8859-2 as well. Unicode is an ISO standard too (ISO/IEC
> > > > > > >> 10646-1).
> > > > > > >
> > > > > > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > > > > > Central European or Western European site is only in Unicode,
> > > > > > > then all will show up correctly?
> > > > > >
> > > > > > If your browser is old enough that it can only be "set to" a single
> > > > > > character set, and this setting cannot be overridden by a "charset=X"
> > > > > > tag in the HTML page, then no, it will not be displayed correctly.
> > > > > > But this sort of rigidity is not present in modern browsers.
> > > > >
> > > > > > >> The browser you are thinking of is Netscape Navigator (pre-4.7).
> > > > > > >> Support for Unicode in all browsers is improving steadily, and as
> > > > > > >> it does, your 'adamant' programmers will end up using
> > > > > > >> Unicode-encoded sites without even realizing it.
> > > > > >
> > > > > > > When? 5 years from now? As for using Unicode without realizing
> > > > > > > it, what do you mean? If a Russian's browser is set to CP1251,
> > > > > > > what happens if the site is in Unicode? At present he gets
> > > > > > > garbage. I've tried the setting that automatically changes to the
> > > > > > > character set of the page. Doesn't work very well. I think the
> > > > > > > character set indication gets left out in many sites.
> > > > >
> > > > > > Browsers are supposed to be able to switch automatically to the
> > > > > > character set used by the target page, but they cannot necessarily
> > > > > > do this blindly by auto-detecting the character set. It is supposed
> > > > > > to be indicated by the page using the "charset=X" tag. Sites that
> > > > > > do not do this are not giving browsers a fair chance to display the
> > > > > > page properly. This is not the fault of Unicode or the browser, but
> > > > > > of the HTML author.
> > > > >
> > > > > > > I don't disagree with this. It's just at present moment,
> > > > > > > Netscape and Explorer don't seem ready. What would really be
> > > > > > > needed is the browser automatically detects the site as being in
> > > > > > > Unicode, and switches to that character set. Then sites could
> > > > > > > switch over without worry. That is not the case at the moment. So
> > > > > > > the user has to change the character set himself.
> > > > >
> > > > > > Try using a recent version of your favorite browser (IE version 5.0
> > > > > > or above, or NN version 4.7 or above).
> > > > >
> > > > > > I think the real problem here is that you, your team, and your
> > > > > > users in Russia are working with older versions of software that
> > > > > > did not properly handle Unicode, and are assuming that newer
> > > > > > versions will not support Unicode either. Thankfully, this is not
> > > > > > the case.
> > > > >
> > > > > -Doug Ewell
> > > > > Fullerton, California
> > > > >
> > > >
> > >
> >
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT