Unicode in a URL

From: Paul Deuter (Paul.Deuter@plumtree.com)
Date: Wed Apr 25 2001 - 22:41:07 EDT


I am struggling to figure out the correct method for encoding Unicode
characters in the query string portion of a URL.

There is a W3C spec that says each Unicode character should be converted
to UTF-8 and each resulting byte then encoded as %XX. In my experience,
however, browsers will encode text in any character set this way, and
IIS at least will interpret such hex bytes according to the character
set that is set on the receiving page. That is to say, the target page
will read the query string, and these hex bytes may be interpreted as
ISO-8859-1, Big5, or Shift-JIS depending on the target page.
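
To make the mismatch described above concrete, here is a minimal Python
sketch. It is only an illustration: the sample character and variable
names are assumptions, and Python's urllib stands in for what a browser
and a server would each do. It is not a model of how IIS behaves
internally.

```python
from urllib.parse import quote, unquote

# Sample character chosen for illustration: U+00E9 (e with acute accent).
text = "\u00e9"

# Sender side: convert to UTF-8, then write each byte as %XX.
utf8_encoded = quote(text, safe="", encoding="utf-8")

# Receiver side: the same hex bytes read under two different charsets.
read_as_utf8 = unquote(utf8_encoded, encoding="utf-8")
read_as_latin1 = unquote(utf8_encoded, encoding="iso-8859-1")

# A sender that uses the page charset (here ISO-8859-1) instead of UTF-8
# produces different bytes for the same character.
latin1_encoded = quote(text, safe="", encoding="iso-8859-1")

print(utf8_encoded)            # %C3%A9
print(latin1_encoded)          # %E9
print(read_as_utf8 == text)    # True
print(read_as_latin1 == text)  # False: two Latin-1 characters instead of one
```

The %XX form carries only bytes, so the receiver has to know (or guess)
which character set produced them.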

With IIS 5.0, I have stumbled onto the solution of using %uXXXX, where
XXXX is the hexadecimal value of the Unicode character. When I pass
Unicode data formatted this way on Windows 2000/IIS 5.0, the data always
seems to be decoded properly. (Apparently this format came from
ECMAScript.)
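
For reference, the %uXXXX form can be sketched in Python as below. The
function names encode_percent_u and decode_percent_u are hypothetical,
and the choice of which ASCII characters to leave unescaped is an
assumption made for this example. It writes every other UTF-16 code unit
as %uXXXX, which is simpler than what ECMAScript's escape() does (that
function uses %XX for code units below 256).

```python
import re

# Hypothetical helpers illustrating the non-standard %uXXXX format.
_UNESCAPED_PUNCT = "-_.~"  # assumption: ASCII punctuation left as-is


def encode_percent_u(text: str) -> str:
    """Write each UTF-16 code unit as %uXXXX, keeping ASCII alphanumerics."""
    data = text.encode("utf-16-be")
    out = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        ch = chr(unit)
        if unit < 128 and (ch.isalnum() or ch in _UNESCAPED_PUNCT):
            out.append(ch)
        else:
            out.append(f"%u{unit:04X}")
    return "".join(out)


_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(encoded: str) -> str:
    """Replace each %uXXXX with its code unit, then rejoin surrogate pairs."""
    units = _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), encoded)
    # Round-trip through UTF-16 so that two surrogate code units become
    # one character outside the Basic Multilingual Plane.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


sample = "q=\u65e5\u672c"  # U+65E5 U+672C, chosen for illustration
escaped = encode_percent_u(sample)
print(escaped)                              # q%u003D%u65E5%u672C
print(decode_percent_u(escaped) == sample)  # True
```

Unlike %XX, this form names the Unicode value directly, so decoding does
not depend on the character set of the receiving page. It does depend on
the server recognizing %u at all.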

I don't particularly like the %uXXXX format (primarily because it does
NOT work on NT 4.0/IIS 4.0), and I doubt that it would work at all well
on other web servers. Does anyone know of an encoding method that will
actually be decoded properly by a variety of web servers?

Thanks in advance
-Paul


