Re: UTF8 and URL

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Tue Aug 19 1997 - 15:33:54 EDT


On Wed, 13 Aug 1997, Yung-Fong Tang wrote:

> Martin asked for opinions about using UTF8 in URLs. I just found one
> interesting thing:
> Currently we see several new protocols getting mapped to URLs. Some of
> the protocols may already define how to deal with the encoding/text
> issue. So, when we consider restricting URLs to UTF8, we should think
> about setting a guideline for them...

Of course. The guideline, in a few short words, is that UTF-8 is used
when a URL exists on its own, independent of any carrying text, and
as the basis for %HH-encoding.

It is not necessarily used if the URL is carried as text, e.g. in an
HTML document. An HTML document carries characters, and these may be in
many different encodings. Assume a document arrives encoded in
charset=koi8-r. It contains an imap URL with Cyrillic characters. The
HTML parser or some related component detects this URL, knows it is in
KOI8-R, converts it to UTF-8, and adds %HH-escaping. This "normal form"
of the URL is then passed to the imap part of the client, which
converts that URL to modified UTF-7 (as explained by John Cowan
in an earlier mail).
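
A minimal sketch of the conversion chain above, in Python. The function
names (to_normal_form, imap_utf7_encode, normal_form_to_imap) and the
mailbox name are hypothetical, and only the mailbox-name part of the URL
is handled, not a complete imap URL. The encoder follows the modified
UTF-7 rules of RFC 2060, section 5.1.3; it is an illustration, not a
complete implementation.

```python
import base64
from urllib.parse import quote, unquote


def to_normal_form(raw: bytes, charset: str) -> str:
    """Decode bytes taken from the document, then emit UTF-8 with %HH-escaping."""
    text = raw.decode(charset)
    return quote(text, safe="", encoding="utf-8")


def imap_utf7_encode(name: str) -> str:
    """Encode a mailbox name in IMAP modified UTF-7 (RFC 2060, section 5.1.3)."""
    out = []
    run = []  # pending characters outside printable ASCII

    def flush() -> None:
        # Emit the pending run as "&" + modified base64 of UTF-16BE + "-".
        if run:
            b64 = base64.b64encode("".join(run).encode("utf-16-be")).decode("ascii")
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            run.clear()

    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append("&-" if ch == "&" else ch)  # "&" itself is written "&-"
        else:
            run.append(ch)
    flush()
    return "".join(out)


def normal_form_to_imap(normal_form: str) -> str:
    """The imap part of the client: undo %HH (as UTF-8), then encode for IMAP."""
    return imap_utf7_encode(unquote(normal_form, encoding="utf-8", errors="strict"))


# Simulate a Cyrillic mailbox name arriving in a KOI8-R document.
raw = "Почта".encode("koi8_r")
normal = to_normal_form(raw, "koi8_r")
print(normal)                       # the "normal form" handed between components
print(normal_form_to_imap(normal))  # the name as sent over IMAP
```

The HTML side only needs to know the document's charset, and the imap
side only needs to know that the normal form is UTF-8 behind %HH.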

This leads to exactly the behaviour the user expects: the mailbox
with the same Cyrillic name is accessed regardless of whether the
name is typed into a standalone mail UA that supports IMAP or
turns up in an HTML document using KOI8-R or any of the other
frequently used encodings for Cyrillic.
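
As a small illustration of this point, the sketch below (Python; the
helper to_normal_form, the mailbox name, and the list of charsets are
assumptions chosen for the example) encodes one Cyrillic name in several
legacy encodings and checks that each yields the same UTF-8 based form.

```python
from urllib.parse import quote


def to_normal_form(raw: bytes, charset: str) -> str:
    """Decode with the document's charset, then emit UTF-8 with %HH-escaping."""
    return quote(raw.decode(charset), safe="", encoding="utf-8")


name = "Почта"
# Python codec names for some common Cyrillic encodings.
charsets = ["koi8_r", "cp1251", "iso8859_5", "cp866"]

# The raw bytes differ per charset, but the normal form should not.
forms = {cs: to_normal_form(name.encode(cs), cs) for cs in charsets}
for cs, form in forms.items():
    print(cs, name.encode(cs).hex(), form)
```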

In the above, the %HH-escaping step may be skipped without problems,
and even the UTF-8 step may be skipped. But we need UTF-8 for the cases
where URLs are stored independently of a document, and it also
lets us cleanly separate the different parts of the client software.
We may also need %HH-escaping (with one single interpretation!) for
cases where nothing else can be input.
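
To see why a single interpretation matters, here is a hedged sketch in
Python (the mailbox name and the choice of charsets are assumptions for
the example). %HH only records bytes, so an escaped string produced from
a legacy charset reads correctly only if the receiver guesses that
charset, whereas escapes that are always UTF-8 need no guess.

```python
from urllib.parse import quote, unquote

name = "Почта"

# %HH-escaping over legacy bytes: the charset is not recorded anywhere.
escaped_koi8 = quote(name, safe="", encoding="koi8_r")
as_koi8 = unquote(escaped_koi8, encoding="koi8_r")    # right guess
as_cp1251 = unquote(escaped_koi8, encoding="cp1251")  # wrong guess, wrong name

# %HH-escaping with the one agreed interpretation (UTF-8).
escaped_utf8 = quote(name, safe="", encoding="utf-8")
as_utf8 = unquote(escaped_utf8, encoding="utf-8")

print(escaped_koi8, as_koi8 == name, as_cp1251 == name)
print(escaped_utf8, as_utf8 == name)
```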

The IMAP example is a very nice case study, an excellent example
of how future URL schemes should be designed (of course, if IMAP
used UTF-8, everything would be even easier, but then it wouldn't
be such a nice case study). Unfortunately, other cases, in particular
HTTP, are not quite as easy, but there are many possibilities
for dealing with backwards compatibility issues.

Regards, Martin.


