Based on the responses, I guess my original question/problem was not
very well written.
UTF-7 won't work because it cannot be distinguished from ASCII without
something that identifies it as UTF-7.
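
To illustrate the ambiguity, here is a small Python sketch. The sample bytes are made up for the example; the point is only that the same seven-bit data decodes cleanly as ASCII and as UTF-7, with different results, so nothing in the data itself says which reading was intended.

```python
# The same seven-bit bytes, read two ways (sample data is hypothetical).
raw = b"caf+AOk-"

# Every byte is valid ASCII, so this succeeds and keeps the "+AOk-" literally.
as_ascii = raw.decode("ascii")

# Read as UTF-7, "+AOk-" is a shift sequence that encodes U+00E9.
as_utf7 = raw.decode("utf-7")

print(repr(as_ascii))
print(repr(as_utf7))
```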
The %XX idea does not work because it is already in use by lots of
software to encode many different character sets. So again we need
something that identifies it as UTF-8.
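
A rough Python sketch of the problem, using the standard urllib.parse functions: the same character produces different %XX escapes depending on the character set the sender assumed, and a receiver that guesses wrong gets garbage without any error being raised. The choice of Latin-1 as the "other" character set is just an assumption for the example.

```python
from urllib.parse import quote, unquote

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The escape depends entirely on the character set the sender assumed.
latin1_form = quote(ch, encoding="latin-1")  # '%E9'
utf8_form = quote(ch, encoding="utf-8")      # '%C3%A9'

# A receiver that guesses the wrong character set gets silent mojibake ...
misread = unquote(utf8_form, encoding="latin-1")

# ... or a replacement character, when the bytes are not valid UTF-8.
replaced = unquote(latin1_form, encoding="utf-8")

print(latin1_form, utf8_form, repr(misread), repr(replaced))
```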
What is needed is an escape code that implicitly indicates the Unicode
character set.
I see this as somewhat analogous to the invention of the U+XXXX notation
in Unicode Consortium writings. They needed a completely unambiguous way
to tell their readers that the 16-bit value was not "any" 16-bit value
but rather a specific Unicode codepoint. They invented a new kind of escape
sequence that said two things: what follows is hex *and* Unicode.
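
For reference, the notation is easy to produce mechanically. A minimal Python sketch follows; the function name u_notation is made up for this example.

```python
def u_notation(ch: str) -> str:
    """Format one character as U+XXXX (hex, at least four digits)."""
    # The "U+" prefix says "Unicode"; the digits are the code point in hex.
    return f"U+{ord(ch):04X}"

print(u_notation("A"))       # U+0041
print(u_notation("\u00e9"))  # U+00E9
```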
I see the BOM as filling the same need for text files. It was not enough
to invent Unicode; a way to identify the encoding was also needed.
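
As a sketch of how a reader of a text file can use the BOM this way, the Python below checks the leading bytes against the BOM constants in the standard codecs module. The helper name sniff_bom and its return shape are assumptions made for the example, not an established API.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding name or None, data with any BOM removed)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    # No BOM: the encoding has to be identified some other way.
    return None, data

name, payload = sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
print(name, payload.decode(name))
```

Without a BOM the function returns None for the encoding, which is exactly the situation a bare %XX or UTF-7 string is in.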
Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com
-----Original Message-----
From: Markus Scherer [mailto:markus.scherer@jtcsv.com]
Sent: Thursday, April 26, 2001 11:29 AM
To: unicode
Subject: Re: Unicode in a URL
Paul Deuter wrote:
> I am wondering if there isn't a need for the Unicode Spec to also
> dictate a way of encoding Unicode in an ASCII stream. Perhaps
How many more ways do we need?
To be 8-bit-friendly, we have UTF-8.
To get everything into ASCII characters, we have UTF-7.
W3C specifies to use %-encoded UTF-8 for URLs.
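
A short Python sketch of that convention, using urllib.parse from the standard library. The path is a made-up example, and the strict decode on the receiving side is one possible design choice, not something the W3C text prescribes.

```python
from urllib.parse import quote, unquote

path = "/wiki/Z\u00fcrich"  # hypothetical path with one non-ASCII letter

# Sender: convert to UTF-8 bytes, then %-escape the non-ASCII bytes.
encoded = quote(path, encoding="utf-8")

# Receiver: assume UTF-8 and fail loudly if the bytes are not valid UTF-8.
decoded = unquote(encoded, encoding="utf-8", errors="strict")

print(encoded)
print(decoded == path)
```

With errors="strict", an escape such as %E9 that came from a non-UTF-8 sender raises UnicodeDecodeError instead of being silently misread.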
> -----Original Message-----
> From: addison@inter-locale.com [mailto:addison@inter-locale.com]
> itself. The best way to handle it (from a reliability point of view) is to
> use UTF-8 for everything and to reinterpret the URL using code. The idea
This sounds good, too. Have your pages in UTF-8 and all servers will
interpret URLs as UTF-8.
Especially if browsers encode URLs differently, this is your best choice.
Of course, if this all does not work, the obvious choice for Unicode-broken
systems is to use only ASCII characters to begin with...
markus
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:16 EDT