At 15:02 01/04/26 -0700, Paul Deuter wrote:
>Based on the responses, I guess my original question/problem was not
>very well written.
>The %XX idea does not work because it is already in use by lots of
>software to encode many different character sets. So again we need
>something that identifies it as UTF-8.
First, %XX is already used with lots of different encodings. Adding one
more (UTF-8) won't make things much worse.
Second, it turns out that UTF-8 is extremely easy to detect/check,
the easiest of all encodings. For details, see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
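
A minimal sketch of such a check, assuming Python and its standard library (the function name looks_like_utf8 is made up for illustration and does not come from the paper above): undo the %XX escapes to get raw bytes, then test whether those bytes are well-formed UTF-8.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(escaped: str) -> bool:
    """Return True if the %XX-escaped string decodes to well-formed UTF-8."""
    # Turn the %XX escapes back into raw bytes without assuming an encoding.
    raw = unquote_to_bytes(escaped)
    try:
        # A strict decode rejects any byte sequence that is not valid UTF-8.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("caf%C3%A9"))  # "é" sent as the UTF-8 bytes C3 A9 -> True
print(looks_like_utf8("caf%E9"))     # "é" sent as the Latin-1 byte E9 -> False
```

Pure ASCII input also passes, since ASCII is a subset of UTF-8. The check is only informative for strings that contain bytes above 0x7F.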
Apart from that, the HTTP protocol says exactly what you can send,
and so you can't just invent something new (such as %u....),
even though it might work 'sometimes'.
>I see this as somewhat analogous to the invention of the U+XXXX notation
>in Unicode consortium writings? They needed a completely unambiguous way
>to tell their readers that the 16 bit value was not "any" 16 bit value
>but rather a specific Unicode codepoint. They invented a new kind of escape
>sequence that said two things: what follows is hex *and* Unicode.
>
>I see the BOM as filling the same need for text files. It was not enough
>to invent Unicode but also a way to identify the encoding.
The BOM for UTF-8 is doing a lot of damage. All the tools that
would work very nicely without the BOM stop working.
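
One illustration of the kind of breakage meant here, as a sketch in Python (the shell script content is a made-up example): a tool that looks for "#!" in the first two bytes of a file no longer finds it once the three BOM bytes EF BB BF are prepended.

```python
import codecs

# Hypothetical file content: a small shell script.
script = b"#!/bin/sh\necho hello\n"
# The same content as saved by an editor that prepends a UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + script

print(script.startswith(b"#!"))    # True: the first two bytes are "#!"
print(with_bom.startswith(b"#!"))  # False: the first bytes are EF BB BF

# A plain UTF-8 decode keeps the BOM as the character U+FEFF.
print(with_bom.decode("utf-8")[0] == "\ufeff")        # True
# Only a BOM-aware codec strips it again.
print(with_bom.decode("utf-8-sig").startswith("#!"))  # True
```

Every consumer of the file would need this extra BOM-aware step, which is exactly what byte-oriented tools lack.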
Regards, Martin.