Re: UTF-7 signature

From: Shlomi Tal (shlompi@hotmail.com)
Date: Thu Apr 11 2002 - 14:03:59 EDT


Markus Scherer wrote:

>+/v8 is the encoding of U+FEFF as the first code point in a text. So far,
>so good.
>The '-' as the next byte switches UTF-7 back to direct-encoding of a subset
>of US-ASCII.
>
>What if there is no '-' there? What if a non-ASCII code point immediately
>follows the U+FEFF?
>In such a case, depending on the following code point, the first four bytes
>could be
> +/v8 or +/v9 or +/v+ or +/v/
>
>The 4th byte will not be '8' if the following code point is >=U+4000.
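
The arithmetic behind those four variants can be checked with a minimal Python sketch. It assumes only what RFC 2152 specifies: a shifted run is '+' followed by unpadded base64 of the UTF-16BE code units. The helper name first_four is made up for illustration, and it assumes the following code point lands in the same base64 run as U+FEFF.

```python
import base64

def first_four(next_cp: int) -> str:
    """First four bytes of the UTF-7 form of U+FEFF followed by next_cp,
    assuming both code points share one base64 run (hypothetical helper)."""
    data = ("\ufeff" + chr(next_cp)).encode("utf-16-be")
    # UTF-7 uses base64 without '=' padding inside a shifted run.
    b64 = base64.b64encode(data).decode("ascii").rstrip("=")
    return ("+" + b64)[:4]

for cp in (0x0391, 0x4000, 0x8000, 0xC000):
    print(f"U+{cp:04X}: {first_four(cp)}")
# U+0391: +/v8
# U+4000: +/v9
# U+8000: +/v+
# U+C000: +/v/
```

The fourth byte carries the last four bits of U+FEFF plus the top two bits of the next code unit, which is why it changes at U+4000, U+8000 and U+C000.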

This is more than the stateful irregularity of UTF-7; it also demonstrates a
violation of the Unicode principle of "one code point per character". You
could write the Unicode sequence U+xxxx U+yyyy as either +uuvww- or
+uvu-+wvw- (the letters are just placeholders; I don't intend any specific
values by them). Ever since I first read about UTF-7, it has shocked me that
Greek "Sokrates" and "S o k r a t e s" (with a space between each pair of
Greek letters in the latter) would have different encodings for the same
Unicode characters.
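
As a rough illustration, the following Python sketch uses the standard library's "utf-7" codec. The two byte strings were worked out by hand for U+03A3 U+03C9 (the first two letters of the Greek name), and the variable names are only for illustration. The exact bytes an encoder emits for the spaced text depend on the implementation, so the sketch prints them rather than assuming them.

```python
# Two different UTF-7 byte sequences for the same two code points,
# U+03A3 (capital sigma) and U+03C9 (small omega).
one_run = b"+A6MDyQ-"       # both code points in a single base64 run
two_runs = b"+A6M-+A8k-"    # each code point in its own run

same_text = one_run.decode("utf-7") == two_runs.decode("utf-7") == "\u03a3\u03c9"
print(same_text)  # True

# The same effect with the full name, joined and with spaces in between.
joined = "Σωκράτης"
spaced = " ".join(joined)

enc_joined = joined.encode("utf-7")
enc_spaced = spaced.encode("utf-7")
print(enc_joined)
print(enc_spaced)

# Dropping the spaces from the spaced encoding does not give the joined
# encoding: the bytes standing for each Greek letter differ between the two.
differs = enc_spaced.replace(b" ", b"") != enc_joined
print(differs)
```

Each space ends the base64 run, so every letter after it starts at a fresh bit offset and comes out as different base64 characters.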

It's a good thing UTF-7 is deprecated; the only reason for still mentioning
it is that it appears as an option on mail clients.

By the way, when converting UTF-16 to UTF-7 through the Win2K/XP command
prompt (doing "chcp 65000" and then piping the output of the UTF-16 file
into a new file), the OS also encodes those characters which are deemed
unsafe by MIME, such as quotation marks, exclamation marks, ampersands and
so forth. This is in contrast to GNU recode (I have the DJGPP 32-bit DOS
version from Simtelnet), which leaves those characters as they are.
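
RFC 2152 allows both behaviours: it lists a set of "optional direct" characters (!"#$%&*;<=>@[]^_`{|}) that an encoder may either leave as they are or put through base64. Below is a minimal Python sketch of the two policies. It is not the code of either tool; the function name utf7_encode and its direct_optional flag are invented for illustration, and it always closes a run with '-' for simplicity.

```python
import base64
import string

# Character sets from RFC 2152.
SET_D = set(string.ascii_letters + string.digits + "'(),-./:?")
SET_O = set('!"#$%&*;<=>@[]^_`{|}')
WHITESPACE = set(" \t\r\n")

def flush(run: list, out: list) -> None:
    """Emit the pending characters as one '+...-' base64 run."""
    if run:
        data = "".join(run).encode("utf-16-be")
        b64 = base64.b64encode(data).decode("ascii").rstrip("=")
        out.append("+" + b64 + "-")
        run.clear()

def utf7_encode(text: str, direct_optional: bool) -> str:
    """Sketch of a UTF-7 encoder (hypothetical helper).

    direct_optional=True leaves the optional direct characters untouched;
    direct_optional=False sends them through base64.
    """
    direct = SET_D | WHITESPACE | (SET_O if direct_optional else set())
    out, run = [], []
    for ch in text:
        if ch == "+":
            flush(run, out)
            out.append("+-")      # a literal plus sign is always written '+-'
        elif ch in direct:
            flush(run, out)
            out.append(ch)
        else:
            run.append(ch)        # collect until the run ends
    flush(run, out)
    return "".join(out)

print(utf7_encode("Hi!", direct_optional=True))   # Hi!
print(utf7_encode("Hi!", direct_optional=False))  # Hi+ACE-
```

Both outputs decode to the same text, so the difference only matters for whether the result survives transports that mangle those characters.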



