Re: UTF-7 signature

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Apr 11 2002 - 14:49:01 EDT


Shlomi Tal wrote:

> UTF-7, it shocked me how Greek "Sokrates" and "S o k r a t e s" (with
> spaces between each Greek letter in the latter) would have different
> encodings for the same Unicode characters.

That is not unusual for stateful encodings.
The same happens with BOCU-1: not in this particular case, but it would with, e.g., U+00A0 between the letters.
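
A minimal sketch of the effect, assuming Python's built-in "utf-7" codec (not the Win2K or recode converters discussed in this thread; their exact bytes may differ). It encodes the same eight Greek letters with and without spaces and checks that the letter bytes are not the same:

```python
# The same eight Greek letters, written with escapes so the text is unambiguous.
packed = "\u03a3\u03c9\u03ba\u03c1\u03ac\u03c4\u03b7\u03c2"
spaced = " ".join(packed)  # one space between each pair of letters

packed_bytes = packed.encode("utf-7")
spaced_bytes = spaced.encode("utf-7")

print(packed_bytes)  # a single base64 run covering all letters
print(spaced_bytes)  # a separate base64 run for each letter

# Dropping the space bytes does not give back the packed form: every
# letter was re-encoded from a fresh shift state.
spaced_without_spaces = spaced_bytes.replace(b" ", b"")
print(spaced_without_spaces == packed_bytes)
```

Both byte strings decode back to their original text; only the byte-level representation of the letters depends on what surrounds them.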

> It's a good thing UTF-7 is deprecated; ...

I agree. In an 8bit-clean world, it has little use.
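It's interesting, though, to see that UTF-7 actually encodes some text with fewer bytes than UTF-8: inside a base64 run, every character in U+0800..U+FFFF takes about 2.67 bytes in UTF-7 (16 bits at 6 bits per byte, not counting the shift characters) but 3 bytes in UTF-8 :-)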
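
The arithmetic can be checked with a short sketch, again assuming Python's built-in "utf-7" codec. The helper name bytes_per_char and the sample character U+4E2D are chosen for illustration only:

```python
def bytes_per_char(text, encoding):
    """Average number of encoded bytes per character of text."""
    return len(text.encode(encoding)) / len(text)

# A long run of one character from the U+0800..U+FFFF range.
sample = "\u4e2d" * 300

utf7_avg = bytes_per_char(sample, "utf-7")
utf8_avg = bytes_per_char(sample, "utf-8")
print(utf7_avg, utf8_avg)

# For a very short run the shift characters dominate and UTF-7 loses.
single_utf7_len = len("\u4e2d".encode("utf-7"))
single_utf8_len = len("\u4e2d".encode("utf-8"))
print(single_utf7_len, single_utf8_len)
```

The saving only shows up for long uninterrupted runs; each switch between direct ASCII and base64 costs extra bytes.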

> By the way, when converting UTF-16 to UTF-7 through the Win2K/XP command
> prompt (doing "chcp 65000" and then piping the output of the UTF-16 file
> into a new file), the OS transcodes also those values which are deemed
> unsafe by MIME, such as quotation marks, exclamation marks, ampersands and so forth.
> This is in contrast to GNU recode (I have the DJGPP 32-bit DOS version
> from Simtelnet), which leaves those characters as they are.

Either is correct. The UTF-7 spec (RFC 2152) leaves this to the encoder: such characters may be written directly or base64-encoded.
ICU's UTF-7 converter has an option to use the minimum or maximum set of direct-ASCII characters allowed by the UTF-7 spec.
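
To illustrate the two choices, here is a rough sketch of a UTF-7 encoder with a switch between the minimum and maximum direct sets. It is not ICU's API or implementation; the function name utf7_encode and the optional_direct parameter are made up for this example. The character sets follow Set D and Set O as listed in RFC 2152:

```python
import base64

# Set D: always safe to write directly.
SET_D = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:?")
# Set O: may be written directly, at the encoder's discretion.
SET_O = set("!\"#$%&*;<=>@[]^_`{|}")
WHITESPACE = set(" \t\r\n")

def utf7_encode(text, optional_direct=True):
    """Hypothetical encoder: optional_direct selects the max (True) or min (False) direct set."""
    direct = SET_D | WHITESPACE | (SET_O if optional_direct else set())
    out = []
    run = []  # characters waiting to be base64-encoded

    def flush():
        if run:
            raw = "".join(run).encode("utf-16-be")
            # Modified base64: standard alphabet, no "=" padding.
            b64 = base64.b64encode(raw).rstrip(b"=").decode("ascii")
            out.append("+" + b64 + "-")  # always close the run explicitly
            run.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")  # escaped literal plus sign
        elif ch in direct:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out).encode("ascii")

maximum = utf7_encode('Say "hi" & go!', optional_direct=True)
minimum = utf7_encode('Say "hi" & go!', optional_direct=False)
print(maximum)
print(minimum)
```

Either output should decode to the same text with any conforming UTF-7 decoder; the minimum set simply trades size for robustness on paths that mangle punctuation.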

markus


