Re: Communicator Unicode

From: David Goldsmith (goldsmith@apple.com)
Date: Fri Sep 12 1997 - 13:37:11 EDT


Markus Kuhn (mskuhn@cip.informatik.uni-erlangen.de) wrote:

>UTF-8 is clearly the preferred one. UTF-7 is a hack to create a
>base64-style encoding for Unicode characters that was once intended
>for e-mail usage. It badly messes up the distinction between the
>character set and the transport encoding in MIME and should be
>forgotten quickly. It is of zero relevance for HTTP (which offers
>binary transparency), and thanks to ESMTP and the security upgrades to
>practically all sendmail installations all over the world that were
>necessary in the past 24 months due to published attack software, the
>7-bit problem of e-mail is also mostly gone today.
>
>UTF-7 is clearly deprecated. Unicode and ISO 10646 are standards that
>will continually evolve, and there is very little an implementation
>can do with the knowledge of the version number except always using
>the most recent available font. Therefore the UNICODE-1-1-UTF-8
>identifier has never been a good idea in the first place.

Sigh. I want to clear up a couple of misconceptions here. Of course, I'm
the original author of UTF-7, so take that into consideration.

UTF-7 was intended to produce a Unicode encoding that would reduce to
(mostly) ASCII in the limiting case. Quoted-printable encoded UTF-8 has
the same property, but suffers large expansion for non-Roman text. Like
quoted-printable UTF-8, UTF-7 was intended to be readable by a recipient
who didn't support MIME or Unicode (the latter is still quite relevant).
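
To make the size comparison above concrete, here is a minimal sketch using Python's standard "utf-7" codec and the quopri module. The helper name encoded_sizes and the sample strings are illustrative assumptions, not anything from the UTF-7 specification, and a modern codec may differ in detail from 1997-era mail software.

```python
import quopri

def encoded_sizes(text):
    """Return (utf7_bytes, qp_utf8_bytes) for the given text.

    Hypothetical helper for illustration only.
    """
    utf7 = text.encode("utf-7")                          # UTF-7 as a charset encoding
    qp_utf8 = quopri.encodestring(text.encode("utf-8"))  # UTF-8 wrapped in quoted-printable
    return utf7, qp_utf8

# Plain ASCII text: both forms stay readable as-is.
ascii_utf7, ascii_qp = encoded_sizes("Hello world")

# Non-Roman text: compare how much each form expands.
cjk_utf7, cjk_qp = encoded_sizes("日本語")

print(ascii_utf7, ascii_qp)
print(len(cjk_utf7), len(cjk_qp))
```

For the ASCII sample both encodings pass the text through unchanged, while for the Japanese sample the quoted-printable UTF-8 form comes out noticeably longer than the UTF-7 form.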

As for the distinction between character set and transport encoding,
UTF-7 took the form it did after close consultation with the IETF and the
appropriate ietf-charset people. In fact, it was proposed at one point
that it be made a content transfer encoding, and that was explicitly
deprecated by the IETF representatives, as UTF-7 is not general-purpose
enough. I wouldn't have minded either way. If UTF-7 is too much like a
transfer encoding, then so are a lot of other charset encodings, like HZ.

Finally, although SMTP agents may have gotten more 8-bit savvy, most mail
clients I've seen on Macs and Wintel PCs still encode 8-bit content as
quoted-printable or Base64 *all the time*.
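
As a rough present-day illustration of the same habit, the sketch below builds two messages with Python's standard email package and inspects the transfer encoding it picks. This shows one library's defaults only; it is an assumption that they resemble what the mail clients mentioned above do, and the sample text is made up.

```python
from email.mime.text import MIMEText

# Hypothetical sample body containing non-ASCII characters.
body = "Grüße aus Erlangen"

# Same text, labelled with two different charsets.
utf8_msg = MIMEText(body, "plain", "utf-8")
latin1_msg = MIMEText(body, "plain", "iso-8859-1")

# The library chooses a 7-bit-safe transfer encoding for each charset
# rather than sending raw 8-bit bytes.
utf8_cte = utf8_msg["Content-Transfer-Encoding"]
latin1_cte = latin1_msg["Content-Transfer-Encoding"]

print("utf-8      ->", utf8_cte)
print("iso-8859-1 ->", latin1_cte)
```

Neither message goes out as 8bit by default, even though the transport may well be able to carry it.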

I agree that UTF-7 is of marginal relevance these days, but it is not
deprecated in any formal sense, and is still useful in some situations.

By the way, the version number of Unicode in the charset names was also
at the insistence of the IETF. It happened at a time when there was still
deep suspicion of Unicode. The newer registrations are dropping the
version numbers.

David Goldsmith
Architect
International, Text, and Graphics Department
Apple Computer, Inc.
goldsmith@apple.com


