Why "UTF-5" is not a UTF

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 02 2000 - 21:59:28 EST


> At 07:30 AM 03/02/2000 -0800, Doug Ewell wrote:
> >I've been working on implementing a UTF-5 encoder and decoder based on
> >the specifications in the file
> >
> >http://ftp.univie.ac.at/netinfo/internet-drafts/draft-jseng-utf5-01.txt
> >
> >and I am running into problems with what I will call "UTF-5 mode,"

Bob Rosenberg responded to Doug Ewell:
>
> You are not looking at the problem correctly. In the case of an Email
> Address, the syntax is name@domain. In the example shown, the CONTENTS of
> name and domain are rendered in UTF-5 NOT the full string. Thus you pass
> the 3 sections of the address (which are delineated by the "@" and the ".")
> through the converter SEPARATELY. IOW: You must parse the string based on
> its format to extract the UTF-5 sections (as well as syntax validate it).
>

Actually, there is a much more serious problem represented by the
UTF-5 Internet Draft. The very term "UTF-5" is seriously misleading,
because "UTF-5" is not a Unicode Transformation Format at all,
as defined by the standard, but instead represents a Transfer Encoding Syntax
(TES) masquerading as a UTF.

It does a serious disservice to implementers to be promoting another
"UTF" which is not a UTF. This confuses and dilutes the meaning of the
true, sanctioned UTF's, namely, UTF-8 and UTF-16 (and the soon-to-be-approved
UTF-32). Unicode Transformation Formats are Character Encoding Forms (CEF),
not Transfer Encoding Syntaxes.

The Unicode Technical Report #17, Character Encoding Model, clarifies the
meanings of "coded character set", "character encoding form", "character
encoding scheme", and "transfer encoding syntax". In particular note the
following definitions:

"A character encoding form is a mapping from the set of integers used in
a Coded Character Set to the set of sequences of code units. A code unit is
an integer occupying a specified binary width in a computer architecture,
such as an 8-bit byte. The encoding form enables character representation
as actual data in a computer."

UTF-8 is an encoding form for Unicode/10646 that maps the scalar values
for the encoded characters to (sequences of) 8-bit code units.

UTF-16 is an encoding form for Unicode/10646 that maps the scalar values
for the encoded characters to (sequences of) 16-bit code units.

UTF-32 is an encoding form for Unicode that maps the scalar values for
the encoded characters to single 32-bit code units.
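
As a rough illustration of the three encoding forms (a sketch only, not taken from TR #17), the following Python lists the code units each form produces for U+5C71 and for one supplementary character, U+10400. The helper name code_units is hypothetical. It uses Python's byte-oriented codecs purely as a convenient way to get at the code unit values.

```python
def code_units(text, codec, width):
    """Return the code units of `text` as a list of integers.

    `codec` is a Python codec name (big-endian where byte order matters)
    and `width` is the code unit size in bytes. Going through a byte
    serialization is only a shortcut; code units are integers, not bytes.
    """
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


for ch in ("\u5C71", "\U00010400"):
    print(f"U+{ord(ch):04X}")
    print("  UTF-8 :", [f"{u:02X}" for u in code_units(ch, "utf-8", 1)])
    print("  UTF-16:", [f"{u:04X}" for u in code_units(ch, "utf-16-be", 2)])
    print("  UTF-32:", [f"{u:08X}" for u in code_units(ch, "utf-32-be", 4)])
```

The supplementary character is there to show the difference: UTF-16 needs two 16-bit code units for it, while UTF-32 still needs exactly one 32-bit code unit.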

"A character encoding scheme is a mapping of code units into serialized
byte sequences."

UTF-8, UTF-16BE, UTF-16LE (and in the future, UTF-32BE, UTF-32LE) are the
Unicode encoding schemes. But this concept also encompasses non-Unicode
encoding schemes such as ISO 2022-JP or IBM host mixed code pages that may
mix and serialize more than one coded character set into byte sequences.
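
A small sketch of the form/scheme distinction, assuming Python's standard codecs: the same single 16-bit code unit, 0x5C71, serializes to two different byte sequences depending on the encoding scheme chosen.

```python
ch = "\u5C71"  # one 16-bit code unit, 0x5C71, in the UTF-16 encoding form

# The encoding scheme fixes the byte order used to serialize that code unit.
utf16be = ch.encode("utf-16-be")
utf16le = ch.encode("utf-16-le")

print("UTF-16BE:", utf16be.hex())
print("UTF-16LE:", utf16le.hex())
```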

"A transfer encoding syntax is a reversible transform of encoded data which
may (or may not) include textual data represented in one or more character
encoding schemes."

TES's are things like base64, uuencode, BinHex, quoted-printable, etc., that
are designed to convert textual (or other) data into sequences of byte
values that avoid particular values that would confuse one or more Internet or
other transmission/storage protocols.
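
To make the TES idea concrete, here is a minimal sketch using two of the syntaxes named above, via Python's standard base64 and quopri modules. The input is already encoded data (UTF-8 bytes here, chosen only for illustration); the TES neither knows nor cares what those bytes mean.

```python
import base64
import quopri

# Already-encoded data: the UTF-8 bytes E5 B1 B1 E5 8F A3, all above 0x7F.
raw = "\u5C71\u53E3".encode("utf-8")

b64 = base64.b64encode(raw)
qp = quopri.encodestring(raw)

# Both transforms avoid byte values above 0x7F ...
assert all(b < 0x80 for b in b64 + qp)

# ... and both are reversible, giving back exactly the original bytes.
assert base64.b64decode(b64) == raw
assert quopri.decodestring(qp) == raw

print(b64, qp)
```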

Now let's look at "UTF-5". The Internet Draft points out applications of
"UTF-5":

"In the Domain Name System, although the technical standard does not
prevent 8-bits character to be use as domain names, general use of
the system restrict it to only A-Z (upper and lower), 0-9 and "-" as a
valid domain name. This poses great difficulty when doing i18n of domain
names as the current UTF-7, UTF-8 and UTF-16 are not compatible with the
existing software system already in used." [sic]

Why would UTF-5 address this problem? Because it is a reversible transform
of encoded data which is designed to avoid all the unsafe byte values
in domain names. Its output is restricted to byte values that, when
interpreted in an ASCII-based coded character set (as would be the case
for all domain name processing), appear only as '0'..'9' and 'A'..'V'.
That makes it compatible with the practical restrictions of preexisting
protocol implementations.

The argument is the same for the use of "UTF-5" for SMTP Addresses --
by converting the original data into a form where no byte values outside
'0'..'9' and 'A'..'V' would appear in the encoded portions of a string
using this coding, email addresses can be converted to a form which
can safely pass through security checks used by the existing SMTP
protocol implementations screening for control characters or other bytes in
email addresses.

Now let's look again at Doug Ewell's confusion. The Internet Draft
shows as an example:

"For example, an SMTP Email address for "yamaguchi@asahi.ninhon" [sic]
(5C71 53J3 [sic] '@' 671D 65E5 '.' 65E5 672C) can be represented in
UTF-5 "LC71L3E3@M71DM5E5.M5E5M72C". ... "

Let's untangle this example.

The original textual data for the address can be represented as a Unicode
string in the UTF-16 encoding form:

5C71 53E3 0040 671D 65E5 002E 65E5 672C (i.e. "yamaguchi@asahi.nihon")

Someone who wants to make use of the "UTF-5" TES for turning this into
an email address that would be safe to pass through the SMTP protocol
would need to first turn this into three substrings (as Bob Rosenberg
indicated):

5C71 53E3
@
671D 65E5
.
65E5 672C (This, of course, presumes that anything other than "jp" will
            actually ever be registered for this part of the address, but
            that is a side issue.)

Then, using the algorithm of the Internet Draft, each of the three
substrings can be converted into sequences of binary quintets which
are then mapped onto the characters '0'..'9', 'A'..'V' via a translation
table.
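
A minimal sketch of that conversion, under my reading of the Internet Draft as inferred from its own example: each character's code point is written as hexadecimal digits, each digit becomes a quintet, the first quintet of every character has its high bit set, and the quintets 0..31 are looked up in a 32-entry table. The names QUINTET_ALPHABET and utf5_encode are hypothetical, and the details are an assumption wherever the draft's example does not pin them down.

```python
# Translation table for quintet values 0..31: '0'..'9' then 'A'..'V'.
QUINTET_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text):
    """Encode one substring (not a whole address) as "UTF-5" characters."""
    out = []
    for ch in text:
        # Hex digits of the code point, most significant first, no padding.
        quintets = [int(d, 16) for d in format(ord(ch), "X")]
        # Assumed rule: set the high bit on the first quintet of each
        # character, so '0'..'F' becomes 'G'..'V' and marks a boundary.
        quintets[0] |= 0x10
        out.extend(QUINTET_ALPHABET[q] for q in quintets)
    return "".join(out)


for substring in ("\u5C71\u53E3", "\u671D\u65E5", "\u65E5\u672C"):
    print(utf5_encode(substring))
```

Note that the result at this stage is a sequence of characters from the translation table, not yet a sequence of bytes.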

The missing next piece is the assumption that those characters are then
themselves represented as byte values according to the ASCII coded
character set. You end up with a converted byte sequence of:

4C 43 37 31 4C 33 45 33 40 4D 37 31 44 4D 35 45 35 2E 4D 35 45 35 4D 37 32 43

That byte sequence can be pumped out via SMTP protocol as an address, and
no security alarms will sound, since it will all be interpreted *as if*
it were just an ASCII string (instead of "XX"@"YY"."ZZ" with the XX, YY, and ZZ
cleverly converted to byte values pretending to be ASCII). I.e. the mail
protocol simply sees:

"LC71L3E3@M71DM5E5.M5E5M72C"

The recipient would have to be told, via a MIME tag declaring the use of
"UTF-5" as a TES for the address (or be smart enough to figure it out
heuristically), that the address as transmitted is actually gobbledygook
and needs to be run backward through the "UTF-5" decoding to recover the
intended substrings of the address.

However, it is important to note that this clever dodge results in an
address that is actually conveyed in *some* coded character set -- it
only appears like ASCII. The address could actually be ASCII, or the encoding
used for the email could be 8859-1, or Code Page 1252, or even UTF-8. All
of those would still interpret the bytes of "UTF-5" as the ASCII characters.
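
This is easy to check with a short sketch using Python's standard codecs. The byte string below is the "UTF-5" form of the address from the draft's example; every ASCII-compatible charset listed decodes it to the same characters.

```python
# The address as it travels on the wire: bytes that merely look like ASCII.
wire = b"LC71L3E3@M71DM5E5.M5E5M72C"

decoded = {name: wire.decode(name)
           for name in ("ascii", "latin-1", "cp1252", "utf-8")}

for name, text in decoded.items():
    print(f"{name:8} {text}")

# All four interpretations agree, so the bytes alone cannot tell you
# which coded character set the sender had in mind.
assert len(set(decoded.values())) == 1
```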

So if my mailer was supporting UTF-8 (a "real" encoding scheme), I would
have two layers of interpretation for the address:

      4C 43 37 31 4C 33 45 33 40 4D 37 31 44 4D 35 45 35 2E 4D 35 45 35 4D 37 32 43
UTF-8 -----------------------------------------------------------------------------
UTF-5 =======================    =======================    =======================

In other words, the entire string is being handled as textual data, using
the Unicode encoded character set in the UTF-8 encoding form. But because
this is actually an address *encoded* for transmission using "UTF-5", it
has embedded within it 3 little pieces of "stuff" that have to be run
backwards through the TES to obtain the actual encoded characters they
are supposed to represent. Then I have to represent those encoded characters
in the *actual* encoding form that I am using -- which in this case happens
to be UTF-8. The net result for this data would be:

E5 B1 B1 E5 8F A3 40 E6 9C 9D E6 97 A5 2E E6 97 A5 E6 9C AC

*That* is now a complete string, interpretable according to the Unicode
Standard, using encoding form UTF-8.
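
The two layers can be sketched as follows. This is only an illustration: utf5_decode and decode_address are hypothetical names, and the quintet rules (high bit marks the first quintet of each character, table '0'..'9' then 'A'..'V') are my reading of the Internet Draft, inferred from its example. The outer layer decodes the bytes as UTF-8 text; the inner layer runs each substring backward through the TES.

```python
import re

QUINTET_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_decode(segment):
    """Decode one "UTF-5" substring back into the characters it stands for."""
    chars, value = [], None
    for c in segment:
        q = QUINTET_ALPHABET.index(c)   # ValueError if c is not in the table
        if q & 0x10:                    # high bit set: a new character starts
            if value is not None:
                chars.append(chr(value))
            value = q & 0x0F
        elif value is None:
            raise ValueError("substring does not start with a leading quintet")
        else:                           # continuation: append four more bits
            value = (value << 4) | q
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)


def decode_address(wire):
    """Outer layer: bytes as UTF-8 text. Inner layer: undo the TES per part."""
    text = wire.decode("utf-8")
    parts = re.split(r"([@.])", text)   # keep the '@' and '.' delimiters
    return "".join(p if p in ("@", ".") else utf5_decode(p) for p in parts)


address = decode_address(b"LC71L3E3@M71DM5E5.M5E5M72C")
utf8_bytes = address.encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8_bytes))
```

Only the final encode step produces data in an actual Unicode encoding form; everything before it is bookkeeping to undo the transfer syntax.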

No wonder Doug was confused about how to implement an encoder/decoder for
this.

Please, please, please. Cease and desist from inappropriately claiming the
label "UTF" for these kinds of transfer encoding syntaxes. Nothing good
comes of it -- only confusion and heartache.

Call it "TES-5" if you want, or "base32", or "alphanumuni" or "uniencode"
or something! Just don't call it a "UTF".

--Ken


