UTFs, ACEs, and English horns

From: DougEwell2@cs.com
Date: Sun Jun 17 2001 - 13:55:21 EDT


Last year I became aware of, and frustrated with, a new Unicode TES called
UTF-5, proposed in an Internet Draft by James Seng, Martin Dürst, and Tin Wee
Tan. It was intended for encoding internationalized domain names (IDN)
without breaking the existing DNS structure. It used a fairly clever scheme
of transforming the hex representation of a Unicode code point into a
variable-length byte sequence. However, I felt the Internet Draft was poorly
written: there were no guidelines as to which characters were to be encoded
and which (beyond the obvious U+002E) were not, and the examples (and
reference encoder and decoder at www.idns.org) were self-contradictory.
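
As a rough illustration of the transform described above, here is a minimal
Python sketch. It is not taken from the Internet Draft. It assumes the scheme
as it is usually summarized: each code point is written in hex without
leading zeros, the first hex digit is mapped to one of the letters G-V (a
5-bit quintet with the high bit set), and the remaining hex digits are
emitted unchanged as 0-9 and A-F. The names utf5_encode and utf5_decode are
hypothetical, and the sketch transforms every character it is given, since
which characters should be left alone is exactly what the draft did not say.

```python
# Hypothetical sketch of the UTF-5 idea; not the draft's reference code.
# One symbol per 5-bit quintet: values 0-15 are "0"-"F", 16-31 are "G"-"V".
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text: str) -> str:
    """Encode every character of text (assumption: nothing is exempt)."""
    out = []
    for ch in text:
        hex_digits = format(ord(ch), "X")      # hex, no leading zeros
        first = int(hex_digits[0], 16)
        out.append(DIGITS[16 + first])         # leading quintet: G..V
        out.append(hex_digits[1:])             # continuation quintets: 0..F
    return "".join(out)


def utf5_decode(encoded: str) -> str:
    """Reverse utf5_encode; a symbol in G..V starts a new code point."""
    chars = []
    value = None                               # code point being assembled
    for sym in encoded.upper():
        quintet = DIGITS.index(sym)            # ValueError on a bad symbol
        if quintet >= 16:                      # high bit set: new character
            if value is not None:
                chars.append(chr(value))
            value = quintet - 16
        else:                                  # high bit clear: next nibble
            if value is None:
                raise ValueError("continuation symbol with no leading symbol")
            value = value * 16 + quintet
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)
```

Under these assumptions, U+0041 (hex 41) becomes "K1" and U+00FC (hex FC)
becomes "VC". A caller would still have to split a domain name on U+002E and
encode each label separately.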

Ken Whistler pointed out, in a reply to my diatribe, that UTF-5 had the
additional problem of not being a true UTF, despite the name. It was really
a TES (transfer encoding syntax), because the intent was to provide a
reversible transform of Unicode characters to avoid violating the DNS naming
requirements. At least the "-5" part was accurate, though.

Subsequently I discovered Internet Drafts describing an assortment of what
had come to be called ACEs (ASCII-Compatible Encodings), all intending to
solve the IDN problem. Mark Davis and Paul Hoffman created things called
LACE (Length-based ACE) and RACE (Row-based ACE), which had elaborate
compression schemes but which at least appeared to be completely specified.
Each encoding came with a special signature, "--bq", to indicate the presence
of a LACE- or RACE-based domain name.

I'm not sure exactly how "ASCII-compatible" these ACEs are, since ASCII
characters (those below U+0080) seem to be transformed along with everything
else, but that seems to be the accepted name, and at least it doesn't
conflict with Unicode usage.

Now, upon visiting the Internet Drafts index once again, I see a
proliferation of ACEs, including schemes called BRACE and DUNCE. (I can't
tell from the spec whether DUNCE is intended as a joke or not, and I think
that says a lot.) The big question now is which of these burgeoning ACEs
will emerge as the standard, or -- horrors -- whether *more than one* might
be adopted.

But there's more. While looking, out of curiosity, for an update to the
since-expired UTF-5 document, I found:

    draft-ietf-idn-utf6-00.txt

by Mark Welter and Brian W. Spolarich, which claims to describe something
called UTF-6. (Yes!) This document, like the many others that plagiarize
freely from François Yergeau's RFC 2279 on UTF-8, copies its text and
structure from elsewhere, in this case from the BRACE proposal. This type of
copying isn't always a bad idea, but it always
raises the question of whether the author fully understood the underlying
concepts or just copied and pasted the words.

So what is this UTF-6? Get ready... it's nothing more than a rehash of Seng
et al.'s UTF-5, with a two-level run-length compression scheme added on.
That's it. It suffers from the same problems as UTF-5, adds compression that
mainly benefits small alphabets (every proponent of a DNS solution seems to
be motivated by a desire to support a specific language or script, often CJK;
Welter and Spolarich seem to have been motivated to support Arabic DNS
names), and of course proposes its own signature, "--wq", to differentiate it
from all the other ACEs and jokers.

That name, "UTF-6", is particularly annoying. As Whistler observed, these
things aren't really UTFs at all, but because of the widespread distribution
and mindless copying of well-written documents like RFC 2279 that describe
well-specified encoding schemes like UTF-8, everybody now claims to have
developed a "UTF." (Compare this to some of the Gedankenexperiments by
Unicode list members, which will never be adopted but which at least qualify
as true UTFs.) The "6" in UTF-6 doesn't refer to anything except the idea
that UTF-6 is an enhancement to UTF-5. Nothing is done in groups of six
bits, bytes, characters, or anything.

There is an observation in the classical music world about the English horn,
to the effect that it is neither English nor a horn. (A similar remark has
been made about the "Holy Roman Empire.") This is the situation with UTF-6:
it is not a UTF, nor is there anything "6" about it.

Much of the discussion on this list concerning Oracle's proposed UTF-8s
mentions the very real problems with proliferating UTFs. They add confusion
to Unicode, especially among non-experts. The explosion of IDN solutions is
similar, except that there are even more proposals out there and even more
confusion. Many companies seem to have developed their own schemes rather
than adopting an existing proposal, for no good reason except visions of
patent rights and royalties.

I hope that some order comes to the IDN scene soon, so that the Internet can
have ONE well-defined scheme that allows the use of Unicode in the DNS, does
not leak into the outside world any more than necessary, solves the problem
it was intended to solve in a way that everyone can agree on, isn't
extraordinarily difficult to implement, and DOESN'T call itself a UTF. That
would be music to just about everyone's ears.

-Doug Ewell
 Fullerton, California


