Re: What if UTF-8 had been defined after UTF-16?

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Apr 11 2000 - 10:49:08 EDT


Markus Scherer <markus.scherer@jtcsv.com> wrote:

> UTF-8 could have had all the nice features that it has now, plus:
> - C1 control codes (0x80..0x9f) passed through as single bytes
> - no sequences longer than 4 bytes, BMP still covered with 3 bytes
> - no checking for code points > 0x10ffff because
> it could have been designed just for that range
> - no minimum-length problem -> no security concerns
> - all byte values used for some encoding

The idea of designing a UTF to encode only the target range is a good
one, which is why UTF-8 was designed to encode exactly the target range
as it stood at the time, U-00000000 to U-7FFFFFFF. (Note that you
cannot use UTF-8 to encode the mythical U-80000000 through U-FFFFFFFF.)
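
To put that concretely, here is a sketch (mine, not taken from any
implementation) of the sequence lengths under the original definition.
Note that for values past U-7FFFFFFF there is simply nothing to return:

    /* Sketch: sequence length for a scalar value under the
     * original (31-bit) UTF-8 definition.  Returns 0 for values
     * outside U-00000000..U-7FFFFFFF, which have no encoding. */
    int utf8_len_orig(unsigned long c)
    {
        if (c < 0x80)       return 1;  /* 0xxxxxxx           */
        if (c < 0x800)      return 2;  /* 110xxxxx + 1 trail */
        if (c < 0x10000)    return 3;  /* 1110xxxx + 2 trail */
        if (c < 0x200000)   return 4;  /* 11110xxx + 3 trail */
        if (c < 0x4000000)  return 5;  /* 111110xx + 4 trail */
        if (c < 0x80000000) return 6;  /* 1111110x + 5 trail */
        return 0;                      /* no encoding exists  */
    }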

Unfortunately, while it eliminates that "edge case" range check,
UTF-8C1 requires a range check to find the breaks between one- and
two-byte encodings and between two- and three-byte encodings, which a
converter has to determine far more often. You can't tell how many
bytes are in the encoded form by looking at the number of 1-bits in the
lead byte, and conversely you can't tell how many bytes are required to
encode a character by simple bit tests. You must do a comparison, a
range check.
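
For comparison, here is the lead-byte test in standard UTF-8, again
just a sketch of mine: a handful of mask-and-compare bit tests, with
every boundary falling exactly on a bit position. In UTF-8C1, as I
read the proposal, the breaks would fall at arbitrary byte values, so
each test would have to become a pair of magnitude comparisons.

    /* Sketch: sequence length from the lead byte in standard
     * UTF-8.  The count of leading 1-bits tells you everything,
     * so simple bit tests suffice. */
    int utf8_lead_len(unsigned char b)
    {
        if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
        if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
        if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
        if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
        if ((b & 0xFC) == 0xF8) return 5;  /* 111110xx */
        if ((b & 0xFE) == 0xFC) return 6;  /* 1111110x */
        return 0;   /* trail byte (10xxxxxx) or invalid */
    }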

The additive offsets achieve the design goal of avoiding "ambiguously
codeable" code points, but they are annoying in the same way that the
UTF-16 additive offset of 0x10000 is annoying. In UTF-8C1's case they
result in a very awkward breaking point between the two- and three-byte
encodings (U+03A0, which neatly slices the Greek upper-case alphabet in
half).
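
For reference, the UTF-16 offset I mean looks like this in practice
(a sketch of the standard supplementary-plane encoding, with names of
my own invention):

    /* Sketch: encode a supplementary-plane code point
     * (U+10000..U+10FFFF) as a UTF-16 surrogate pair.  The
     * additive offset of 0x10000 must come off before the
     * bits can be split. */
    void utf16_encode_supplementary(unsigned long c,
                                    unsigned short *hi,
                                    unsigned short *lo)
    {
        unsigned long v = c - 0x10000;   /* the annoying offset */
        *hi = (unsigned short)(0xD800 + (v >> 10));   /* high surrogate */
        *lo = (unsigned short)(0xDC00 + (v & 0x3FF)); /* low surrogate  */
    }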

Requiring three bytes instead of UTF-8's two for so many more scripts
(most of Greek and all of Cyrillic, Armenian, Hebrew, Arabic, Syriac,
and Thaana) seems like a big step backward.

All this to encode the C1 characters in one byte. It's interesting that
the most commonly proposed/suggested/dreamed improvements to UTF-8 fall
into these two mutually exclusive categories:

1. Preserve the one-byte encoding of C1 characters at the expense of
    Latin-1.

2. Preserve the one-byte encoding of Latin-1 characters at the expense
    of C1.

I recently stumbled across a now-expired proposal from Jerome Abela
dated 1997-12-23 for a thing called UTF-9, which claimed to "preserv[e]
the full ISO-Latin1 range" although no encoding at all was specified
for the C1 characters (0x80-0x9F). It sacrificed the ability to find
character boundaries from an arbitrary point and required some baroque
checking of the lead byte to find the number of trailing bytes.

For them what cares, the UTF-9 draft is available at:

    http://beatles.cselt.it/mirrors/drafts/draft-abela-utf9-00.txt

On the other hand, the proposals (or dreams) to "preserve" the C1 area
include Jörg Knappen's UTF-7d5 and now Scherer's UTF-8C1, as well as
UTF-EBCDIC (defined in UTR #16), which differs in principle only in that
there is an ASCII-EBCDIC transcoding built into the scheme.

All of this proves that not all needs can be met at once. As Frank da
Cruz recently stated, "compromises" mean that nobody gets everything
they want, but the end result is acceptable to all sides. UTF-8 treats
both camps fairly by encoding neither C1 *nor* Latin-1 in one byte :-),
but after all, its intent was to preserve one-byte encoding of the ASCII
range, nothing more.

-Doug Ewell
 Fullerton, California


