That UTF-8 Rant (was Unicode in source)

From: Addison Phillips (AddisonP@simultrans.com)
Date: Thu Jul 22 1999 - 12:28:22 EDT


Actually, I am aware of the advantages of using UTF-8 in Internet
transmissions: it's inherently (*explicitly*) non-endian and well-suited to
byte stream applications (the encoding is self-synchronizing, so a
transmission error costs at most one lost character).
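
That recovery is almost free to implement; something like this hypothetical
helper (not from any particular library) is all it takes to resynchronize
after a damaged byte:

    /* Skip to the next UTF-8 lead byte.  Continuation bytes always
       look like 10xxxxxx, so anything else begins a new character
       (or is plain ASCII); at most one character is lost. */
    const unsigned char *utf8_resync(const unsigned char *p,
                                     const unsigned char *end)
    {
        while (p < end && (*p & 0xC0) == 0x80)   /* 0x80..0xBF */
            p++;
        return p;
    }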

But for use internal to programs and for text files stored on disk (i.e.
the default encoding for most applications, although not, admittedly, for
Web presentation), UTF-16 *is* cleaner. To wit:

o The characters are all 16 bits in the BMP, in terms of processing (yes,
yes, combining sequences take more than one code unit to encode... but for
processing purposes everything is the same width. Yes, there are higher
planes of existence and these cannot be ignored...).
o There is less in-memory expansion for non-Latin text.
o There are programmatic ways of handling Unicode text via TCHAR that
reduce the impact on code (see the sketch just after this list). If you
don't convert UTF-8 to UTF-16 first, text processing becomes somewhat
uglier.
o For languages other than Western European, the expansion on disk is much
smaller than with UTF-8, so storage is conserved.
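
To make the TCHAR point concrete, here is a rough sketch assuming the Win32
<tchar.h> conventions, compiled with _UNICODE so TCHAR is a 16-bit wchar_t
(the helper names are mine):

    #define _UNICODE              /* normally set by the build */
    #include <tchar.h>
    #include <stddef.h>

    /* With UTF-16 in memory, one TCHAR is one BMP character, so
       counting characters is a plain loop with no decoding. */
    size_t char_count_utf16(const TCHAR *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }

    /* The same count over UTF-8 already has to know about the
       encoding: skip the 10xxxxxx continuation bytes. */
    size_t char_count_utf8(const unsigned char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if ((*s & 0xC0) != 0x80)
                n++;
        return n;
    }

Anything fancier than counting only widens the gap.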

If you are going to write a text processing application, why would you make
UTF-8 the default internally, when UTF-16 is so much easier to code for?
Programmer's editors are, of course, text processing applications, and while
they need to handle UTF-8 (read and write), internally UTF-16 is going to be
much cleaner. Are you arguing to use UTF-8 because it makes the lexical
analyser you've already written able to sorta-kinda process Unicode? I think
the other messages on this thread clearly show why this is a potential
problem if we allow Unicode into our identifiers. (It's not a problem if you
confine Unicode to string literals.) Write a lexer that can handle
UTF-16/UCS-2. It's a lot easier to preprocess all of your text to that
encoding before lexing it than it is to juggle multi-octet sequences inside
the lexer.
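
The preprocessing step is small. A BMP-only sketch (no surrogates, only
rudimentary error checking; a hypothetical helper, not production code)
looks like this; on Win32, MultiByteToWideChar with CP_UTF8 will do the same
job for you:

    /* Convert UTF-8 to UTF-16 before lexing (BMP only).
       Returns the number of UTF-16 code units written. */
    #include <stddef.h>

    size_t utf8_to_utf16(const unsigned char *src, size_t len,
                         unsigned short *dst)
    {
        size_t i = 0, out = 0;
        while (i < len) {
            unsigned char b = src[i];
            if (b < 0x80) {                                 /* ASCII: 1 byte */
                dst[out++] = b;
                i += 1;
            } else if ((b & 0xE0) == 0xC0 && i + 1 < len) { /* 2 bytes */
                dst[out++] = (unsigned short)
                    (((b & 0x1F) << 6) | (src[i + 1] & 0x3F));
                i += 2;
            } else if ((b & 0xF0) == 0xE0 && i + 2 < len) { /* 3 bytes (BMP) */
                dst[out++] = (unsigned short)
                    (((b & 0x0F) << 12) | ((src[i + 1] & 0x3F) << 6)
                                        | (src[i + 2] & 0x3F));
                i += 3;
            } else {                           /* bad or truncated sequence */
                dst[out++] = 0xFFFD;           /* replacement character */
                i += 1;
            }
        }
        return out;
    }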

In short, my grinding axe says: write code for UTF-16. Where possible, store
UTF-16. Give the user the choice of storing or retrieving other encodings
and character sets, including UTF-8. I mean: the whole point of a Universal
Character Set is that it is universal. In theory it supports whatever your
external character set encodes. UTF-16 is the best encoding to use
internally, absent legacy considerations (and the attendant cost/time issues
in implementation). If people want to destroy data by storing it in some
other character set, let them (knowingly) do it. If people need UTF-8 to
transmit data, offer it as an option.
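
Offering that option costs very little code. Another BMP-only sketch
(hypothetical helper, no surrogate pairs):

    /* Write UTF-16 back out as UTF-8 (BMP only).
       Returns the number of bytes written. */
    #include <stddef.h>

    size_t utf16_to_utf8(const unsigned short *src, size_t len,
                         unsigned char *dst)
    {
        size_t i, out = 0;
        for (i = 0; i < len; i++) {
            unsigned short c = src[i];
            if (c < 0x80) {                    /* ASCII: 1 byte */
                dst[out++] = (unsigned char)c;
            } else if (c < 0x800) {            /* U+0080..U+07FF: 2 bytes */
                dst[out++] = (unsigned char)(0xC0 | (c >> 6));
                dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
            } else {                           /* rest of the BMP: 3 bytes */
                dst[out++] = (unsigned char)(0xE0 | (c >> 12));
                dst[out++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
                dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
            }
        }
        return out;
    }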

I'm not saying that UTF-8 is bad. I *like* UTF-8 and cherish a warm place in
my heart for it as an encoding. It is, in fact, a beautiful design. As a
workaround it has very nice, precise benefits and I've used it to good
effect in a number of projects. But usually as a transmission encoding
through somebody else's pipe or through a legacy system where UTF-16 dare
not tread unprotected. The text gets unwrapped through a few lines of C on
the other side.

So I'm not against people storing UTF-8 on disk if that's really their
hearts' desire, but I think it makes sense to use UTF-16 wherever possible
in implementation. IMHO.

Addison

-----Original Message-----
From: G. Adam Stanislav [mailto:adam@whizkidtech.net]
Sent: Wednesday, July 21, 1999 19:05
To: Addison Phillips
Cc: Unicode List; mohrin@sharmahd.com
Subject: Re: Unicode in source code. WHY?

On Wed, Jul 21, 1999 at 03:03:57PM -0700, Addison Phillips wrote:
> UTF-8 is a kludge.

UTF-8 is the only encoding besides ASCII that all Internet protocols are
required to understand (I do not recall the RFC number, but I can look
it up if you wish). And since ASCII is a subset of UTF-8, one may say
UTF-8 is the one and only required encoding.

> [snip]
> But it's still a kludge. At some point, "real" Unicode text files should
> become the norm, rather than having to transform everything. Let's prompt
> editor writers to create editors that read and write Unicode without
> blowing chunks. UTF-8 is merely a detour (albeit a very useful one).

The problem is that there is no such thing as a "real" Unicode text file.
Unicode is 16-bit, ISO 10646 is 32-bit. Which one is real? Some systems are
big-endian, others little-endian. Which one is real?

IMHO, Unicode made a wise choice not to decide. The way I read it, Unicode
is not an encoding but a mapping, hence there is no "real" Unicode text
file. Or, perhaps, any encoding one can think up is real as long as it
can encode all of Unicode and decode it back.

UTF-8 solves all these incompatibilities in a nice way. I agree that it is
not perfect (ASCII-only text needs no encoding, Roman alphabets with
diacritics require some encoding, Chinese and other non-European characters
require a lot of encoding). But it is the best I have seen so far.
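
For example, the exact cost per character in each of those three cases:

    /* UTF-8 byte sequences for one character of each kind: */
    const unsigned char latin_a[]  = { 0x41 };             /* U+0041 'A': 1 byte */
    const unsigned char e_acute[]  = { 0xC3, 0xA9 };       /* U+00E9 e-acute: 2 bytes */
    const unsigned char cjk_char[] = { 0xE4, 0xB8, 0xAD }; /* U+4E2D (CJK ideograph): 3 bytes */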

> PS: <grinding axe>Yes, I know that the standard says UTF-8 is "real"
> Unicode. But UTF-8 should not, IMHO, be the encoding *of choice* for the
> future. It's the encoding of choice for supporting the past.</grinding axe>

The rule that all Internet protocols must understand UTF-8 only started
several months ago. That makes it very much the encoding of the future.
At least on the Internet.

Adam


