Re: PDUTR #26 posted

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Sep 13 2001 - 15:52:04 EDT


At 11:42 AM 9/13/01 +0000, Marcin 'Qrczak' Kowalczyk wrote:
>IMHO Unicode would have been a better standard if UTF-16
>hadn't existed.

Decidedly not. In fact, Unicode would not be widely implemented today.

>Just UTF-8 and UTF-32, code points in the range
>U+0000..7FFFFFFF, no surrogates, no confusion about "how many bits is
>Unicode", an ASCII-compatible encoding in most external transmissions,
>uniform width for internal processing, and practically no byte
>ordering issues. Much simpler.

UTF-32 does have the same byte order issues as UTF-16, except that byte
order is recognizable without a BOM.

The reason that is possible is the same reason UTF-16 has its place. 1/4
of all bytes in UTF-32 are always and redundantly 0x00. To make matters
worse, the next 1/4 of all bytes is redundantly 0x00 as well, except for a
minuscule portion of all data (granted, this proportion can be higher for
some specific documents or corpora).
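
A minimal Python sketch of the point above, not anything from the standard
itself: it measures how many bytes of a UTF-32 string are 0x00 and guesses
the byte order of BOM-less UTF-32 from where those zero bytes sit. The
function names are hypothetical, and the guess assumes code points no higher
than U+10FFFF (which Python's codecs enforce) and text that is not made up
solely of U+0000.

```python
def zero_byte_fraction(text: str) -> float:
    """Fraction of bytes equal to 0x00 when text is encoded as UTF-32."""
    data = text.encode("utf-32-be")  # explicit byte order, so no BOM is added
    return data.count(0) / len(data)


def guess_utf32_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-32 data (illustrative heuristic)."""
    if len(data) < 4 or len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a non-empty multiple of 4 bytes")
    # The most significant byte of every 32-bit unit is always 0x00,
    # so its position within each unit reveals the byte order.
    first_all_zero = all(b == 0 for b in data[0::4])
    last_all_zero = all(b == 0 for b in data[3::4])
    if first_all_zero and not last_all_zero:
        return "big"
    if last_all_zero and not first_all_zero:
        return "little"
    return "unknown"
```

For pure ASCII text three of every four bytes are zero; for typical
ideographs two of every four are.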

Since you speak of internal processing: One software architect I spoke with
put it nicely: With UTF-16 I can put twice the data in my in-memory hash
table and have *on average* the same 1:1 character code:code point
characteristics for processing. That's a win-win.

Using UTF-32, the same system would have to use double the memory, or face
twice the rate of page faults, and still, because of the way scripts work,
there are many operations that need to look at more than one character code
at a time even in UTF-32.
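
As a rough illustration of the memory argument (a sketch only; the helper
name and the sample string are made up for this example), the following
Python compares byte counts and code unit counts for the same text in
UTF-16 and UTF-32:

```python
UNIT_WIDTH = {"utf-16-le": 2, "utf-32-le": 4}  # bytes per code unit


def code_units(text: str, encoding: str) -> int:
    """Number of code units text occupies in the given fixed-unit encoding."""
    return len(text.encode(encoding)) // UNIT_WIDTH[encoding]


# Hypothetical sample: ASCII, two ideographs, and one supplementary
# character (U+10400) that needs a surrogate pair in UTF-16.
sample = "Unicode \u6f22\u5b57 \U00010400"

code_points = len(sample)                     # Python strings count code points
utf16_units = code_units(sample, "utf-16-le")
utf32_units = code_units(sample, "utf-32-le")
utf16_bytes = len(sample.encode("utf-16-le"))
utf32_bytes = len(sample.encode("utf-32-le"))
```

Here the 12 code points take 13 UTF-16 code units (26 bytes) against 12
UTF-32 code units (48 bytes): close to 1:1 in units, at roughly half the
memory.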

UTF-8, while even more compact for European data (though it's 50% larger
than UTF-16 for ideographs), uses multi-code-element encoding for all but
ASCII, which is why it's useful primarily for external data that are rich
in ASCII (like HTML etc.). Most operations on UTF-8 are perforce exposed to
its variable length, whereas UTF-16 processing can be optimized for the
much more frequent 1-unit case, so UTF-8 cannot as readily be used as an
internal format.
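
The size figures can be checked with a short Python sketch (the helper name
and sample strings are chosen only for illustration):

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in each encoding form, without a BOM."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


ascii_sizes = encoded_sizes("hello")             # pure ASCII
european_sizes = encoded_sizes("Gr\u00fc\u00dfe")  # "Grüße": 3 ASCII + 2 Latin-1 letters
ideograph_sizes = encoded_sizes("\u6f22\u5b57")    # two CJK ideographs
```

For the two ideographs UTF-8 needs 6 bytes against 4 in UTF-16, which is
the 50% overhead mentioned above; for the European sample UTF-8 needs 7
bytes against 10.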

Unicode limited to UTF-8 and UTF-32 would be a lot less attractive, and you
would not have seen it implemented in Windows, Office, and other high-volume
platforms as early and as widely as it has been.

A./


