Re: PDUTR #26 posted

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Fri Sep 14 2001 - 05:11:17 EDT


Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag <asmusf@ix.netcom.com> wrote:

> UTF-32 does have the same byte order issues as UTF-16, except that
> byte order is recognizable without a BOM.

UTF-8 would be used for external communication almost exclusively.
Especially as it's compatible with ASCII and thus fits nicely into
existing protocols.
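
To illustrate the ASCII compatibility, here is a minimal Python sketch. The
helper names (utf8_low_bytes, ascii_only) and the sample header are made up
for this example. It checks that the bytes below 0x80 in UTF-8 output are
exactly the ASCII characters of the input, so delimiters such as ':' or
CRLF never appear inside a multibyte sequence.

```python
def utf8_low_bytes(s):
    """Return only the bytes below 0x80 from the UTF-8 encoding of s."""
    return bytes(b for b in s.encode("utf-8") if b < 0x80)


def ascii_only(s):
    """Return the ASCII characters of s, encoded as ASCII."""
    return "".join(c for c in s if ord(c) < 0x80).encode("ascii")


# Hypothetical protocol line mixing ASCII syntax with non-ASCII text.
header = "Subject: Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105\r\n"

# Pure ASCII text is byte-for-byte identical in both encodings.
same_for_ascii = "Subject: hi".encode("utf-8") == "Subject: hi".encode("ascii")

# Non-ASCII characters contribute only bytes with the high bit set.
delimiters_safe = utf8_low_bytes(header) == ascii_only(header)
```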

> Since you speak of internal processing: One software architect I
> spoke with brought this to a nice point: With UTF-16 I can put twice
> the data in my in-memory hash table and have *on average* the same
> 1:1 character code:code point characteristics for processing. That's
> a win-win.

Only if you manage to process characters above U+FFFF correctly.
It's so easy to make processing efficient and wrong.
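
A small Python sketch of "efficient and wrong", assuming well-formed
UTF-16 input; the function names are invented for illustration. Counting
code units is a single length lookup but miscounts as soon as a character
above U+FFFF appears, while a correct count has to skip the trailing
surrogates.

```python
def utf16_units(s):
    """Encode s as UTF-16 and return the list of 16-bit code units."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def count_naive(units):
    """Fast but wrong: assumes one code unit per character."""
    return len(units)


def count_code_points(units):
    """Count every unit except low (trailing) surrogates, 0xDC00..0xDFFF."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


units = utf16_units("a\U00010400b")   # U+10400 needs a surrogate pair
```

For this three-character string the naive count is 4 and the surrogate-aware
count is 3.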

> UTF-8, while even more compressed for European data (it's 50% larger than
> utf-16 for ideographs), uses multi-code element encoding for all but ASCII,

But UTF-16 also uses multi-code element encoding! For program
complexity it doesn't matter how often it occurs if variable-length
encoding has to be handled anyway. For example, you can't take a
character from a string by random index in either case.
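
The indexing point can be sketched in Python as follows. This assumes
well-formed input, and the function names are hypothetical. In both
encodings, finding the n-th character means scanning for character
boundaries instead of computing an offset.

```python
def nth_char_utf8(data, n):
    """Return the n-th code point (0-based) of UTF-8 bytes by scanning."""
    # A character starts at every byte that is not a continuation byte (10xxxxxx).
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-8")


def nth_char_utf16(data, n):
    """Return the n-th code point (0-based) of UTF-16BE bytes by scanning."""
    # A character starts at every unit that is not a low surrogate; in
    # big-endian order the first byte of such a unit is 0xDC..0xDF.
    starts = [i for i in range(0, len(data), 2) if not 0xDC <= data[i] <= 0xDF]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-16-be")


s = "a\u20ac\U00010400z"
u8 = s.encode("utf-8")
u16 = s.encode("utf-16-be")
```

Here the naive slice u16[6:8] for "character 3" yields the low surrogate
of U+10400 instead of "z".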

> Since most operations are perforce exposed to its variable length,
> unlike UTF-16 processing, which can be optimized for the much more
> frequent 1-unit case,

Optimized how? By keeping a flag that says whether all characters are
below U+10000 and using separate routines for that case? It's even more
efficient to forget about UTF and store characters in 8, 16 or 32 bits,
whichever is the smallest that fits. Forget about surrogates. It's simpler.
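
A rough Python sketch of that storage scheme, under the assumption that
the whole string uses one fixed width chosen from its largest character.
pack_fixed_width and char_at are names invented here, not an existing API.

```python
def pack_fixed_width(s):
    """Return (width, data): s stored with 1, 2 or 4 bytes per character."""
    top = max(map(ord, s), default=0)
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    data = b"".join(ord(c).to_bytes(width, "little") for c in s)
    return width, data


def char_at(width, data, i):
    """Constant-time access to character i: no surrogates, no scanning."""
    return chr(int.from_bytes(data[i * width:(i + 1) * width], "little"))


w1, d1 = pack_fixed_width("abc")            # fits in 8 bits per character
w2, d2 = pack_fixed_width("a\u20ac")        # needs 16 bits
w4, d4 = pack_fixed_width("a\U00010400")    # needs 32 bits
```

Random indexing is then a plain offset computation at every width, which is
the simplification being argued for.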

> utf-8 cannot as readily be used as internal format.

It's as easy as UTF-16, unless you want a broken implementation which
treats a surrogate pair as two characters. That is as broken as treating
the bytes of a multibyte UTF-8 sequence as separate characters.
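
As a hedged illustration of that equivalence, the Python sketch below
reverses a string unit by unit in each encoding. The helper names are
made up for this example. Both naive reversals produce ill-formed data
once a multi-unit character is present.

```python
def reverse_utf16_units(s):
    """Reverse s one 16-bit code unit at a time (wrong for surrogate pairs)."""
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units))


def reverse_utf8_bytes(s):
    """Reverse s one byte at a time (wrong for multibyte sequences)."""
    return bytes(reversed(s.encode("utf-8")))


def is_well_formed(data, encoding):
    """Report whether data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


ok16 = is_well_formed(reverse_utf16_units("ab"), "utf-16-le")
bad16 = is_well_formed(reverse_utf16_units("a\U00010400"), "utf-16-le")
ok8 = is_well_formed(reverse_utf8_bytes("ab"), "utf-8")
bad8 = is_well_formed(reverse_utf8_bytes("a\u00e9"), "utf-8")
```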

> Unicode limited to UTF-8 and UTF-32 would be a lot less attractive
> and you would not have seen it implemented in Windows, Office
> and other high volume platforms as early and as widespread as it
> has been.

I don't use Windows. I use UTF-8 much more often than UTF-16
(but still rarely).

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK


