RE: That UTF-8 Rant

From: Addison Phillips (AddisonP@simultrans.com)
Date: Thu Jul 22 1999 - 20:26:29 EDT


One note:

I have done a number of projects with database components. In all of these,
the majority of the string data (varchar, what have you) stored in the
system was multibyte text (e.g. Japanese, Chinese, or Korean data) when the
customer language was CJK.

You are correct that most DATA is not string data. I can't think of a
project that I have worked on where text comprised the majority of the data
in the database. Most of it is some other data type. However, I can think of
several applications where that would not hold true.

One of the projects I worked on kept the data in RAM at runtime (for
performance). A 50% expansion of the text requires, on average, between 12%
and 25% expansion of total storage for Asian languages. *YOU* can tell that
Japanese customer that their product requires 1250 MB of RAM instead of
1000 MB (and loads 25% slower in the morning)! ;-) Ironically, I chose to use UTF-8
for a significant portion of that implementation because of legacy
considerations.
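
To make the arithmetic concrete, here is a rough sketch in Python. The text shares of 25% and 50% are my assumption, worked backwards from the 12% to 25% range; the function name is just for illustration.

```python
def total_expansion(text_share: float, text_growth: float = 0.5) -> float:
    """Fractional growth of total storage when only the text portion grows.

    text_share  -- fraction of all stored bytes that is text (assumed value)
    text_growth -- fractional growth of the text itself (0.5 = 50%, e.g.
                   2-byte characters becoming 3-byte UTF-8 sequences)
    """
    return text_share * text_growth


for share in (0.25, 0.50):
    growth = total_expansion(share)
    print(f"text share {share:.0%}: total storage grows {growth:.1%}, "
          f"1000 MB becomes {1000 * (1 + growth):.0f} MB")
```

With text at half of the stored data, the 1000 MB image becomes 1250 MB, which is the worst case quoted above.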

For Western-Euro-only applications, UTF-8 is a nice compression scheme...
but as an "uncompression scheme" for Asian it sucks and that's one reason
why I'm down on our old buddy FSS-UTF. Most companies that I am aware of
that are doing enterprise-level software releases with a single global
binary are using UTF-16, in part because of performance issues related to
size in Asian languages (and the most demanding customers will probably be
Asian!) and because optimizing the code for it isn't THAT hard.
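
A quick way to see the "compression" versus "uncompression" effect is to encode a few sample strings both ways. This is only a sketch: the sample words are mine, and "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative sample words, not taken from any real data set.
samples = {
    "English": "database",
    "German": "Größe",
    "Japanese": "データベース",
}

for language, word in samples.items():
    utf8_len, utf16_len = encoded_sizes(word)
    print(f"{language:9} UTF-8: {utf8_len:2} bytes   UTF-16: {utf16_len:2} bytes")
```

The English word takes half the space in UTF-8, while the Japanese word takes 18 bytes in UTF-8 against 12 in UTF-16, which is the 50% expansion mentioned earlier.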

Some of your statistics are specious. Over half the Internet is still here
in the USA. As the rest of the world catches up, the ratio of ASCII to what
we used to call MBCS will diminish. There's a whole lot of Shift-JIS out
there already!

That's at least my take on it.

Addison

-----Original Message-----
From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
Sent: Thursday, July 22, 1999 16:22
To: Unicode List
Subject: Re: That UTF-8 Rant

Kenneth Whistler wrote on 1999-07-22 20:37 UTC:
> > UTF-16
> > is in my eyes primarily a political correctness exercise towards the
> > users of scripts who use 6 months of Moore's law by the 3-byte encoding
> > of their characters.
>
> I agree that storage space arguments aren't usually of much value.
> Especially if you are talking about word-processing applications.
> But size does make a difference when one starts talking about
> multi-gigabyte and multi-terabyte database applications. People who make
> decisions about database design do care about such things. And data
> transmission times make a difference, too -- although this can be
> addressed with compression in either case.

Actually, I happen to be extremely interested in exactly these
questions, because I happen to be someone who makes implementation
decisions about databases that could one day grow into the
hundreds-of-gigabyte range. I have not yet seen multi-terabyte plain
text databases though (perhaps the email/fax eavesdroppers at the NSA
have these, if anyone ;-). Databases of that size tend to be filled
with images rather than text.

Therefore a few of my observations in this field:

 - While RAM and disk space aren't a big issue any more these days,
   bus and network bandwidth and the speed of search algorithms still
   are and will continue to be for some time.

 - Network bandwidth can be taken care of by LZ-style compression
   algorithms, but CPU bus bandwidth can't.

 - Bus bandwidth can be a limiting factor in index traversal and full-text
   substring searches.

 - The vast majority (>> 80%) of characters handled today in networked
   databases are 7-bit ASCII. I have yet to see a single >10 gigabyte
   database consisting predominantly of non-Latin text (outside the
   basement of US intelligence agencies :). This is not an issue of
   the deployment of Unicode, because suitable national non-Latin
   character sets have been around for over 15 years.

 - Given the global mix ratio of ASCII versus non-ASCII characters used
   on the Internet today, I believe that UTF-8 text is on average close
   to half the size of the same text in UTF-16.

 - Many important algorithms such as full-text string search and B-tree
   prefix lookups can be implemented equally easily in both UTF-8 and
   UTF-16; however, their execution speed is proportional to the number
   of bits required by the encoding and transferred through the bus
   bottleneck.

 - UTF-8 is a very simple compression algorithm that thanks to its
   stateless encoding is compatible with most string search and indexing
   algorithms, while better compression algorithms such as gzip
   are certainly not.
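
The point about stateless encoding can be illustrated with a small Python sketch. Lead bytes and continuation bytes never overlap in UTF-8, so a plain byte-level search for a valid UTF-8 needle can only match on character boundaries. The function name and the sample strings are made up for this example.

```python
def find_utf8(haystack: str, needle: str) -> int:
    """Find needle by searching the raw UTF-8 bytes; return a character index.

    Returns -1 if the needle does not occur.
    """
    haystack_bytes = haystack.encode("utf-8")
    needle_bytes = needle.encode("utf-8")

    # Ordinary byte search, with no decoding and no encoding state to track.
    byte_pos = haystack_bytes.find(needle_bytes)
    if byte_pos < 0:
        return -1

    # The match starts on a character boundary, so the prefix decodes cleanly
    # and its length in characters is the character index of the match.
    return len(haystack_bytes[:byte_pos].decode("utf-8"))


text = "データベース検索"
print(find_utf8(text, "検索"), text.find("検索"))
```

Both calls report the same position. The same approach does not work on gzip output, where the bytes for a given word depend on everything that came before it.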

I believe that certainly in the western world, but most likely also on a
global average, UTF-8 therefore gives close to a 50% performance
improvement over UTF-16 in database lookups.
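
How close the size ratio gets to one half depends on the ASCII share of the text. A back-of-the-envelope sketch in Python, under two assumptions of mine: every non-ASCII character costs 3 bytes in UTF-8 (typical for CJK; many European letters cost only 2), and every character costs 2 bytes in UTF-16 (no surrogate pairs).

```python
UTF16_BYTES_PER_CHAR = 2  # assumes all characters are in the BMP


def utf8_avg_bytes(ascii_share: float, non_ascii_bytes: int = 3) -> float:
    """Average UTF-8 bytes per character for a given ASCII share."""
    return ascii_share * 1 + (1 - ascii_share) * non_ascii_bytes


def size_ratio(ascii_share: float, non_ascii_bytes: int = 3) -> float:
    """UTF-8 size divided by UTF-16 size for the same text."""
    return utf8_avg_bytes(ascii_share, non_ascii_bytes) / UTF16_BYTES_PER_CHAR


for share in (1.0, 0.8, 0.5, 0.0):
    print(f"ASCII share {share:.0%}: UTF-8 is {size_ratio(share):.2f}x the UTF-16 size")
```

Under these assumptions UTF-8 is half the size only for pure ASCII, about 0.7 times the size at an 80% ASCII share, equal at 50%, and 1.5 times the size for text with no ASCII at all.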

I am well aware that my experimental evidence here is not yet very
complete, but at least I wouldn't dismiss the use of UTF-8 in
high-performance database applications immediately based on performance
reasons. It might well be the more efficient solution in real-life
applications.

> I would consider Microsoft
> Windows a "big field of application" by any reasonable measure.
> Java is another "big field of application". IBM and Apple both make
> extensive use of UTF-16. I won't start running down the medium-sized
> companies using it.

I am not in favour of quoting large companies who have made certain
technical decisions as an argument for the quality of a specific
technical solution. We all have read enough Dilbert to understand how
technical decisions are usually made today in our hype-driven commercial
world. At least I am more skeptically amused than impressed by "it must
be the right way because Microsoft Windows, IBM, Apple, etc. also do it"
(especially if the word Java appears in the same paragraph).

Millions of flies can't be wrong: manure tastes lovely.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


