RE: UTF-8 <> UCS-2/UTF-16 conversion for library use

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 24 2001 - 14:05:55 EDT


Mike,

> > The typical situation involves cases where large data sets are cached in
> > memory, for immediate access. Going to UTF-32 reduces the cache effectively
> > by a factor of two, with no comparable increase in processing efficiency to
> > balance out the extra cache misses. This is because each cache miss is
> > orders of magnitude more expensive than a cache hit.
>
> For this situation you have a good point. For others, however, the
> extra data space of UTF-32 is bound to be lower cost than having to check
> every character for special meaning (i.e. surrogate) before passing it on.

As the price of memory decreases, so does the price of processing. Not all
functions require that you deal with the data at the per-character level.
Many UTF-16 implementations also use UTF-32 internally. Converting a UTF-16
character to a code point is much faster than converting a UTF-8 character.
One of these days UTF-16 may disappear, but not just yet. If you doubled the
size of Windows program resource tables or ICU's data tables, people would
scream. If you double a large program's load time, people will not be happy.
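
To make the UTF-16 to code point step concrete, here is a rough sketch in
Python (used only for illustration). The function name utf16_to_code_points
is made up for this example, and passing unpaired surrogates through
unchanged is an assumption of the sketch, not a recommendation:

```python
def utf16_to_code_points(units):
    """Decode a sequence of 16-bit code units into a list of code points."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by a low surrogate: combine the pair.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit. An unpaired surrogate is passed through as-is
            # here (an assumption of this sketch); real code must pick a policy.
            out.append(u)
            i += 1
    return out
```

There is one range test on the fast path and one extra test plus a shift and
two additions for a pair. For example, the units 0xD834 0xDD1E combine to
U+1D11E.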

>
> Funny. You see robustness, I see latent bugs due to rarely
> exercised code paths.
>

This is why you have to use white box testing at the function level. Compared
to UTF-8, this is a snap. If you follow the same logic, then you should
immediately convert all UTF-8 data to UTF-32. There are many things you have
to check for with UTF-8, and strange things like non-shortest-form UTF-8 can
create very subtle bugs. Bad UTF-8 data can also create more severe problems,
so your code needs more logic just to protect itself.
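
For comparison, a minimal sketch of the checks a careful UTF-8 decoder needs
for a single character. The name decode_utf8_char and the choice to raise
ValueError are assumptions of this example, not anyone's actual library code:

```python
def decode_utf8_char(data, i=0):
    """Decode one character from bytes `data` starting at index i.

    Returns (code_point, next_index); raises ValueError on malformed input.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, i + 1                      # single-byte (ASCII) case
    if 0xC0 <= b0 <= 0xDF:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF7:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")  # stray trail byte or 0xF8..0xFF
    if i + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trail byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        # Non-shortest form, e.g. C0 AF for '/' (U+002F).
        raise ValueError("non-shortest form")
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return cp, i + length
```

That is five separate failure modes for one character, against the single
pairing test that UTF-16 needs. The bytes C0 AF are the classic case: a naive
decoder yields '/', which can slip past any check that ran on the raw bytes.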

You have subtle errors in UTF-32 also. What do you do with a character that
falls in the surrogate range? Suppose you are converting it to UTF-8.
This will create a bad UTF-8 character.
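
A sketch of the guard this implies when going from UTF-32 to UTF-8. The name
encode_utf8 and the use of ValueError are hypothetical choices for this
example:

```python
def encode_utf8(cp):
    """Encode one code point (an int) as UTF-8 bytes, rejecting bad values."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        # Without this test, 0xD800 would come out as ED A0 80,
        # which a strict UTF-8 decoder rejects.
        raise ValueError("surrogate code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

So UTF-32 does not remove the need for a range check on each character; it
only moves the check to the conversion boundary.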

For many applications, UTF-16 is a good compromise between a large code size
and processing efficiency. As this industry changes, the decision points
change. Then there is always the great argument that many applications that
were written for UCS-2 are much easier to convert to UTF-16.

Carl


