Re: UTF-8 <> UCS-2/UTF-16 conversion for library use

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Sep 23 2001 - 05:23:51 EDT


At 10:21 AM 9/21/01 -0700, Kenneth Whistler wrote:

>It is my impression, however, that most significant applications
>tend, these days, to be I/O bound and/or network
>transport bound, rather than compute bound.
...
>We don't hear
>much, anymore, about how "wasteful" Unicode is in its storage
>of characters.

These points are well taken, particularly in the context of the discussion
in which they appeared. However, there are still situations where a doubling
of storage, such as going from UTF-16 to UTF-32 for typical data, has a
direct impact.

The typical situation is one where large data sets are cached in memory for
immediate access. Going to UTF-32 effectively halves the cache, with no
comparable gain in processing efficiency to balance out the extra cache
misses. This matters because each cache miss is orders of magnitude more
expensive than a cache hit.
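
To make that arithmetic concrete, here is a back-of-envelope sketch in C
(the 2 MB cache budget is invented for illustration, and BMP-only text is
assumed):

    /* Illustrative only: how many characters fit in a fixed cache
       budget under each fixed-width encoding. */
    #include <stdio.h>

    #define CACHE_BYTES (2UL * 1024 * 1024)  /* hypothetical 2 MB budget */

    int main(void)
    {
        /* BMP-only text: 2 bytes/char in UTF-16, 4 bytes/char in UTF-32 */
        printf("UTF-16 capacity: %lu chars\n", CACHE_BYTES / 2);
        printf("UTF-32 capacity: %lu chars\n", CACHE_BYTES / 4);
        return 0;
    }

Half as many characters in cache means roughly twice as many lookups fall
through to slower storage.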

For specialized data sets (heavy in ASCII), keeping such a cache in UTF-8
might conceivably reduce cache misses further, to the point where on-the-fly
conversion to UTF-16 could be amortized. However, such an optimization is
not robust unless the ASCII-heavy assumption follows from the nature of the
data (e.g. HTML markup) rather than merely from its source (e.g. a US
market). In the latter case, such an architecture scales badly as the
market changes.
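
For concreteness, a minimal sketch of what such an on-the-fly conversion
might look like (hypothetical, not from any particular library; it assumes
well-formed input and omits the validation a real converter needs):

    #include <stddef.h>

    typedef unsigned char  UTF8;
    typedef unsigned short UTF16;

    /* Decode well-formed UTF-8 into UTF-16; returns the number of
       UTF-16 code units written to dst. */
    size_t utf8_to_utf16(const UTF8 *src, size_t len, UTF16 *dst)
    {
        size_t i = 0, out = 0;
        while (i < len) {
            unsigned long cp;
            UTF8 b = src[i];
            if (b < 0x80) {                     /* 1 byte: ASCII */
                cp = b;
                i += 1;
            } else if (b < 0xE0) {              /* 2 bytes: U+0080..U+07FF */
                cp = ((b & 0x1FUL) << 6) | (src[i+1] & 0x3F);
                i += 2;
            } else if (b < 0xF0) {              /* 3 bytes: rest of the BMP */
                cp = ((b & 0x0FUL) << 12) | ((src[i+1] & 0x3FUL) << 6)
                   | (src[i+2] & 0x3F);
                i += 3;
            } else {                            /* 4 bytes: beyond the BMP */
                cp = ((b & 0x07UL) << 18) | ((src[i+1] & 0x3FUL) << 12)
                   | ((src[i+2] & 0x3FUL) << 6) | (src[i+3] & 0x3F);
                i += 4;
            }
            if (cp < 0x10000UL) {
                dst[out++] = (UTF16)cp;         /* single code unit */
            } else {                            /* surrogate pair */
                cp -= 0x10000UL;
                dst[out++] = (UTF16)(0xD800 | (cp >> 10));
                dst[out++] = (UTF16)(0xDC00 | (cp & 0x3FF));
            }
        }
        return out;
    }

Note that for ASCII-heavy data the loop spends nearly all its time in the
cheap one-byte branch, which is what makes the amortization plausible in
the first place.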

[The decision to use UTF-16, on the other hand, is much more robust,
because the code paths that deal with surrogate pairs will be exercised
only rarely, thanks to the deliberate concentration of nearly all
modern-use characters in the BMP (i.e. the first 64K code points).]
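
For concreteness, a sketch of what that rarely-exercised path looks like
when iterating over UTF-16 text (again hypothetical, not any particular
library's API):

    #include <stddef.h>

    typedef unsigned short UTF16;

    /* Return the code point at *i, advancing *i by one or two code
       units.  Assumes well-formed input. */
    unsigned long next_codepoint(const UTF16 *s, size_t len, size_t *i)
    {
        UTF16 hi = s[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF && *i < len) { /* high surrogate */
            UTF16 lo = s[*i];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {         /* the rare path */
                (*i)++;
                return 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                                 + (lo - 0xDC00);
            }
        }
        return hi;                         /* common path: a BMP character */
    }

For modern-use text the surrogate test fails nearly every time, so the
extra check costs almost nothing.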

A./


