Re: UTF-8 <> UCS-2/UTF-16 conversion for library use

From: Mark Davis (mark@macchiato.com)
Date: Mon Sep 24 2001 - 12:57:11 EDT


> For this situation you have a good point. For others, however, the
> extra data space of UTF-32 is bound to be lower cost than having to check
> every character for special meaning (i.e. surrogate) before passing it on.

First, it is generally far cheaper to test data that is already in the
cache than it is to fetch that data into the cache in the first place.

Second, very often (perhaps the majority of the time), if you code well you
don't have to check every code unit for special meaning. For example, in ICU
collation we map code units through a table of weight values. If a weight is
marked with a special bit, it needs to be handled specially; such special
handling covers contractions, expansions, Hangul, and so on. One of these
special values marks a lead (high) surrogate. So there is no extra cost for
BMP characters in collation, since we have to test for the special bit
anyway.
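
To make that concrete, here is a simplified sketch of the idea in Java. It
is not the actual ICU code: the table layout, the SPECIAL bit, and the
handleSpecial helper are invented for illustration.

    import java.util.function.IntConsumer;

    // Simplified sketch of the weight-table approach (not ICU's real code).
    // Every UTF-16 code unit gets exactly one table lookup; a reserved bit
    // marks the few values that need special handling, so ordinary BMP
    // characters pay no extra surrogate test.
    public class WeightSketch {
        static final int SPECIAL = 1 << 31; // hypothetical "special" bit

        static void collate(char[] text, int[] weights, IntConsumer emit) {
            for (int i = 0; i < text.length; i++) {
                int w = weights[text[i]];  // one lookup per code unit
                if ((w & SPECIAL) == 0) {
                    emit.accept(w);        // common case: plain BMP weight
                } else {
                    // contraction, expansion, Hangul, or lead surrogate:
                    // only now do we look at neighboring code units
                    i = handleSpecial(text, i, emit);
                }
            }
        }

        // Hypothetical handler: decode a surrogate pair and emit a stand-in
        // weight; a real collator would also match contractions etc. here.
        static int handleSpecial(char[] text, int i, IntConsumer emit) {
            if (Character.isHighSurrogate(text[i]) && i + 1 < text.length
                    && Character.isLowSurrogate(text[i + 1])) {
                emit.accept(Character.toCodePoint(text[i], text[i + 1]));
                return i + 1;              // consumed two code units
            }
            emit.accept(text[i] & 0xFFFF); // fallback: unit as its own weight
            return i;
        }
    }

The surrogate test is folded into a branch the fast path takes anyway, which
is the whole point.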

Another case is binary string search. Because of the way UTF-16 (and UTF-8)
is constructed -- lead, trail, and single code units occupy non-overlapping
ranges -- a standard string search algorithm works fine. Whether you search
for "abc" or for "a\uD800\uDC00c", you don't have to check for any surrogate
code units in your processing.
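
Here is a sketch of why, using the most naive search (any standard algorithm
behaves the same way). Since a lead surrogate (0xD800..0xDBFF) can never
equal a trail surrogate (0xDC00..0xDFFF) or a BMP code unit, a code-unit
match can never start or end in the middle of a surrogate pair.

    public class SearchSketch {
        // Naive code-unit search with no surrogate checks anywhere.
        static int indexOf(char[] haystack, char[] needle) {
            outer:
            for (int i = 0; i + needle.length <= haystack.length; i++) {
                for (int j = 0; j < needle.length; j++) {
                    if (haystack[i + j] != needle[j]) continue outer;
                }
                return i; // match at code-unit index i
            }
            return -1;
        }

        public static void main(String[] args) {
            char[] text = "xx a\uD800\uDC00c yy".toCharArray();
            // Same code path whether the needle is pure BMP or contains
            // a surrogate pair:
            System.out.println(indexOf(text, "abc".toCharArray()));
            System.out.println(indexOf(text, "a\uD800\uDC00c".toCharArray()));
        }
    }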

Mark
—————

Give me a place to stand, and I will move the earth -- Archimedes
[http://www.macchiato.com]

----- Original Message -----
From: "Ayers, Mike" <Mike_Ayers@bmc.com>
To: <unicode@unicode.org>
Sent: Monday, September 24, 2001 9:23 AM
Subject: RE: UTF-8 <> UCS-2/UTF-16 conversion for library use

>
> > From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
> > Sent: Sunday, September 23, 2001 02:24 AM
>
> > The typical situation involves cases where large data sets are cached
> > in memory, for immediate access. Going to UTF-32 reduces the cache
> > effectively by a factor of two, with no comparable increase in
> > processing efficiency to balance out the extra cache misses. This is
> > because each cache miss is orders of magnitude more expensive than a
> > cache hit.
>
> For this situation you have a good point. For others, however, the
> extra data space of UTF-32 is bound to be lower cost than having to check
> every character for special meaning (i.e. surrogate) before passing it on.
>
> > For specialized data sets (heavy in ASCII), keeping such a cache in
> > UTF-8 might conceivably reduce cache misses further, to a point where
> > on-the-fly conversion to UTF-16 could get amortized. However, such an
> > optimization is not robust unless the assumption is due to the nature
> > of the data (e.g. HTML) as opposed to merely its source (the US). In
> > the latter case, such an architecture scales badly with change in
> > market.
>
> Maybe, maybe not. Latin characters are in heavy use wherever
> computers are, at least for now.
>
> > [The decision to use UTF-16, on the other hand, is much more robust,
> > because the code paths that deal with surrogate pairs will be
> > exercised with low frequency, due to the deliberate concentration of
> > nearly all modern-use characters into the BMP (i.e. the first 64K).]
>
> Funny. You see robustness, I see latent bugs due to rarely
> exercised code paths.
>
>
> /|/|ike
>
>


