From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Feb 05 2008 - 11:09:21 CST
Hans Aberg wrote:
> Sent: Monday, February 4, 2008 14:49
> To: Jeroen Ruigrok van der Werven
> Cc: unicode@unicode.org
> Subject: Re: Factor implements 24-bit string type for Unicode support
>
>
> On 3 Feb 2008, at 22:45, Jeroen Ruigrok van der Werven wrote:
>
> > http://factor-language.blogspot.com/2008/02/24-bit-strings-are-in.html
> >
> > Personally I'd wonder about this. I can understand the desire to
> > shave bytes
> > off in-memory, but given a lot of platforms having issues with
> > non-32 bit
> > boundaries and the resulting performance or alignment issues I
> > seriously
> > wonder if it is worth the trade off of not just using UCS4 internally.
>
> I think that 32-bit is probably best for internal use in programs for
> speed, avoiding alignment problems; the best way to actually know is
> to do some profiling. Externally, for distributed files, UTF-8 seems
> best, because most agree on how to sort out the bits the bytes.

First, you should not assert that platforms have "problems" when handling
data at non-32-bit boundaries. It is true that they may incur a penalty of
a few additional cycles, but this depends heavily on the structure of the
memory caches, and in fact it will most often be much more costly to suffer
a cache miss penalty just because you have wasted 25% of the fast cache
capacity.

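To make the 25% figure concrete, here is a rough back-of-the-envelope sketch
in Python. The 32 KiB cache size is only an assumed example, not a figure
taken from any particular CPU, and the function name is made up for
illustration. It relies on the fact that no Unicode code point exceeds
U+10FFFF, so the top byte of every UTF-32 code unit is always zero.

```python
def code_points_per_cache(cache_bytes: int, bytes_per_code_point: int) -> int:
    """How many whole code points fit in a cache of the given size."""
    return cache_bytes // bytes_per_code_point


CACHE_BYTES = 32 * 1024  # hypothetical 32 KiB data cache (assumption)

utf32_capacity = code_points_per_cache(CACHE_BYTES, 4)   # 8192 code points
packed_capacity = code_points_per_cache(CACHE_BYTES, 3)  # 10922 code points

# Share of each UTF-32 code unit that is always zero (1 byte out of 4).
wasted_fraction = 1 - 3 / 4

# Extra text that fits in the same cache with 24-bit units: about one third.
extra_fraction = packed_capacity / utf32_capacity - 1
```

In other words, dropping the always-zero byte lets the same cache hold about
a third more text, whatever the real cache size is.
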
Reading memory byte by byte is not as costly as it appears, and rebuilding
a 32-bit entity from 3 separate bytes has a negligible cost in
comparison to the memory access time, which greatly depends on data
locality (in memory, or even worse when this memory is paged out to disk).

The performance penalty of a very basic compression (which can be
decompressed with just two shifts and two ORs that are executed in parallel
on today's processors with multiple pipelines) is really very small compared
to the benefit of using it.

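As an illustration of how little work the decoding involves, here is a
minimal Python sketch of a 24-bit packed buffer. The function names and the
little-endian byte order are my own assumptions for the example; this is not
meant to describe how Factor actually lays out its strings. In this form,
rebuilding one code point takes two shifts and two ORs.

```python
def pack24(code_points):
    """Store each code point in 3 bytes, least significant byte first."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not a Unicode code point: {cp!r}")
        out.append(cp & 0xFF)
        out.append((cp >> 8) & 0xFF)
        out.append((cp >> 16) & 0xFF)
    return bytes(out)


def unpack24_at(buf, index):
    """Rebuild the code point at position `index`: two shifts, two ORs."""
    i = 3 * index
    return buf[i] | (buf[i + 1] << 8) | (buf[i + 2] << 16)


text = "a\u00e9\u20ac\U0001F600"          # 1-, 2-, 3- and 4-byte UTF-8 cases
packed = pack24(ord(ch) for ch in text)    # 12 bytes instead of 16 in UTF-32
decoded = "".join(chr(unpack24_at(packed, i)) for i in range(len(text)))
```

Random access stays constant-time, as with UTF-32: the code point at position
n always starts at byte 3 * n.
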
So yes, it is true that UTF-32 will be more efficient, but only when handling
very small volumes of data (below 1 MB in a single-threaded environment, or
below about 64 KB in a multithreaded environment).

Today, almost all environments are massively multithreaded, and run on OSes
with many concurrent processes as well. The multiple cores each run with
their own very fast data cache, but each one is limited in size, and there are
several stages of caches, including in the OS itself with paged-out memory,
and in modern deployments where data is located on another remote host or
server. At the same time, the total size of databases has also exploded, and
computers are used to process much more massive quantities of text.

As always, it is the bandwidth of the data pipes that limits
performance, and some basic compression that saves 25% of the data size is
certainly a good thing if it helps reduce cache misses in one of the
various stages of data caches that are now used everywhere. You cannot
conclude as a general rule that UTF-32 will always be better, and
experience shows that data locality (and reduced data size) plays a large
role in improving performance, given that the cost of
compression/decompression keeps falling as technology evolves
according to Moore's Law.

The difficulty is to find the threshold at which compression saves time:
it is no longer possible to determine it with a precompiled rule, without
actual performance tests on the target platform (because there are thousands
of possible configurations of CPU models, CPU speeds, internal cache sizes,
external buses and caches, external disks...). You can only estimate that
such a threshold exists, and good software should no longer be written by
assuming a unique external storage format or a specific compression scheme.

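One way to act on this is to measure on the target machine instead of
hard-coding the choice. The Python sketch below only shows the shape of such
a test; all names in it are hypothetical. Note also that in Python the
interpreter overhead dominates, so its timings say very little about cache
effects: a meaningful test would be written in the implementation language
and run over realistic data sizes.

```python
import time
from array import array


def scan_utf32(data):
    """Touch every code point in an array of 32-bit units."""
    total = 0
    for cp in data:
        total += cp
    return total


def scan_packed24(buf):
    """Touch every code point in a 3-bytes-per-code-point buffer."""
    total = 0
    for i in range(0, len(buf), 3):
        total += buf[i] | (buf[i + 1] << 8) | (buf[i + 2] << 16)
    return total


def best_time(func, arg, repeats=3):
    """Best wall-clock time of func(arg) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


def choose_representation(code_points, repeats=3):
    """Time a full scan of both layouts and report the faster one."""
    code_points = list(code_points)
    wide = array("I", code_points)  # "I" is 4 bytes on common platforms
    packed = b"".join(cp.to_bytes(3, "little") for cp in code_points)
    timings = {
        "utf32": best_time(scan_utf32, wide, repeats),
        "packed24": best_time(scan_packed24, packed, repeats),
    }
    return {"choice": min(timings, key=timings.get), "timings": timings}


result = choose_representation([0x41, 0xE9, 0x20AC, 0x1F600] * 1000)
```

The point is only that the decision is taken from a measurement made where
the code runs, and can be repeated for each data size of interest.
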