RE: [long] Use of Unicode in AbiWord

From: Christophe PIERRET (cpierret@businessobjects.com)
Date: Fri Mar 19 1999 - 04:18:38 EST


On March 19, 1999 12:20 AM, Markus Kuhn [SMTP:Markus.Kuhn@cl.cam.ac.uk]
wrote:
> schererm@us.ibm.com wrote on 1999-03-18 20:49 UTC:
> > you will still need to compare for each byte if it is <128 to pass it
> > through unchanged. doing so on 16b or even 32b should not cost much more,
> > if any. using a 16b x-font with UTF-8 should degrade your performance.
>
> Such ad-hoc predictions of degraded performance are very dangerous
> without actual measurements.
>
> [...]
> Just a few counterarguments: 16-bit strings require for most languages
> more cache reloads, and it is perfectly possible that some apparently
> less efficiently looking UTF-8 algorithm suddenly performs faster than
> the more efficiently looking UTF-16 variant. You can efficiently
> implement the <128 test for 8 characters at a time on a modern 64-bit
> processor, and your C library should do this invisibly for you anyway in
> mbtowcs(). In addition, in this example the UTF-8 to 16-bit conversion
> is only a negligible small amount of the computation necessary to
> actually display the glyph, such that the difference shouldn't matter
> anyway. PCs are orders of magnitude too fast today anyway, and many
> applications are desperately looking for useful things to do between the
> keystrokes of the horribly slow users ... ;-)
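
To illustrate the "<128 test for 8 characters at a time" mentioned above,
here is a minimal sketch in C. It is only one possible way to do it, not
what any particular C library does inside mbtowcs(). The name
is_all_ascii is hypothetical, and the code assumes a compiler that
provides <stdint.h> with a 64-bit uint64_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all n bytes at s are < 128
   (pure ASCII), otherwise 0. */
static int is_all_ascii(const unsigned char *s, size_t n)
{
    /* A byte >= 128 has its top bit set, so one mask covers 8 bytes. */
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Main loop: test 8 bytes per iteration. */
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* safe for unaligned input */
        if (w & high_bits)
            return 0;
    }

    /* Tail: the remaining 0..7 bytes, one at a time. */
    for (; i < n; i++) {
        if (s[i] & 0x80)
            return 0;
    }
    return 1;
}
```

A converter could call this first and, when it returns 1, widen the bytes
to 16 bits directly without any multi-byte decoding.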

I got a very surprising benchmark result: a UTF-8 algorithm running 30% faster
than the UTF-16 algorithm!
The benchmark used a collation algorithm implemented for both UTF-16
and UTF-8
(using Visual C++ 5 on a Pentium II 400 with 128 MB of memory).
On Latin-script data (96% of the characters were ASCII), the UTF-8 version
slightly outperformed the UTF-16 one.
Since the only difference is that I extract one UCS-4 character at a time from
the UTF-8 string and then apply the same operations as for UTF-16, I expected
it to be slower ...
I found no significant difference that would explain this, even at the
assembler level.
The only explanation I could find is that UTF-8 (for Latin
script) uses less memory to store the strings.
So even though the UTF-8 version does more computation (in registers), it was
faster.
The balance between memory access cost and register computation cost seems
to be changing ...
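
For readers who want to picture the "extract one UCS-4 at a time" step,
here is a rough sketch in C. It is not the code used in the benchmark.
The name utf8_next is hypothetical, and as an assumption it follows the
current definition of UTF-8 (at most 4 bytes, code points up to U+10FFFF,
no surrogates, no overlong forms) rather than the longer 5- and 6-byte
forms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from s (n bytes available)
   into *out. Returns the number of bytes consumed, or 0 if the input is
   empty, truncated, or malformed. */
static size_t utf8_next(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                 /* ASCII: the common, cheap case */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                      /* stray continuation or bad lead */
    }

    if (len > n)
        return 0;                      /* sequence is cut off */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

With 96% ASCII data almost every call takes the first branch, which is a
compare and a one-byte load, so the extra work per character stays small
compared with reading half as many bytes from memory.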

Christophe Pierret
Business Objects S.A.
http://www.businessobjects.com


