Re: [long] Use of Unicode in AbiWord

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Thu Mar 18 1999 - 18:15:56 EST


schererm@us.ibm.com wrote on 1999-03-18 20:49 UTC:
> you will still need to compare for each byte if it is <128 to pass it
> through unchanged. doing so on 16b or even 32b should not cost much more,
> if any. using a 16b x-font with UTF-8 should degrade your performance.

Such ad-hoc predictions of degraded performance are very dangerous
without actual measurements.

Modern super-scalar processors with deep memory hierarchies and complex
compiler optimization stages make it *extremely* difficult to predict
which code or data structure variant is more efficient. Old rules of
thumb and "common sense" are no longer much use for telling the faster
from the slower of two algorithms of comparable complexity on a late
1990s processor. Surprises are frequent. Design decisions on performance
grounds should today be made only after real measurements; much of
what you learned 10 years ago about manual optimization is obsolete
these days.

Just a few counterarguments: for most languages, 16-bit strings require
more cache reloads, and it is perfectly possible that an apparently
less efficient-looking UTF-8 algorithm suddenly performs faster than
the more efficient-looking UTF-16 variant. You can implement the <128
test efficiently for 8 characters at a time on a modern 64-bit
processor, and your C library should do this invisibly for you anyway
in mbstowcs(). In addition, in this example the UTF-8 to 16-bit
conversion is only a negligibly small part of the computation needed to
actually display the glyph, so the difference shouldn't matter anyway.
PCs are orders of magnitude too fast today, and many applications are
desperately looking for useful things to do between the keystrokes of
the horribly slow users ... ;-)
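
For the curious, here is a minimal sketch of that eight-bytes-at-a-time
<128 test (my own illustration, not taken from any particular C
library): AND each 64-bit word against a mask of the per-byte high
bits.

  #include <stdint.h>
  #include <string.h>

  /* Are all bytes in s[0..n-1] below 128? Tests eight bytes at a
     time by ANDing each 64-bit word against a mask that has the
     high bit of every byte set. */
  static int all_ascii(const unsigned char *s, size_t n)
  {
      uint64_t w;
      while (n >= 8) {
          memcpy(&w, s, 8);        /* portable unaligned load */
          if (w & UINT64_C(0x8080808080808080))
              return 0;            /* some byte has bit 7 set */
          s += 8;
          n -= 8;
      }
      while (n--)                  /* leftover tail bytes */
          if (*s++ & 0x80)
              return 0;
      return 1;
  }

A UTF-8 decoder can then copy such an all-ASCII run straight into the
16-bit buffer and drop down to the per-byte path only when a high bit
actually turns up.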

As far as PS and LS are concerned: since with UTF-8 we are introducing
a new encoding with little backwards compatibility anyway, there is
nothing wrong with recycling the old C0 codes at the same time. I never
understood what is solved by introducing two additional control
characters when over two dozen of the existing 32 are already unused.
Just let, say, 0x0a become the line separator and 0x0b the paragraph
separator if you are defining a new set of formatting codes anyway. I
wouldn't worry too much about the PS and LS codes of Unicode. It would
have been nicer if Unicode had given meaning to the existing C0
positions instead of adding even more control codes. We have to send
everything through converters anyway, so backwards compatibility is not
that much of an issue.
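
To make the recycling idea concrete, a hypothetical converter fragment
(the constant and function names are my invention, not any existing
API) that folds the Unicode separators onto recycled C0 positions might
look like this:

  /* Fold the Unicode separators LS (U+2028) and PS (U+2029) onto
     recycled C0 codes during conversion. */
  #define LINE_SEP 0x0a   /* recycled LF as line separator */
  #define PARA_SEP 0x0b   /* recycled VT as paragraph separator */

  static unsigned long fold_separator(unsigned long c)
  {
      switch (c) {
      case 0x2028: return LINE_SEP;   /* Unicode LS */
      case 0x2029: return PARA_SEP;   /* Unicode PS */
      default:     return c;
      }
  }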

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


