From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Jan 22 2004 - 12:42:55 EST
Doug Ewell wrote:
> BOCU-1 might solve this problem, but multiplying and dividing by 243
> doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by the
> claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
> took only 45% as long as converting it between UTF-16 and UTF-8.)
"claim"? That hurts...
I did measure these things, and the numbers in the table are all from my measurements. I also
included the type of machine I used, etc. (http://www.unicode.org/notes/tn6/#Performance)
The reason why BOCU-1 (and SCSU) are often faster than UTF-8 is that BOCU-1 goes into single-byte
mode for small scripts like Devanagari (used for Hindi). Single-byte mode only performs a
subtraction, no div/mod or even bit-shifting, and writes/reads only one byte per character. It is
also optimized in ICU with a tight inner loop.
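To make the single-byte idea above concrete, here is a rough sketch of a toy difference coder in
Python. It is not BOCU-1: the byte values, the ESCAPE marker, and the names toy_encode/toy_decode
are all made up for illustration, and the real algorithm uses a multi-byte difference encoding
rather than an escape. It only shows why staying inside one small script costs one subtraction and
one byte per character.

```python
ESCAPE = 0xFF  # hypothetical marker: "a full 3-byte code point follows"


def _next_prev(cp: int) -> int:
    # Move the state to the middle of the 128-code-point block containing cp.
    return (cp & ~0x7F) | 0x40


def toy_encode(text: str) -> bytes:
    """Toy single-byte difference coder (illustration only, not BOCU-1)."""
    prev = 0x40
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        diff = cp - prev
        if -64 <= diff <= 63:
            # Fast path: one subtraction, one byte written (0x40..0xBF).
            out.append(0x80 + diff)
        else:
            # Slow path (assumption of this sketch): escape + raw code point.
            out.append(ESCAPE)
            out += cp.to_bytes(3, "big")
        prev = _next_prev(cp)
    return bytes(out)


def toy_decode(data: bytes) -> str:
    """Inverse of toy_encode."""
    prev = 0x40
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESCAPE:
            cp = int.from_bytes(data[i + 1:i + 4], "big")
            i += 4
        else:
            cp = prev + (b - 0x80)
            i += 1
        chars.append(chr(cp))
        prev = _next_prev(cp)
    return "".join(chars)


# Six Devanagari code points (a Hindi greeting), written as escapes.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"
encoded = toy_encode(hindi)
print(len(encoded), len(hindi.encode("utf-8")))
```

Only the first character takes the slow path here; the remaining five fall within 64 code points
of the block's midpoint and take one byte each.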
UTF-8, on the other hand, encodes Hindi with 3 bytes per character, so it has to perform the
bit-shifting and write to/read from more memory locations.
It's the same for Greek/Russian/Arabic etc., although to a lesser degree because it's single bytes
with BOCU-1 vs. only 2 bytes per character with UTF-8.
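The UTF-8 byte counts mentioned above are easy to check with a few lines of Python. The sample
strings and the helper name utf8_bytes_per_char are just my picks for illustration; any text in
these scripts without spaces or ASCII punctuation should give the same ratios.

```python
def utf8_bytes_per_char(text: str) -> float:
    # Average number of UTF-8 bytes per code point in text.
    return len(text.encode("utf-8")) / len(text)


# Sample words written as escapes so the source stays ASCII-only.
samples = {
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",    # Devanagari
    "Greek": "\u03b1\u03b2\u03b3",                      # alpha beta gamma
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",  # Cyrillic
    "Arabic": "\u0633\u0644\u0627\u0645",               # Arabic
}

for name, text in samples.items():
    print(name, utf8_bytes_per_char(text))
```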
The fact that BOCU-1 not only achieves good compression (and binary order and MIME text/
compatibility) but also reasonable conversion performance encouraged Mark and me to publish it.
UTF-8 is useful because it's simple, and supported just about everywhere - but it's otherwise hardly
optimal for anything.
If you want high-speed, compact encoding, use SCSU. If you want good speed, compact encoding, and
binary order and/or MIME compatibility, use BOCU-1. Make sure that both sides of the wire know
what's going across.
markus