From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Apr 28 2006 - 18:01:47 CST
> so that averages to 73,000 per second.
For comparison, I just wrote a quick&dirty test case with ICU4J, which
uses UCA 4.1.0. It was run on a Pentium M Processor 760 (2.00GHz, 533MHz
FSB). According to the test run, it does a string comparison between
each pair of adjacent strings (123,086 comparisons) at 1,309,436
comparisons per second. Sortkeys generated per second were 984,704.
That's in Java; the C version is much faster.
However, the Unicode conformance test is not the data that the code was
optimized for. Our design center was:
A. produce the minimum length (in bytes) for sort keys. This was the metric
chosen because presumably if you are going to use sort keys, you are going to
use them many, many times, and also typically store them on disk. Minimizing the
storage reduces the impact on memory, caches, and disk access, so we are willing
to take more time to generate shorter versions.
B. for string comparison, optimize for sorting (common algorithms like
quicksort) and binary search. Both of these processes end up comparing strings
that are relatively "close" most of the time, since you progressively "zero in"
on the area where the string belongs. That is, normally you much more often
compare "Davis" to "Daniels", "Davies" or "Davidson" than you do to "Smith" or
"Blackbridge". Because of this, we have special fast paths to optimize for close
strings.
C. optimize for 'real data', not random data. That is, use a set of data for a
number of different languages consisting of common-case data for that language,
such as personal names.
There are some charts comparing ICU performance on
http://icu.sourceforge.net/charts/collation_icu4j_sun.html
http://icu.sourceforge.net/charts/collation_icu4c_glibc.html
Here is a quick dump of data for Java:
File: CollationTest_NON_IGNORABLE.txt
Punctuation: Normal
Level: 3
Total strings: 123,087
Total code points: 254,602
COMPARISON
Total seconds: 0.094
Comparisons per second: 1,309,436.170212766
SORTKEY GENERATION
Total seconds: 0.125
Sortkey bytes per code point: 4.8919882797
Sortkeys generated per second: 984,704
Sample Sort Keys
[6] 06 01 05 01 05 00
a [6] 28 01 05 01 05 00
A [6] 28 01 05 01 8F 00
å [7] 28 01 86 99 01 06 00
fish [9] 32 38 4C 36 01 08 01 08 00
File: CollationTest_SHIFTED.txt
Punctuation: Shifted
Level: 4
Total strings: 123,087
Total code points: 254,602
COMPARISON
Total seconds: 0.11
Comparisons per second: 1,118,972.7272727273
SORTKEY GENERATION
Total seconds: 0.125
Sortkey bytes per code point: 5.5544182685
Sortkeys generated per second: 984,704
Sample Sort Keys
[5] 01 01 01 06 00
a [8] 28 01 05 01 05 01 23 00
A [8] 28 01 05 01 8F 01 23 00
å [9] 28 01 86 99 01 06 01 24 00
fish [11] 32 38 4C 36 01 08 01 08 01 26 00
Mark
Mike wrote:
> Hello again,
>
> Now that I have working code, I'm curious if my
> performance is any good. I ran both conformance
> tests and calculated the time spent creating the
> sort keys and comparing them. For the SHIFTED
> test, the time required was 1.83 seconds, and the
> time for the NON_IGNORABLE test was 1.54 seconds.
> There are roughly 123,000 strings in each file,
> so that averages to 73,000 per second.
>
> The tests were run on a 3.2 GHz Pentium D (Windows
> XP). The UCA version is 4.1.0.
>
> So, am I in serious need of optimization?
>
> Thanks for any insight.
>
> Mike
>
>
>
This archive was generated by hypermail 2.1.5 : Fri Apr 28 2006 - 18:05:01 CST