Here's an update on the timings of various 16 bit string compares.
Test File u_strcmp u_strcmpCodePointOrder wcscmp (Windows C Lib function)
---------------------------------------------------------------
Asian Names 37 40 36
Latin Names 56 59 60
Times are in ns on a 866 MHz Pentium III, running Windows 2000.
The test is a binary search within a list of roughly 10000 names.
The bottom line is that the cost to compare UTF-16 strings in code point order is very small - the 3 ns is close to being in the noise for these measurements.
The implication for the UTF-8S debate, as best I understand it, is that solving the UTF-8 / UTF-16 sort order incompatibility by comparing the UTF-16 strings in code point order would be perfectly viable from a performance standpoint. (UTF-8 strings naturally compare in code point order)
-----
The differences from the times quoted in Mark's note, below, are these:
1. Loop overhead is removed from the measurement. (The loop overhead is
actually quite a bit smaller than the difference in numbers would seem
to suggest, because of processor cache interactions. For the new numbers,
the string data, because of the structure of the test, will always be
in L1 cache. With the old numbers, data cache misses are significant,
making the old results perhaps more representative real use.)
2. I tweaked the ICU code to make it a tiny bit faster.
Andy Heninger
IBM, Cupertino, CA
heninger@us.ibm.com
----- Original Message -----
From: Mark Davis
To: Unicore ; Unicode
Sent: Tuesday, June 26, 2001 3:17 PM
Subject: Cost of code point order comparison
I asked our performance czar to run a test comparing the performance of the two ICU utf-16 strcmp routines (UTF-16 binary order and UTF-8/32 binary order). While I want to caution that the results are preliminary, here they are:
"Test File u_strcmp u_strcmpCodePointOrder
---------------------------------------------------
Asian Names 81 ns 83 ns / call
Latin Names 127 ns 124 ns
The test is a binary search of a sorted list of roughly 10000 names. The Asian names are quite a bit shorter, which probably accounts for the time difference between them and the Latin names.
The code path through the u_strcmpCodePointOrder function has (statistically, anyhow) exactly one added simple if relative to u_strcmp. The timing differences are repeatable on my machine, but are probably mostly noise from code alignment and the like..."
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 13:48:07 EDT