From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 12 2010 - 20:16:00 CDT
Philippe Verdy said:
> A basic word-breaker using inly the space separator would marvelously
> improve the speed of French sorting even if backwards ordering occurs,
> just because it would significantly improve the data locality in the
> implementation and would considerably reduce the reallocations and use
> of large buffers.
I disagree. Any implementation of an incremental comparison based on
UCA which is making significant use of reallocations and large buffers
isn't optimized in the first place.
An optimized incremental comparison doesn't consist of breaking
out a chunk of string 1, computing a complete sort key for it,
breaking out a chunk of string 2, computing a complete sort key for
it, and then comparing the keys.
An optimized incremental comparison walks down both string 1 and
string 2, pulling up the relevant comparable collation elements
and tracking state for primary, secondary, and tertiary weights
as it goes, bailing with an answer at the first available point
that the comparison becomes definitively non-equal. That can be
done without *any* buffer allocation or reallocation, using
relatively tiny buffers allocated on the stack, and definitely
has better "data locality" than first trying to chunk strings at
word boundaries.
> A word breaker anyway has a very tiny impact on performance, compared
> to the gain you can have by using smaller data buffers: most of the
> cost of UCA falls within the huge cost of normalizations and complex
> analysis of M-to-N mappings of collation elements, and in
> interpreting/locating the DUCET table entries.
That much I *do* agree with, however. If you have to normalize
entire strings before starting comparison, that is a significant
processing cost. The answer to that is architectural -- if you
have to repeatedly normalize strings before comparing them,
then store the normalized strings and compare *those* instead.
Also, even in the case where dealing with "wild" data input that
cannot be assumed to be pre-normalized, a test to check whether
a string is in NFC is *much* quicker than actually doing the
normalization, and for most data the answer is yes, anyway.
As for the M-to-N mappings and the collation element lookup,
that is unavoidable, however else you build the comparison, so
you just do the best job to make the lookup of collation
elements fast.
But your argument that a word breaker "has a very tiny impact
on performance" compared to normalization or collation element
lookup is moot, if the addition of the word breaker isn't
going to improve the incremental comparison, anyway -- which
is my argument. And unless your word breaker really is
a trivial one (i.e. break on spaces), it isn't correct to
maintain that it is a "tiny impact" compared to normalization,
for example.
> Remember that the longest word in standard French,
> "anticonstitutionnellement", has no accents and includes only 25
> letters creating, so the effect of backward ordering will be limited
> to a very small buffer for 25 collation weights, instead of possibly
> thousands or millions when sorting large texts ...
Or *zero* bytes for extra buffers, with a properly designed
incremental comparison algorithm -- even if the strings to compare
are megabytes long.
[ The remainder of this email has been truncated, as the string
was too long. ;-) ]
--Ken
This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 20:19:08 CDT