From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 16:50:27 CDT
"Mark Davis ☕" <mark@macchiato.com>
> A few comments.
>
> A tailoring that sorts word-by-word would certainly be possible, and is
> certainly allowed by the UCA. As to whether it is necessary or not, that is
> another matter. Sorting is about matching user expectations, and of all of
> the French that I have ever asked, none except for internationalization
> mavens have ever even noticed or heard that French sorts accents backwards
> at all, let alone that it is word-by-word. And the number of times that
> occurs in actual practice is exceedingly small -- a reason for nobody to
> actually know about it!
>
> And when we are talking about multiple words, it is even more obscure.
> Remember that for it to make a difference whether done word-by-word or not,
> every single base letter in the two phrases has to be identical, *and* two
> words in different positions have to have accents on different base
> characters. In constructed examples, certainly possible; in real life, rare.
And such case does occur in large lists (such as when sorting
categories in Wikipedia as I said)
> (Frankly, sometimes I regret our adding French sorting to ICU at all. It
> complicates the code and slows down French sorting significantly, for
> vanishingly small value. However, if you or someone else wanted to add
> word-by-word sorting on top of ICU's implementation, I think the API may be
> sufficient to do that.)
A basic word-breaker using inly the space separator would marvelously
improve the speed of French sorting even if backwards ordering occurs,
just because it would significantly improve the data locality in the
implementation and would considerably reduce the reallocations and use
of large buffers.
A word breaker anyway has a very tiny impact on performance, compared
to the gain you can have by using smaller data buffers: most of the
cost of UCA falls within the huge cost of normalizations and complex
analysis of M-to-N mappings of collation elements, and in
interpreting/locating the DUCET table entries.
Remember that the longest word in standard French,
"anticonstitutionnellement", has no accents and includes only 25
letters creating, so the effect of backward ordering will be limited
to a very small buffer for 25 collation weights, instead of possibly
thousands or millions when sorting large texts (having to sort lists
of phrases that are longer than 25 characters is definitely not an
exceptional situation).
With a word breaker, you can significantly improve the speed of almost
all the steps of the UCA algorithm (including normalizations and
looking for M-to-N collation elements), just because you don't have to
perform at once with the entire text : you can to that on the fly, and
can stop processing the end of the text very rapidly in almost all
practical cases (including in English).
Also remember that collation strings will likely never be stored
entirely for long texts. They will be truncated to a reasonnable
maximum size, and if there's still an equality between a few adjascent
list entries remaining in equality groups, you can still compare them
more precisely using a UCA compare (because the collation string will
only contain the leading part of the primary weights), by computing
the collation levels one by one and progressively.
To be able to compute truncated collation strings, you just need to
get sure that you won't break in the middle of a grapheme cluster. A
basic word-breaker can ensure it very simply, if it just breaks on
whitespaces or on a few punctuations, even before applying the complex
normalizations.
Having to sort very long multi-field texts is also extremely common in
database queries using SQL's ORDER BY clause, which will benefit a lot
of being able to pack several leading fields at once in the same
collation string. Ordered SQL requests are used extremely frequently,
and the need for performance is not superficial.
This need for sorting multi-field texts (whose source is already
broken into separate fields without even using a word breaker) and is
in fact so frequent in so many applications, that it should be
described in the standard algorithm : adding a standard entry for
field separators in the DUCET is very worth the value
And it would avoid a confusion made by various people, including on
English Wikipedia when sorting tables where I demonstrated that the
separator they used, SPACE+ZWNBSP <0020,FEFF>, was definitely broken :
I insisted to demonstrate that SPACE+EXCLAMATION was a far better
separator (in fact it is the lowest possible according to
implementation constraints where controls are forbidden, and a single
SPACE alone was still not enough in sort keys of very tables tables
containing people names or toponyms) for multi-field entries occuring
when sorting Wikipedia tables, even if only a level-1 collation was
performed, and this was effectively changed after my request).
Is it really to much to request the inclusion or reservation of a
standard "empty" collation element for field separators in the DUCET,
with the lowest primary weight possible, and even lower than the
primary weight assigned to ignored characters ? (That's why I also
suggested reserving the negative weight -1, because ignored collation
elements get a 0 primary weight).
At least this would effectively document that, when a field separator
needs to be used used in collation strings or sort keys, it should
have the lowest weight supported by the implementation and NOT the
highest weight or a very high one (as it was incorrectly assumed in
English Wikipedia), even if a full 4-level UCA implementation is not
used.
Philippe.
This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 16:51:39 CDT