From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 06:28:11 CDT
"Kenneth Whistler" <kenw@sybase.com> wrote:
> [ snipping all the word breaking discussion, which I am not going
> to comment on ... ]
>
> CE Whitehead said:
>
> > I collate as follows (note that i' is equivalent to i with accent grave):
> >
> > (EXAMPLE 1 -- my sort)
> > di Silva, Fred
> > di Silva, John
> > di Si'lva, Fred
> > di Si'lva, John
> > Disilva, Fred
> > Disilva, John
That's exactly how I would sort it in French, word by word, so all the
names with the separated particles would come before. This will work
only if level 2 backword reordering of collation weight is limited to
word spans.
Currently, the UCA algorithm completely ignores this, or does not
include any place for introducing a word-breaker to limit the
backwards reordering, so effectively, using the French collation with
the current algorithm, it would first group them on the primary level
(considering only word separation as space, and ignoring case) as:
* Group 1 (DI SILVA FRED)
> di Si'lva, Fred
> di Silva, Fred
* Group 2 (DI SILVA JOHN)
> di Silva, John
> di Si'lva, John
* Group 3 (DISILVA FRED)
> Disilva, Fred
* Group 4 (DISILVA JOHN)
> Disilva, John
Then the secundary level would split and reorder these groups using
the accent differences:
* Group 1 (DI SILVA FRED)
* di Silva, Fred
* di Si'lva, Fred
* Group 2 (DI SILVA JOHN)
* di Silva, John
* di Si'lva, John
* Group 3 (DISILVA FRED)
* Disilva, Fred
* Group 4 (DISILVA JOHN)
* Disilva, John
(The case differences at the ternary level, and ignorable differences
sorted in binary order in the final level have no effect here).
The case becomes even more tricky when sorting correctly the following
list (note the difference of accents on the second 'e' in groups 1 and
5, as the most significant secondary difference is still on the third
'e', before considering the relative order of accents on the second
'e' ; in this example, I only use lowercase because this only plays a
role at the forwards ternary level):
1.
* délègue
* délégué (the final 'é' sorts after the final 'e')
2.
* déléguée
3.
* déléguées
4.
* déléguer
5.
* délègues
* délégués (the final 'é' sorts after the final 'e')
6.
* délèguent
This is a case where the backwards differences plays a role in French,
but this only occurs on a word-by-word base, just like in your example
above.
But if the same list of words is included at end of similar sentences
(identical at the primary level) where there the first words would
only differ by their accents, these first words should really take
precédence to the relative order of the final words listed above.
If we don't limit the backwards reordering, then all accents in the
full sentences will be reordered, so this is the final word that will
drive the order. not only this is incorrect, but having to fully
reorder the secondary weights in the full sentence has a significant
memory and performance impact, when we could just limit ourself to
comparing the sentences word by word, starting from the beginning of
each sentence.
Such case will often occur when sorting article names (e.g. in
Wikipédia), book and movie titles... And when there will be list of
people names, where the first (given) name is placed after the last
(birth) name, separated by a comma, or when sorting titles with an
additional precision at end between parenthèses (such as a country
name to disambiguate city names, e.g. in Wikipédia) :
These final precisions (added in separate words) clearly have a low
priority in the sort order, so the backwards reordering should really
not place them with a higher priority in the secondary collation
level. If we use a word breaker, not only the sort order will be
correct, but we will also gain performance (due to shorter buffering
of collation weights, we don't necessarily have to process all the
strings completely, but just the beginning most of the time).
So when you snipped completely the discussion about word-breaking
within collation, you completely forgot the main point of my message:
that the UCA algorithm does not include any entry points for insering
a word breaker within the algorithm (or any custom breaker that a
tailored collation would need, for example in a complex tailoring that
would recognize space-separated particles in proper names, as if this
space had a lower difference than a true word-separation)
Yes using a breaker in UCA collation is clearly optional (and not
absolutely needed if all levels are forewards in some language, as it
will have no visible effect on the collation order except for
increased performance), but it's absolutely essential if we use any
backwards reordering at any level.
The absence of the word-breaker makes the current UCA algorithm
unusable for French in practice (and also very costly in terms of
space, and unnecessarily slow), when sorting lists of items longer
than a single word : it can only work in practice when sorting basic
dictionary entries (for example it works extremely well when sorting
French Wiktionary categories, but it fails miserably when sorting
French Wikipedia categories, for which the expected backwards ordering
of level 2 has to become forewards, even if this gives non-optimal
sorts within some small groups of article titles).
And that's why I request comments about the effect of the backwards
ordering, the way it is currently specified.
But also because it can significantly improve the performance of
collation, even for English or Korean when comparing sentences (using
a syntaxic/lexical/morphologic/syllabic breaker, you just need to
compute the collation-strings for each word/syllable/morpheme
separately, starting from the beginning, and the breaker can also be
used to discard non significant parts of sentence, such as included
parts between parentheses).
This also generalizes the concept of sorts with compound keys (such as
SQL selects with ORDER BY clauses), that implicitly use a column
breaker within a result row.
-- Philippe.
This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 06:32:14 CDT