From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 12 2010 - 16:06:37 CDT
Philippe Verdy said:
> If we don't limit the backwards reordering, then all accents in the
> full sentences will be reordered, so this is the final word that will
> drive the order. not only this is incorrect,
I understand that you think that the ordering should be done
word-by-word, with the French accent consideration being done
backward only within the scope of the individual word comparisons, but...
> ... but having to fully
> reorder the secondary weights in the full sentence has a significant
> memory and performance impact, when we could just limit ourself to
> comparing the sentences word by word, starting from the beginning of
> each sentence.
That claim is simply wrong.
A. If you are *storing* keys, then the memory size of the resultant
keys is no larger for a (well-designed) backwards French accent
collation, than for forward accent collation. And then the comparison
time for binary comparison of the full-formed keys is also no different,
on average.
B. If you are doing *incremental* comparisons, then it would be
foolish to construct entire keys and scan all the way to the
end of long strings looking for accents to reverse, before
making comparison decisions based on primary weights at the
beginning of strings. Any well-designed incremental UCA string
comparison should be checking primary values distinctly from
secondary values *incrementally*, so that a difference, say,
between "d" and "e" in the first character does not wait for
determination of the presence or absence of an accent at the
end of a string -- perhaps 250 positions down the string or more --
before deciding the order relation between the two strings.
Yes, in the incremental comparison case, there *is* an additional
cost for processing for backwards accents, but it only tends
to be significant for the relatively unusual cases of comparison
of strings which are all equal at the primary level and which only
differ by accents.
And doing string compares word by word introduces its own level
of complication into comparison operations, so it is not at
all obvious that processing that way would improve performance
over an incremental comparison without word boundaries.
> Such case will often occur when sorting article names (e.g. in
> Wikipédia), book and movie titles... And when there will be list of
> people names, where the first (given) name is placed after the last
> (birth) name, separated by a comma, or when sorting titles with an
> additional precision at end between parenthèses (such as a country
> name to disambiguate city names, e.g. in Wikipédia) :
Names of people in comma-delimited lists, possibly.
But I challenge you to provide some real world examples of
your last case -- where you have some pair of city names
that differ *only* by accent placement, disambiguated by
a following parenthetical country (or region, or whatever, I
don't care) name, where the country names *also* only
differ by accent placement. And if you come up with such
an example, explain how likely it is that anyone using such
a list, amongst a list of thousands others like it, would
be to notice that there was a backwards accent weighting
issue in the order of the two entries.
> These final precisions (added in separate words) clearly have a low
> priority in the sort order, so the backwards reordering should really
> not place them with a higher priority in the secondary collation
> level.
Theoretically, perhaps. But practically this will make no
difference, I predict.
> If we use a word breaker, not only the sort order will be
> correct, but we will also gain performance (due to shorter buffering
> of collation weights, we don't necessarily have to process all the
> strings completely, but just the beginning most of the time).
Not true. See the discussion above.
> So when you snipped completely the discussion about word-breaking
> within collation, you completely forgot the main point of my message:
I will remind you that I was not responding to *you*, Philippe, when
I responded to CE Whitehead. So I wasn't forgetting your main point
at all -- just not responding to it.
> that the UCA algorithm does not include any entry points for insering
> a word breaker within the algorithm (or any custom breaker that a
> tailored collation would need, for example in a complex tailoring that
> would recognize space-separated particles in proper names, as if this
> space had a lower difference than a true word-separation)
Yes, we can take it as stipulated that the UCA algorithm
"does not include any entry points for inser[t]ing
a word breaker within the algorithm." That was by design,
as the algorithm is designed for collation of strings --
not for word-by-word collation of delimited, structured text.
> Yes using a breaker in UCA collation is clearly optional
Yes, anyone who wants to enhance UCA collation to implement
collation that takes into account word-by-word behavior can
clearly do so.
> (and not
> absolutely needed if all levels are forewards in some language, as it
> will have no visible effect on the collation order except for
> increased performance), but it's absolutely essential if we use any
> backwards reordering at any level.
I disagree with that assessment.
> The absence of the word-breaker makes the current UCA algorithm
> unusable for French in practice (and also very costly in terms of
> space, and unnecessarily slow), when sorting lists of items longer
> than a single word
Figures, please, to back up that claim.
> : it can only work in practice when sorting basic
> dictionary entries (for example it works extremely well when sorting
> French Wiktionary categories, but it fails miserably when sorting
> French Wikipedia categories, for which the expected backwards ordering
> of level 2 has to become forewards, even if this gives non-optimal
> sorts within some small groups of article titles).
Quantify "fails miserably", when the correctness issue you
cite would only apply in cases in comparison of strings which
otherwise have no primary differences at all.
> And that's why I request comments about the effect of the backwards
> ordering, the way it is currently specified.
What comments are you requesting, exactly? If they are along the lines
of what you have just claimed in this note, then I can predict
you are going to find considerable disagreement.
> But also because it can significantly improve the performance of
> collation, even for English or Korean when comparing sentences (using
> a syntaxic/lexical/morphologic/syllabic breaker, you just need to
> compute the collation-strings for each word/syllable/morpheme
> separately, starting from the beginning,
Again, I don't know where you are getting these claims. If you
are computing whole keys for storage, there are no significant
differences.
If you are doing *incremental* string comparison, such an algorithm
*already* is optimized to make order determinations as early in
a string as possible.
If you force word-by-word comparison, you are likely to actually
*slow down* string comparison, as compared to an optimized
incremental string comparison, as you then need to parse off
each unit (word or whatever, by whatever criterion), then
compute a key for each pair, and compare the keys. That is
significantly *more* processing than a well-optimized incremental
string comparison. If you think otherwise, then I'd suggest
you provide an actual algorithm for doing the kind of processing
you envision here, and let people analyze the algorithm to see
if it could actually out-perform incremental string comparisons
for the same input.
I'm not denying that there are cases where word-by-word (or
other kinds of unit) comparisons are actually desirable -- I
just don't buy your argument that this is a route for improved
comparison speed over UCA, even with French rules for backwards
accent comparison.
> and the breaker can also be
> used to discard non significant parts of sentence, such as included
> parts between parentheses).
Huh? That is just preprocessing to delete portions of strings
before calculating keys. If you want to do so, be my guest,
but building in arbitrary rules of content suppression into
the UCA algorithm itself is a non-starter.
--Ken
This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 16:08:16 CDT