Technical Notes |
Version | 1 |
Authors | Ken Whistler |
Date | 2010-07-28 |
This Version | http://www.unicode.org/notes/tn34/tn34-1.html |
Previous Version | n/a |
Latest Version | http://www.unicode.org/notes/tn34/ |
This document presents a case study of collation issues, using data from a French language topic list to illustrate alternative orders and how to obtain them. It also discusses implementation issues for ordering lists of this type.
This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium. This document is not subject to the Unicode Patent Policy.
For information on Unicode Technical Notes including criteria for acceptance, see http://www.unicode.org/notes/.
Recent discussion on the Unicode general email list has raised a number of issues regarding collation and the Unicode Collation Algorithm (UCA) specifically.
In particular, there has been much discussion about whether having a specific word-breaking mechanism built into the UCA would be an improvement for it, as well as discussion of the problems of French accent weighting (reversed at the secondary level) for efficiency in handling certain kinds of list sorting in actual applications in production.
One of the cases which has been attested as illustrating the problems with the UCA and the need for a word-breaking mechanism and other changes is the sorting of topic lists (and other kinds of lists) on the French Wikipédia.
In lieu of attempting to discuss long examples in email, where the discussion cannot truly do justice to the examples or the problems they pose, I have written up this small technical note, so that I can examine some of the issues in the context of an extended example using actual data from the French Wikipédia.
To illustrate some of the problems, I have selected a sublisting from the "Toutes les pages" topical listing for the French Wikipédia. In particular:
http://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Toutes_les_pages&from=Fleur&to=Fleuve_Nelson
(This page was accessed and excerpted 2010-07-13. Note that the exact range of topics for a particular topic page listing will vary from day to day on the actual site, as entries are added or modified.)
The excerpted list of interest here, in its actual occurring order on that page is shown below. (I have deleted out intervening entries, so as to highlight the collation order issues for "Fleur" and "Fleur de lys" in the list.)
Fleur Fleur-De-Lys Fleur-de-Lys Fleur-de-lys Fleur (Disney, Bambi) Fleur (Homonymie) Fleur (homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) Fleur Adcock ... Fleur De Lampaul Fleur De Lys Fleur De Lys (Homonymie) Fleur De Lys (groupe) Fleur De Métal ... Fleur annuelle ... Fleur de Lune Fleur de Lys Fleur de Mai ... Fleur de lampaul Fleur de lis Fleur de lune Fleur de lys Fleur de lys (homonymie) Fleur de lys florencée Fleur de ma ville ... Fleur stabilisée Fleurac ... Fleurbaix Fleurdelisé Fleurdelysé Fleurdepied ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ...
By inspection of the ordering of this list, it is clear that the ordering being used is a very simple one. It is a single-level sort, giving SPACE (and HYPHEN-MINUS and parentheses) a low primary weight. It also does not treat case distinctions as ignorable, but rather collates uppercase and lowercase as distinct letters in ASCII order. Thus it has "H" < "g" in one obvious instance. Except for the fact that this collation sorts "-" before SPACE, this ordering is otherwise what one would expect for a binary string sort.
Obviously this ordering is far from a desirable one for a long list like this.
A significant improvement in the order would result from a single, change in the collation to ignore case differences. The result of that change is shown below.
Fleur Fleur-De-Lys Fleur-de-Lys Fleur-de-lys Fleur (Disney, Bambi) Fleur (Homonymie) Fleur (homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) Fleur Adcock Fleur annuelle ... Fleur De Lampaul Fleur de lampaul Fleur de lis Fleur de Lune Fleur de lune Fleur De Lys Fleur de Lys Fleur de lys Fleur De Lys (groupe) Fleur De Lys (Homonymie) Fleur de lys (homonymie) Fleur de lys florencée Fleur de ma ville Fleur de Mai Fleur De Métal ... Fleur stabilisée Fleurac ... Fleurbaix Fleurdelisé Fleurdelysé Fleurdepied ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ...
This does a much better job of grouping similar strings together that only differ by casing, and also eliminates ordering anomalies of the "H" < "g" variety.
However, the list order still has problems, in that it widely separates what would otherwise be considered closely related strings, such as "Fleur-de-lys", "Fleur de lys" and "Fleurdelysé".
Now let's consider what would happen to this list if ordered with a multi-level collation based on UCA. First, let's consider the default settings for the UCA implementation by the International Components for Unicode library [ICU]. That library does a full UCA multi-level collation. Its default settings differ from the defaults for UCA per se, in that ICU does not default to the "shifted" option for weighting. That means that the so-called variable elements (e.g., punctuation and symbols) are given primary weights, instead of being shifted to a weighting significance at the fourth level. Given the ICU default settings, the list would order as follows.
Fleur Fleur (Disney, Bambi) Fleur (homonymie) Fleur (Homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) Fleur Adcock Fleur annuelle ... Fleur de lampaul Fleur De Lampaul Fleur de lis Fleur de lune Fleur de Lune Fleur de lys Fleur de Lys Fleur De Lys Fleur De Lys (groupe) Fleur de lys (homonymie) Fleur De Lys (Homonymie) Fleur de lys florencée Fleur de ma ville Fleur de Mai Fleur De Métal ... Fleur stabilisée ... Fleur-de-lys Fleur-de-Lys Fleur-De-Lys ... Fleurac ... Fleurbaix Fleurdelisé Fleurdelysé Fleurdepied ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ...
The main differences to be noted here from Order 2 are that lowercase letters are systematically ordered before uppercase characters, but with a lower level of significance than for the primary ordering differences (a < b < c, etc.). Also, the hyphen-minus character sorts after the space character, instead of before it.
This listing does a better job than Order 2, but still has the problem of widely separating "Fleur-de-lys", "Fleur de lys" and "Fleurdelysé", for example.
Now let's consider what would happen to this list if ordered with a multi-level collation using the UCA with its default settings and table. In this case, the collation would be done with the "shifted" option—something that can be accomplished with the ICU implementation by choosing the "Ignore Punctuation" option.
Fleur Fleurac Fleur Adcock Fleur annuelle Fleurbaix ... Fleur de lampaul Fleur De Lampaul Fleur de lis Fleurdelisé Fleur de lune Fleur de Lune Fleur de lys Fleur-de-lys Fleur de Lys Fleur-de-Lys Fleur De Lys Fleur-De-Lys Fleurdelysé Fleur de lys florencée Fleur De Lys (groupe) Fleur de lys (homonymie) Fleur De Lys (Homonymie) Fleur de ma ville Fleur de Mai Fleur De Métal Fleurdepied ... Fleur (Disney, Bambi) Fleur (homonymie) Fleur (Homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ... Fleur stabilisée ...
In particular note that (by design), UCA with the "shifted" option treats the weighting of punctuation and spaces as less significant than tertiary-level (casing) distinctions. This has the intended effect of grouping together strings that only differ by the presence or absence of punctuation, as in the case of "Fleur-de-lys", "Fleur de lys", and "Fleurdelysé", which now occur closely together in the list.
The bad thing about this ordering, for an application like a topic list in Wikipedia, is that the entries with parenthetical annotations are now widely separated from the same entries without such annotations, because the contents of the annotations are being collated without separation from the rest of the string. The order in the original listing was more or less just a side effect of the fact that the left parenthesis character "(" was given a primary weight lower than any letter, thus forcing all the annotations to sort ahead of any substring beginning with an actual letter--but that was a side effect that people were then depending on for the expected order of entries.
The other problematical aspect of this default UCA ordering is that many list users are unfamiliar with lists that ignore spaces, so that "Fleur < Fleurac < Fleur Adcock" just looks wrong to them. They expect, based on long experience with single-level computer sorts which weight SPACE as a character with a primary weight, that all the entries with "Fleur" as the first "word" will occur first, before "Fleurac", "Fleurbaix", etc.
So let's attempt to address some of the problems with Order 4 by further "fiddling with the dials" of the collation, so to speak.
First of all, we can set a UCA collation to completely ignore tertiary (case) differences. This will help a little, because it will group strings differing by punctuation together a little better, without allowing case distinctions in letters to override the punctuation differences.
The next thing we can do is to treat the annotations as separate fields for collation. In other words, entries with annotations would be treated by comparing the string up to the annotation first (using UCA collation), and then separately ordering next by the annotation (if any).
The ordering which results from these two changes is as follows:
Fleur Fleur (Disney, Bambi) Fleur (homonymie) Fleur (Homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) Fleurac Fleur Adcock Fleur annuelle Fleurbaix ... Fleur de lampaul Fleur De Lampaul Fleur de lis Fleurdelisé Fleur de lune Fleur de Lune Fleur de lys Fleur de Lys Fleur De Lys Fleur De Lys (groupe) Fleur de lys (homonymie) Fleur De Lys (Homonymie) Fleur-de-lys Fleur-de-Lys Fleur-De-Lys Fleurdelysé Fleur de lys florencée Fleur de ma ville Fleur de Mai Fleur De Métal Fleurdepied ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ... Fleur stabilisée ...
I think this is much closer to user expectations for a wiki topic list like this. However, this order does not yet address the issue of people being unprepared for a list that ignores spaces.
So let's see what can be done about addressing the SPACE issue here. If we use a tailored UCA collation table, with a (low) tailored primary weight for SPACE, we will end up with much of the expected behavior in the listing, while still gaining the rest of the benefits of the multilevel sorting.
If we use this tailoring for SPACE, while also using the setting to ignore case differences and again treating parenthetical annotations as a second, separate field for sorting, we get the following listing:
Fleur Fleur (Disney, Bambi) Fleur (homonymie) Fleur (Homonymie) Fleur (personnage) Fleur (plante) Fleur (prénom) Fleur Adcock Fleur annuelle ... Fleur de lampaul Fleur De Lampaul Fleur de lis Fleur de lune Fleur de Lune Fleur de lys Fleur de Lys Fleur De Lys Fleur De Lys (groupe) Fleur de lys (homonymie) Fleur De Lys (Homonymie) Fleur de lys florencée Fleur de ma ville Fleur de Mai Fleur De Métal ... Fleur stabilisée ... Fleurac ... Fleurbaix ... Fleurdelisé Fleur-de-lys Fleur-de-Lys Fleur-De-Lys Fleurdelysé Fleurdepied ... Fleurs de bach Fleurs de lis Fleurs de lys Fleurs de ruine ...
I suspect that this is about the best you can do to meet user expectations for a listing like this, without starting down the path to much fancier (and expensive) techniques such as lemmatizing.
For example, in principle, it would be very nice if a plural entry such as "Fleurs de lys" could be found in the list right next to "Fleur de lys", instead of after "Fleurbaix,... Fleurdepied...," and many other entries. However, to accomplish that requires lemmatizing (which in turn requires support from a morphological dictionary) that knows that "Fleurs" is a plural form of "Fleur" and should be grouped with a singular head entry.
Similarly, it would be helpful to group alternative spellings together in some way, so that "Fleur de lis" is related to "Fleur de lys" in some apparent way, rather than being separated alphabetically by the unrelated "Fleur de lune", but identifying variant spellings requires dictionary lookup, and listing them usably requires some kind of cross-reference mechanism, rather than a simple list.
Lemmatized lists also cannot easily be presented as simple lists like this one. If you attempt to do so, you end up with lists that don't look like they are "in alphabetical order" and end up confusing the users. Fully lemmatized lists are instead generally presented in some kind of layered structure as "dictionary entries"—with lemmatized headwords, and with related items subordered beneath those headwords according to various principles.
If we can agree that an ordering for a topical list somewhat similar to that described above under Order 6 is roughly what people would prefer in such listings (as opposed to the existing behavior demonstrated under Order 1), are there implications that extend to how the Unicode Collation Algorithm should be applied or possibly modified to facilitate such ordering?
It has been argued that such topical list sorting poses additional requirements for the Unicode Collation Algorithm, and in particular that the behavior of French accents (which are weighted in the reverse direction at the secondary level in the UCA) poses difficulties for sorting such lists.
It should be apparent from the discussion above that in principle topical list sorting of this type is not particularly problematical. If the correct parameters are chosen for a UCA-based multilevel sort, with a tailoring of the SPACE character to match expectations, and with parenthetical trailing annotations treated as a separately weighted field, the expected results fall out fairly easily, as shown above.
It could be argued, however, that the algorithm should be enhanced so that this particular behavior just results by default, and could work for each entry simply treated as strings to compare. But it is clear that the UCA was not intended to be able to mimic the collation of fielded data. It is actually a better design to keep it that way, because there is no easy way to enhance the algorithm to mimic all the kinds of effects people might want from sequential comparison of separate fields, and an attempt to do so might clutter up the algorithm in ways that would disturb its intended scope for the simple comparison of two strings. Fielded comparisons need to be done by actually fielding the data to be compared. That is a much more flexible approach to the general problem than trying to build it into the string comparison (or sortkey construction) primitives for collation.
However, while in principle ordering such lists is not very complex, there may be practical concerns regarding storage, performance, or other issues which concern people. So let's take up a discussion of those issues.
The two main ways that the UCA is implemented in APIs are for incremental string comparison and for construction of full sortkeys for storage for binary comparison.
Incremental string comparison is optimal for the quick comparison of two strings according to a UCA multi-level collation table, but where the comparison of the same strings does not have to be done multiple times. In incremental comparison, collation elements for the strings are only compared until a difference is found between the strings, and a full sortkey doesn't have to be constructed. This saves both processing time and (usually) memory allocation.
Construction of full sortkeys for storage and later binary comparison is optimal for tasks such as the indexing of large tables, where any given single string may need to be compared hundreds, thousands, or even millions of times against other strings. In such a context, the additional processing time and allocation overhead to construct and store full sortkeys may be far outweighed by the gains from being able to do the multiple comparisons using very fast binary memory comparison (memcmp) operations on the stored sortkeys.
The particular issue posed by rules for French accent weighting and the UCA is that the lexicographical ordering rules for French accents specify that they are significant from the end of strings, rather than from the beginning. This differs from the treatment of accents for most languages in collation, and results in special parameters and sortkey construction rules for French in the UCA. (For details, see [UTS10].)
The concern for now is whether the requirements for handling French accents cause significant performance or architectural issues for handling cases such as sorting topical lists in the French Wikipédia.
There is no question but that the French accent handling in UCA imposes a performance penalty for comparing strings according to French rules. The disagreement is regarding how significant that performance penalty is, and whether it is related in some important way to the length and structure of strings to be compared.
In the case of construction of full sortkeys, the performance penalty for French reverse accent weighting is essentially trivial, unless working with truly enormous strings and keys that require piecewise buffering. If the algorithm has complete buffer access to the string and if the sortkey output can be preallocated, then the secondary weights are simply filled into the output key in reverse order. If the sortkey is constructed stepwise from the front of the string, and in particular in cases of very large strings, where it may not be practical to hold onto the entire string at once in fast buffers, then a practical approach to the problem is to construct the entire sortkey in forward order, and then once it has been constructed, reverse the run of secondary key weights. Doing so imposes a performance penalty in key construction, but particularly in cases where there is separate optimization to compact sortkeys into shorter lengths, such secondary key reversal can be incorporated into the rest of the sortkey post-processing at that point.
For incremental string comparison, on the other hand, the situation for French accent reversal is rather simpler. An efficient incremental string comparison doing French accent handling simply keeps track of a secondary weight flag while processing the string incrementally. Only if the processing gets all the way to the end of the string without a primary difference having been triggered, does the secondary flag then need to be examined. And when it is checked, it simply has flagged the last secondary comparison that made a difference, rather than the first in the string. There is never any need to reprocess the string from the back to the front, just to check secondary weights incrementally. The code to keep track of this secondary weight flag makes the incremental comparison algorithm more complicated, but the performance difference for forward accent weighting versus reverse accent weighting is down in the noise.
Now if we consider this reverse accent weighting in the context of requests for word boundary determination in the UCA, it becomes apparent that neither full sortkey construction nor incremental string comparison would require some kind of word boundary chunking of the strings for comparison, in order to have efficient (and correct) key construction.
For full, stored sortkeys, if there is a secondary (accent) distinction in the first word of two compared entries, it is still correct and necessary to check the primary weights of all the subsequent words in both strings for a difference, before that secondary distinction in the first word could be determinative of order. The sortkeys are already constructed so that the binary comparisons work by hitting a difference in the key at precisely the first point that makes a difference. Introducing some kind of word breaker to chunk the string into word increments for sortkey construction doesn't help at all. Instead, what word chunking does is attempt to emulate a string comparison based on fields for each word, and if used when actually building full sortkeys for a string would just produce rather jumbled and hard-to-interpret results.
For incremental string comparison, roughly the same considerations apply. A properly constructed implementation of an incremental string comparison gets to a significant secondary comparison at precisely the correct point—which is also the earliest possible point in the comparison that the difference could be significant. Introducing a word chunking step into the comparison accomplishes precisely nothing.
In practical applications such as a wiki topical list, there may, however, be additional considerations. If an application is constrained in storage, it may need to make various accomodations to what would otherwise be an ideal implementation. In particular, there may be cases where implementations of long, large lists are designed to truncate either the strings themselves or the stored sortkeys (or both).
If the strings for long lists are truncated, that is an application design issue that is basically outside the consideration of UCA per se. UCA (or any other weighted string comparison mechanism) can only work with the input data it has.
Effective truncation of strings depends on an application designer correctly identifying what parts of strings represent redundant information, and what represent locally significant information. If the data for lists contains long segments of repetitive and redundant information that interfere with seeing, displaying, or comparing the significantly different portions, application strategies that can be taken include (database) normalization into distinct tables for different parts of the strings. For display of such data, one can devise hierarchical strategies that mask repetitive data. For ordering of such hierarchical lists, the string comparisons (or sortkey storage) would be done separately at each level of the hierarchy, and if done that way, each comparison would be done with shorter strings (or shorter keys) relevant to the level in question.
In any case, these kinds of approaches are effectively means of fielding the data and then doing ordering within those fields. It is the choice of the design for fielding the data that makes the difference. The sortkey construction mechanism itself does not require special tweaking for word-by-word parsing (or other kinds of chunking) to then work efficiently to support ordering of the fielded lists.
On the other hand if the application design chooses to truncate sortkeys, there can be more unexpected results. An application that uses an implementation of UCA to construct sortkeys for arbitrarily long strings, but which then truncates them at some defined size for storage is essentially saying that at some point for long strings it doesn't matter how they compare—as beyond that point (and somewhat at random from the point of view of what is in them) strings will not compare as different even if their content actually is different.
When faced with a tradeoff like this, it would make sense to use whatever options the implementation has to optimize and compact sortkeys, as at least such options "know what they are doing" in making sortkeys shorter, whereas simple truncation of sortkeys is done with no knowledge of what point in a string and at what level of significance the truncation will matter.
It has been claimed that in a situation like this, and in particular for French data, that augmenting UCA with a word breaking option would somehow help with the problem and produce better results for list ordering when sortkeys are being truncated for long strings, but I think that claim is incorrect. First, if, as is usually the case, most strings in the list compare unequal at the primary level fairly early in the strings, attempting to chunk the strings for sortkey construction isn't going to matter for those strings anyway. Second, in the aberrant case in which there are many strings that have long identical prefixes and where the significant differences are getting lost due to sortkey truncation, the chunking isn't going to help, either—in such a case truncation of sortkeys is simply an application design error, and the problem should be handled in some other way.
But more fundamentally, word chunking for key construction doesn't really accomplish what its proponents seem to be claiming for it, as illustrated by the following schematic example. Compare the strings:
aaá xxxxxxxxxxxxxxxxxxxxxxxxxxxxx... aaa xxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
The essential claim seems to be that if I have strings like this, where there is a secondary difference early in a string, say in the very first "word" of the data, and if the strings are otherwise primary equal for the rest of the strings, and if my application is doing arbitrary length sortkey truncation which turns out to have truncated the secondary difference, so that I cannot tell that the first string actually compares greater than the second, then an enhancement to UCA that constructs sortkeys one word chunk at a time will get the correct answer for these strings.
Such an argument hangs on a very thin thread, however, because it assumes that the truncation is only hitting the secondary weights and not any primary weights. For example, given the following strings and a truncation which hits the sortkeys at the point indicated in the strings by a caret:
aaá xxxxxxxxxxxxxxxxxxxxxxxxx^xxxx... aaa xxxxxxxxxxxxxxxxxxxxxxxxx^xxxy...
An answer based on the difference in the accent in the first word would actually be wrong. The more significant difference in the primary weights (for the "y") would have been lost in the truncation. Essentially, you cannot get the correct results for long strings by comparing only the first words (including their secondary differences), unless you continue processing until you run out of primary differences, which could be very, very far along the string indeed.
And the situation for word-by-word chunking as a part of the sortkey construction only gets more complicated when you start introducing differences in word lengths, and start to consider the issues of how to parse and/or weight punctuation that might or might not be considered word boundaries, such as hyphens and apostrophes.
I conclude from this that an "enhancement" of UCA sortkey construction to somehow include a word boundary breaker isn't at all an answer for the conundrums and errors of comparison which may result from sortkey truncation for long strings. In such cases, the better answer is to examine the application design and the nature of the strings to be compared, to look for other optimizations that would produce better results given the application's storage constraints.
If a request for a word-chunking enhancement to the UCA to deal with edge case problems for sortkey truncation makes no sense, then what is the actual problem in list ordering that people advocating for it are trying to address.
In my opinion, the real goal here is to "fix" the behavior that the UCA algorithm displays when using the default table and default parameters, which results in listings that don't treat spaces as having primary weights, and which thus don't highlight word identity (by which people usually mean just "things separated by spaces") in ordered listings. In other words, people just don't like lists ordered as in the example in Section 2.4 above.
The fix for that, however, is not to somehow build in a word-breaker in the default algorithm, but instead to address the actual user issue, which is how spaces are being weighted for comparison.
Using the default settings for the ICU implementation gives the expected behavior for spaces, as long as the desired collation doesn't also require ignoring other punctuation differences.
If a collation is desired which ignores punctuation differences but still considers spaces as significant, the simple expedient of giving the space character a low primary weight and building sortkeys (or comparing strings incrementally) with that weighting restores the salient aspect of the list ordering that people are actually looking for, as demonstrated in the example in Section 2.6 above. Such behavior is entirely consistent with the UCA as currently defined. All it requires is tailoring the weight for SPACE in the default table, or alternatively supporting such a reweighting parametrically in an implementation of the UCA.
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ICU] | International Components for Unicode http://www.icu-project.org/ Widely used Unicode library with full UCA collation support. |
[ICUDemo] | ICU Locale-Specific Collation Demo http://demo.icu-project.org/icu-bin/locexp?_=en&d_=en&x=col Demo which shows the behavior of UCA for sorting lists, allowing choice of various collation options. |
[UTS10] | UTS #10: Unicode Collation Algorithm (UCA) http://www.unicode.org/reports/tr10/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
The following summarizes modifications from the previous version of this document.
1 | Initial version. |
Copyright © 2010 Ken Whistler and Unicode, Inc. All Rights Reserved. The Unicode Consortium and [authors] make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.