Technical Notes |
Version | 1 |
Authors | Mark Davis |
Date | 2003-07-28 |
This Version | http://www.unicode.org/notes/tn9/tn9-1.html |
Previous Version | none |
Latest Version | http://www.unicode.org/notes/tn9/ |
There is often confusion about what is meant by the terms "stable" or "deterministic" when applied to sorting or comparison. This Technical Notes clarifies the issues surrounding these terms.
This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes.
There is often a good deal of confusion about what is meant by the terms "stable" or "deterministic" when applied to sorting or comparison. This confusion in terms often leads people to make mistakes in their software architecture, or make choices of language-sensitive comparison options that have significant impact in terms of performance and footprint, and yet do not give the results that users expect.
This technical note describes the meaning of each of these term in the proper context, and clarifies their usage and implications.
A stable sort is one where two records will retain their order when sorted according to a particular field, even when the two fields have the same contents. Thus those two records come out in the same relative order that they were in before sorting, although their positions relative to other records may change. Importantly, this is a property of the sorting algorithm, not the comparison mechanism.
Two examples of differing sorting algorithms are Quick Sort and Merge. Quick Sort is not stable while Merge Sort is stable. (A Bubble Sort, as typically implemented, is also stable.)
Suppose that you have the following records and you sort on the Last_Name field only:
Record | Last_Name | First_Name |
---|---|---|
1 | Davis | John |
2 | Davis | Mark |
3 | Curtner | Fred |
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
2 | Davis | Mark |
1 | Davis | John |
As you see from the above, the Quicksort algorithm is not stable; the records 1 and 2 are not in the same order they were in before sorting.
A stable sort is often desired; for one thing, it allows records to be successively sorting according to different fields, and retain the correct lexigraphic order. Thus with a stable sort, one could all the records by First_Name, then sort all the records by Last_Name, giving the desired results: that all records would be sorted by Last_Name, and where the Last_Names are the same, sorted by First_Name.
A deterministic sort is a very different beast. This is a sort algorithm that returns the same results each time. On the face of it, it would seem odd for any sort algorithm to not be deterministic, but there are examples of real-world sort algorithms that aren't.
The key concept is that these sort algorithms are deterministic when two records have unequal fields, but they may return different results at different times when two records have equal fields.
For example, a classic Quick Sort algorithm works recursively on ranges of records. For any given range of records, it takes the first element as the pivot element. We're not going to explain the details here, but that algorithm performs badly with input data that happens to be sorted (or mostly sorted). A randomized Quick Sort, which picks a random element as the pivot, can on average be faster. The results it gets are not deterministic; from exactly the same input, different results can be output:
|
or |
|
As another example, multiprocessor sorting algorithms can be non-deterministic. The work of sorting different blocks of data are farmed out to different processors and then merged back together. The ordering of records with equal fields might be different according to when different processors finish different tasks.
Note that a deterministic sort is weaker than a stable sort. A stable sort is always deterministic, but not vice versa. And typically when people say they want a deterministic sort, they really mean that they want a stable sort.
A deterministic comparison is different than either of the above; it is a property of a comparison function, not a sorting algorithm. This is a comparison where strings that do not have identical binary contents (optionally, after some process of normalization) will compare as unequal. A deterministic comparison is sometimes called a stable (or semi-stable) comparison.
There are many people who confuse a deterministic comparison with a deterministic (or stable) sort, but once again, these are very different creatures. A comparison is used by a sorting algorithm to determine the relative ordering of two fields, such as strings. Using a deterministic comparison cannot cause a sort to be deterministic, nor to be stable. Whether a sort is deterministic or stable is a property of the sort algorithm, not the comparison function. If you look at the examples above, this is clear.
One can produce a deterministic comparison function from a non-deterministic one, in the following way (in pseudo-code):
int new_compare (String a, String b) { int result = old_compare(a, b); if (result == 0) { result = binary_compare(a, b); } return result; }
Programs typically also provide the facility to generate a sort key,
which is a sequences of bytes generated from a string in alignment with a
comparison function. Two sort keys will binary-compare in the same order as
their original strings. To create a deterministic sort key that aligns
with the above new_compare
, the simplest means is to append a
copy of the original string to the sort key. This will force the comparison to
be deterministic.
byteSequence new_sort_key (String a) { return old_sort_key(a) + SEPARATOR + toByteSequence(a); }
Because sort keys and comparisons must be aligned, a sort key generator is deterministic if and only if a comparison is.
However, a deterministic comparison is generally not best practice. First, it has a certain performance cost in comparison, and a quite substantial impact on sort key size. (For example, [ICU] language-sensitive sort keys are generally about the size of the original string, so appending a copy of the original string generally doubles the size of the sort key.) A database using these sort keys can have substantially increased disk footprint and memory footprint, and consequently reduced performance.
More importantly, a deterministic comparison function does not actually achieve the effect people think it will have. Look at the sorted examples above. Whether a deterministic comparison is use or not, there will be no effect on Quick Sort example, since the two records in question have identical last name fields. It does not make a non-deterministic sort into a deterministic one, nor does it make a non-stable sort into a stable one.
Thirdly, a deterministic comparison is often not what is wanted, when people look closely at the implications. Look at our above example, and suppose that this time the user is sorting first by last name, then by first name.
Record | Last_Name | First_Name |
---|---|---|
1 | Davis | John |
2 | Davis | Mark |
3 | Curtner | Fred |
The desired results are the following, which should result whether the sorting algorithm is stable or not, since we are using both fields.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
Now suppose that in record 2, the source for the data caused the last name to contain a control character, such as a ZWJ (used to request ligatures on display). There is in this case no visible distinction in the forms, since the font doesn't have any ligatures for these values. The standard UCA collation sequence causes that character to be—correctly—ignored in comparison, since it should only affect rendering. However, if that comparison is changed to be deterministic (by appending the binary original string, then unexpected results will occur.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
2 | Davis | Mark |
1 | Davis | John |
Typically, what people really want when they say they want a deterministic comparison is actually a stable sort.
One can force non-stable sorting algorithm to produce stable results by how one does the comparison. However, it has literally nothing to do with making the comparison be deterministic or not. Instead, it can be done by appending the current record number to the strings to be compared. (The implementation may not actually append the number; it may use some other mechanism, but the effect would be the same.)
If such a modified sort comparison is used, for example, then it forces Quick Sort to get the same results as Merge Sort. And in that case, the irrelevant character ZWJ does not affect the outcome, as illustrated below, and the correct results occur.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
If anything, this then is what users want when they say they want a deterministic comparison.
There are a few other terms worth mentioning, simply because they are also subject to considerable confusion. Any or all of the following may be easily confused with above.
A stable comparison is one that does not change over successive software versions. That is, as one uses successive versions of an API, with the same "settings" (such as locale), one gets the same results.
A stable sort key generator is one that generates the same binary sequence over successive software versions.
Warning: if the sort key generator is stable, then the associated comparison will perforce be. However, the reverse is not guaranteed. To take a trivial example, suppose the new version of the software always adds an 0xFF byte at the front of every sort key. The results of any comparison of any two new keys would be identical to the results of the comparison of any two corresponding old keys. But the bytes have changed, and the comparison of old and new keys would give different results. Thus one can have a stable comparison, yet an associated non-stable sort key generator.
A portable comparison is where corresponding APIs for comparison produce the same results across different platforms. That is, if one uses the same "settings" (such as locale), one gets the same results.
A portable sort key generator is where corresponding sort key APIs produce exactly the same sequence of bytes across different platforms.
Warning: as above, a comparison may be portable without the associated sort key generator being portable.
Ideally, all products would have the same string comparison and sort key generation for, say Swedish, and thus be portable. For historical reasons, this is not the case. Even if the main letters sort the same, there will be differences in the handling of other letters, or symbols, punctuation, and other characters. There are some libraries that offer portable comparison, such as [ICU], but in general the results of comparison or sort key generation may vary significantly between different platforms.
In a closed system, or in simple scenarios, portability may not matter. Where someone has a given set of data to present to a user, and just wants the output to be reasonably appropriate for Swedish, then the exact order on the screen may not matter.
In other circumstances, it can lead to data corruption. For example, suppose that two implementations do a database SELECT for records between a pair of strings. If the collation is different in the least way, they can get different data results. Financial data might be different, for example, if a city is included in one SELECT on one platform and excluded from the same SELECT on another platform.
There are efforts to try to come to some common ground on these sorts of issues. See [CommonLocale] for more information.
Thanks to Ken Whistler and Cathy Wissink for their many useful suggestions and contributions to the text.
[CommonLocale] | For information on the Common XML Locale Repository
project, see http://www.openi18n.org/subgroups/lade/locale/. For the XML specification, see http://www.openi18n.org/specs/ldml/1.0/ldml-spec.htm. |
[ICU] | For information about the ICU Unicode service library
(C, C++, Java) see http://www.icu-project.org |
[Interleaved] | For a discussion of other issues connected with sorting multiple fields, see http://www.unicode.org/reports/tr10/tr10-10.html#Interleaved_Levels. |
[SortAlg] | For background on the names and characteristics of
different sorting methods, see http://www.aeriesoft.ru/Projects/SortAlg/ |
[UTS10] | For general issues regarding language-sensitive
collation, see http://www.unicode.org/reports/tr10/tr10-10.html. |
[Unstable] | For a definition of stable sorting, see http://planetmath.org/encyclopedia/UnstableSortingAlgorithm.html |
The following summarizes modifications from the previous version of this document.
2008.10.01 | Updated stale links in version 1 |
1 | Initial version |
© 2003 Mark Davis. This publication is protected by copyright, and permission must be obtained from the author and Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use.
Use of this publication is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.