Collation
Q: My script does not sort right because
the characters were assigned to Unicode code points in the wrong order.
What can I do about that?
A: There is a misunderstanding here: Linguistically
meaningful sorting is done not by comparing code point values (an
approach which would fail even for English), but by assigning
multi-level weights to characters or sequences of characters and then
comparing those weights on each level. There are many algorithms and
implementations for this; the standard
Unicode Collation
Algorithm (UCA) comes with a default weight table for all assigned
characters as well as a tailoring mechanism that describes how this
table can be modified to conform to local conventions, where necessary.
[MS]
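The idea of multi-level weights can be illustrated with a minimal Python sketch. It is not an implementation of the UCA: the function name toy_sort_key and its three levels (base letter, accent, case) are assumptions made for illustration, and a real implementation would take its weights from the default table plus any tailoring.

```python
import unicodedata

def toy_sort_key(s):
    """Build a simplified three-level key (NOT the real UCA weights)."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # An accent only matters if the base letters are all equal.
            secondary.append(ord(ch))
        else:
            primary.append(ch.casefold())   # base letter, case ignored
            secondary.append(0)             # placeholder: no accent
            tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare level by level: all primaries first, then accents, then case.
    return (primary, secondary, tertiary)

words = ["Zebra", "apple", "r\u00e9sum\u00e9", "resume", "Apple", "banana"]

by_code_point = sorted(words)                 # uppercase sorts before all lowercase
by_levels = sorted(words, key=toy_sort_key)   # letters first, then accents, then case
```

With this input, code point order puts "Zebra" ahead of "apple", while the level-based key yields apple, Apple, banana, resume, résumé, Zebra.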
Q: How should collations be made
available?
A: Ideally, people should be able to specify a collation
order for any set of data returned by a database query and sorted by a
SQL 'ORDER BY' clause. Actual database implementations may differ in how
they surface the choices of collations to users. Differing collations
should also be specifiable for any comparison (e.g. s1 < s2) of strings,
unless a strictly binary order comparison is intended. People should also be
able to use collations for loose matching and string searching. For more
information, see:
http://www.unicode.org/reports/tr10/
Q: Where can I find out more information
on how Java does collation?
A: Search for the RuleBasedCollator class at
http://java.sun.com/j2se/1.4.2/docs/api/.
For a C/C++ version, you can also look at:
http://www.icu-project.org/.
Q: How should the collation to be used be
specified, taking into account current implementations?
A: To specify a collation, clients should be able to
specify either a locale (e.g. collate as in "de_DE"), or tailoring rules (as in
Java or ICU), or both. Java and ICU also allow merging: e.g. French +
Arabic + tailoring.
Q: UTS #10 Unicode Collation Algorithm is
defined for a particular version of the Unicode Standard, but I am
using characters from a later version of Unicode. What shall I do?
A: You can update to a later version of the Unicode
Collation Algorithm, which will be synchronized with a later version of the Unicode Standard.
The UTC is committed to ensuring
that the Unicode Collation Algorithm is updated in a timely manner, so
that the repertoire of characters in the Default Unicode Collation
Element Table stays in synch with the Unicode Standard. However, if you
need to stay with a particular version of the Unicode Collation
Algorithm for any reason, such as maintaining binary compatibility of
generated key weights, note that the algorithm does assign a default
sorting order to every valid code point, assigned or unassigned. Any
characters that are not assigned in the repertoire for that version will be
given derived, implicit weights in code point order after all of the
assigned characters. See Section 7.1, Derived Collation Elements, for more details.
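As a rough illustration, the derivation of implicit weights can be sketched as follows. The formula and the base value FBC0 follow the description in recent versions of UTS #10 for code points that are neither assigned nor ideographs; other bases apply to ideographs, so check the version you are targeting. The function name implicit_weights is an assumption of this sketch.

```python
def implicit_weights(cp, base=0xFBC0):
    """Derive two collation elements for a code point with no table entry.

    Sketch only: base 0xFBC0 is the value used for code points that are
    unassigned and not ideographs; other categories use other bases.
    """
    aaaa = base + (cp >> 15)            # high bits select the lead primary weight
    bbbb = (cp & 0x7FFF) | 0x8000       # low 15 bits, forced non-zero
    # Each element is (primary, secondary, tertiary).
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]

# U+0378 is assumed here to be unassigned in the chosen version.
elements = implicit_weights(0x0378)
```

Because the primary weights grow with the code point value, two such code points always compare in code point order, after everything that has an explicit entry with a lower primary weight.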
Q: Is transitive consistency maintained by
the UCA?
A: Yes, for any strings A, B, and C, if A < B and B < C,
then A < C. However, implementers must be careful to produce
implementations that accurately reproduce the results of the Unicode
Collation Algorithm as they optimize their own algorithms. It is easy to
perform careless optimizations, especially with
incremental comparison algorithms, that fail this test. Another item
to check is the proper distinction between the bases of accents. For
example, the sequence <u-macron, u-diaeresis-macron> should compare as
less than <u-macron-diaeresis, u-macron>; this is a secondary
distinction, based on the weighting of the accents, which must be
correctly associated with the primary weights of their respective base
letters.
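One way to catch such mistakes is a brute-force check over a set of test strings. The following Python sketch is only a test harness, not part of the UCA; the names find_transitivity_violation, compare_by_key, and cyclic_compare are assumptions, and cyclic_compare is a deliberately broken comparison included to show what a failure looks like.

```python
from itertools import permutations

def find_transitivity_violation(strings, compare):
    """Return a triple (a, b, c) with a < b and b < c but not a < c, or None."""
    for a, b, c in permutations(strings, 3):
        if compare(a, b) < 0 and compare(b, c) < 0 and not compare(a, c) < 0:
            return (a, b, c)
    return None

def compare_by_key(a, b):
    # Comparing precomputed keys is transitive by construction.
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)

def cyclic_compare(a, b):
    # Contrived, intentionally non-transitive comparison (rock-paper-scissors).
    d = (len(b) - len(a)) % 3
    return -1 if d == 1 else (1 if d == 2 else 0)

samples = ["a", "bb", "ccc"]
ok = find_transitivity_violation(samples, compare_by_key)
bad = find_transitivity_violation(samples, cyclic_compare)
```

Here ok is None, while bad is the triple ("a", "bb", "ccc"). In practice, the same harness can be pointed at an optimized comparison function and a larger, more adversarial set of strings.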
Q: Does JIS require tailorings?
A: The Default Unicode Collation Element Table uses the
Unicode order for CJK ideographs (Kanji). This represents a
radical-stroke ordering for the characters in JIS levels 1 and 2. If a
different order is needed, such as an exact match to binary JIS order
for these characters, that can be achieved with tailoring.
Q: How are Hiragana readings handled for
Kanji?
A: There is no algorithmic mapping from Kanji characters to
the phonetic readings for those characters, because there is too much
linguistic variation. The common practice for sorting in a database by
reading is to store the reading in a separate field, and construct the
sort keys from the readings.
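A minimal Python sketch of this practice follows. The record layout and field names are hypothetical, and plain code point order of the Hiragana stands in for a real Japanese collation, which would be used to build the sort keys in a production system.

```python
# Each record stores the Kanji form together with its reading.
records = [
    {"name": "山田", "reading": "やまだ"},
    {"name": "鈴木", "reading": "すずき"},
    {"name": "田中", "reading": "たなか"},
    {"name": "佐藤", "reading": "さとう"},
]

# Sort on the reading field, not on the Kanji themselves.
# Assumption: code point order of Hiragana approximates the desired order.
by_reading = sorted(records, key=lambda rec: rec["reading"])
names_in_order = [rec["name"] for rec in by_reading]
```

The result lists 佐藤 (さとう), 鈴木 (すずき), 田中 (たなか), 山田 (やまだ), an order that cannot be derived from the Kanji alone.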
Q: How are mixed Japanese and Chinese
handled?
A: The Unicode Collation Algorithm specifies how collation
works for a single context. In this respect, mixed Japanese and Chinese
are no different than mixed Swedish and German, or any other languages
that use the same characters. Generally, the customers using a
particular collation will want text sorted uniformly, no matter what the
source: Japanese customers, for example, would want it sorted in the Japanese
fashion. There are contexts where foreign words are called out
separately and sorted in a separate group with different collation
conventions. Such cases would require the source fields to be tagged
with the type of desired collation (or tagged with a language, which is
then used to look up an associated collation).
Q: Are the half-width katakana properly
interleaved with the full-width?
A: Yes, the Default Unicode Collation Element Table
properly interleaves half-width katakana, full-width katakana, and
full-width hiragana. It also interleaves the voicing and semi-voicing
marks correctly, whether they are precomposed or not.
Q: Can the katakana length mark be handled
properly?
A: Yes, by using a combination of contraction and
expansion, the length mark can be tailored to sort according to the
vowel of the previous katakana character. For a description of the
phenomenon involved and how to handle it, see
Contextual Sensitivity.
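The effect of such a tailoring can be imitated with a small Python sketch. It does not use real contraction or expansion rules: the table VOWEL_OF covers only a handful of katakana, the function name toy_key is hypothetical, and placing the length mark after the plain vowel on the lower level is an assumption of the sketch.

```python
LENGTH_MARK = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# Hypothetical, deliberately incomplete table: katakana -> its vowel.
VOWEL_OF = {
    "ア": "ア", "イ": "イ", "ウ": "ウ", "エ": "エ", "オ": "オ",
    "カ": "ア", "キ": "イ", "ク": "ウ", "ケ": "エ", "コ": "オ",
}

def toy_key(word):
    primary, lower = [], []
    prev = None
    for ch in word:
        if ch == LENGTH_MARK and prev in VOWEL_OF:
            # Treat the length mark like the vowel of the previous kana,
            # distinguished from a real vowel only on a lower level.
            primary.append(VOWEL_OF[prev])
            lower.append(1)
        else:
            primary.append(ch)
            lower.append(0)
        prev = ch
    return (primary, lower)

words = ["カイ", "カー", "キー", "カア"]
by_code_point = sorted(words)
by_toy_key = sorted(words, key=toy_key)
```

In code point order the length mark sorts after every kana, so カー lands after カイ. With the toy key, カー sorts next to カア and before カイ.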
Q: How are names in a database sorted
properly?
A: In international sorting, it will make a difference whether
strings in one field are sorted first and strings in a second field are sorted
subsequently, or whether a single sort is done considering both fields together.
This is because international sorting uses multi-level comparison of differences
in strings. Suppose that your database is sorted first by family name, then by
given name. Since family names are sorted first, a secondary or tertiary
difference in the family name will completely swamp a primary difference in
the given name. So {field1=Casares, field2=Zelda} will sort before
{field1=Cásares, field2=Albert}.
This is not the typically desired behavior. The database should be sorted by a
constructed field which contains family name + <separator> + given name.
Typical historical practice was to use a ',' as the separator. However, that
does not work for collation sequences that ignore punctuation. A better
option, which is in CLDR 1.9 or later, is to use U+FFFE as this separator.
CLDR tailors this code point to sort before any other base character, for
exactly this purpose, so that the record with {field1=Cásares, field2=Albert}
sorts before the record with {field1=Casares, field2=Zelda}.
For more information on this topic, see
Interleaved Levels.
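The difference between the two approaches can be sketched in Python with a toy two-level key. This is not the CLDR mechanism itself: the helper names are hypothetical, and the empty string and -1 play the role that U+FFFE plays in a real tailoring, namely a separator that sorts below everything else on each level.

```python
import unicodedata

def levels(s):
    """Toy two-level weights: base letters, then accents (not real UCA)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            secondary.append(ord(ch))
        else:
            primary.append(ch.casefold())
            secondary.append(0)
    return primary, secondary

def field_by_field_key(family, given):
    # All levels of the family name are compared before the given name.
    return (levels(family), levels(given))

def merged_key(family, given):
    # Levels are interleaved across both fields, joined by a lowest separator.
    p1, s1 = levels(family)
    p2, s2 = levels(given)
    return (p1 + [""] + p2, s1 + [-1] + s2)

records = [("Casares", "Zelda"), ("C\u00e1sares", "Albert")]
separate = sorted(records, key=lambda r: field_by_field_key(*r))
merged = sorted(records, key=lambda r: merged_key(*r))
```

Sorting field by field keeps Casares/Zelda first because of the accent difference in the family name. The merged key compares all primary weights first, so Cásares/Albert comes first.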
Q: How can I use the Unicode Collation
Algorithm for a stable sort?
A: A stable sort is one where records that compare as equal come out in
the same relative order as they were originally in. To achieve this, the easiest
way is to append an index number for each record to the sort key for
that record. Whether that sort key comes from strings, other data, or a
concatenation of sort keys, it will then produce a stable sort. Further
information about stable sorts and related topics can be found in
Deterministic Sorting.
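A minimal Python sketch of the index technique follows. The function name stable_order is hypothetical, and str.casefold stands in for a real collation sort key. Python's built-in sort happens to be stable already; the appended index makes the result independent of that property.

```python
def stable_order(records, sort_key):
    """Sort records by sort_key, breaking ties by original position."""
    # Pair each key with the record's index so that no two keys are equal.
    decorated = [(sort_key(rec), index) for index, rec in enumerate(records)]
    decorated.sort()
    return [records[index] for _, index in decorated]

# "A"/"a" and "b"/"B" compare as equal under the stand-in key.
records = ["b", "A", "a", "B"]
result = stable_order(records, str.casefold)
```

Here "A" stays ahead of "a" and "b" stays ahead of "B", matching their original relative order.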
Q: What are the differences between the UCA and ISO 14651?
A: Very broadly, the UCA includes the following features
that are not part of ISO 14651. This is only a sketch; for details see
http://www.unicode.org/reports/tr10/.
- a much more thorough introduction to multilingual sorting
  issues
- much more information about performance and
  implementation practices
- how to apply collation to searching and matching
- uniform handling of canonical equivalents
- variable weighting (allowing punctuation to be ignored or
  not)
- irrelevant combining characters don't interfere with contractions
- well-formedness criteria for tables (disallowing tables
  that would produce peculiar results, e.g. where X and Y don't
  contract, X < Y and yet XY == YX)
Q: What can you tell me about searching and sorting with Braille?
A: The individual Braille patterns are not tied to specific characters. A
pattern that represents an "A" for English might represent a completely
different letter or symbol or ideograph for another language. Therefore,
search and sort engines cannot assume that the underlying meaning of any
individual Braille pattern is fixed. It can and will vary by language,
greatly affecting how searching and sorting rules are defined, and how
strings that contain Braille patterns are interpreted.
[SO]
Q: In my language, "ch" usually sorts like a separate letter. If I want a
foreign word to sort without this happening, how do I do it?
A: Insert U+034F COMBINING GRAPHEME JOINER (CGJ) between the two letters
to block the contraction, as described in
Characters and Combining Marks.
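The blocking effect can be imitated in a toy Python sketch. It is not a real tailoring: the function name toy_key and the weights are assumptions, with "ch" treated as one letter sorting right after "h" (as in Czech), and the CGJ treated as a character that carries no weight of its own but prevents "c" and "h" from being read as a unit.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def toy_key(word):
    """Toy primary weights with a 'ch' contraction (not a real tailoring)."""
    s = word.casefold()
    weights = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            weights.append((ord("h"), 1))   # "ch" as one letter, just after "h"
            i += 2
        elif s[i] == CGJ:
            i += 1                          # ignorable: contributes no weight
        else:
            weights.append((ord(s[i]), 0))
            i += 1
    return weights

words = ["hrad", "chata", "cihla", "c" + CGJ + "hata"]
ordered = sorted(words, key=toy_key)
```

The plain "chata" sorts after "hrad", while the variant with the CGJ sorts under "c", ahead of "cihla".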
Q: What policies constrain allowable changes to UCA between versions?
A: The UTC has established a number of policies which help to keep the UCA and its associated data table (DUCET) stable, even as the UCA is updated to stay in synch with additions to the Unicode Standard. First, there are policies which define how collation weights should be established for newly assigned characters and scripts. Those can be found in UCA Default Criteria for New Characters. There are also policies which limit the kinds of changes which can be made for characters already in the DUCET, and which define how potential updates should be specified and tracked. Those can be found in Change Management for the Unicode Collation Algorithm.