Collation
Comparing Unicode strings is one of the most common operations
that processes implementing the Unicode Standard need to perform.
As with ASCII or any other character encoding, Unicode
strings can simply be compared by their binary code values.
Linguistically relevant comparison, however, requires a more
sophisticated approach that takes casing and accents into
account, ignores certain characters, and so on.
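To make the limitation of binary comparison concrete, the following minimal Python sketch sorts a small word list by code point. The word list is a hypothetical example chosen for illustration; Python compares `str` values code point by code point, so `sorted()` without a key gives a binary ordering.

```python
# Hypothetical sample data; "\u00e9" is the precomposed letter e with acute.
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zoo", "eclair"]

# With no key function, sorted() orders strings by code point value.
binary_order = sorted(words)
print(binary_order)
# All uppercase-initial words come first (A-Z precede a-z in code point
# order), and the accented word lands after "zebra" because U+00E9 is
# greater than U+007A.
```

Most readers would not expect "Zoo" to precede "apple" or "éclair" to follow "zebra" in a dictionary-style list, which is the gap that linguistically aware comparison is meant to close.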
The Unicode Consortium has published a separate standard devoted
specifically to this issue of string comparison, or collation:
UTS #10, Unicode Collation Algorithm (UCA).
That algorithm provides a complete specification of how
to generate collation keys for Unicode strings. Those
collation keys can then be compared directly to determine
how the Unicode strings they were generated from compare
with one another. Collation keys can also be used in
string matching and searching operations.
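The idea of generating a key once and then comparing keys directly can be illustrated with a toy sketch in Python. This is not the Unicode Collation Algorithm and does not use DUCET weights: the function `toy_collation_key` and its three levels (base letters, then accents, then case) are assumptions made purely for illustration, built only on the standard `unicodedata` module.

```python
import unicodedata

def toy_collation_key(text):
    """Hypothetical three-level key; NOT the UCA and not based on DUCET."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    # Level 1: base letters only, case folded.
    primary = "".join(base).casefold()
    # Level 2: combining marks, consulted only when level 1 ties.
    secondary = tuple(marks)
    # Level 3: case, with lowercase (False) ordered before uppercase (True).
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, secondary, tertiary)

# Hypothetical sample data; "\u00e9" is the precomposed letter e with acute.
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zoo", "eclair"]

# Keys are plain tuples, so they can be compared directly with < and ==.
toy_order = sorted(words, key=toy_collation_key)
print(toy_order)
```

Because the keys are ordinary values, they can be computed once, stored, and compared repeatedly without revisiting the original strings. A production system would rely on an actual implementation of UTS #10, such as ICU, rather than on a simplification like this one.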
Committees Responsible for Collation
The Unicode Technical Committee (UTC) is responsible
for the maintenance of both the Unicode Collation Algorithm
and the Default Unicode Collation Element Table (DUCET), which
provides all the basic collation key weighting information
used by the algorithm.
The CLDR Technical Committee is responsible
for maintaining information about language-specific tailoring
of the Unicode Collation Algorithm, such as a Swedish-specific
collation, a Czech-specific collation, and so forth. Such
information is specified, using the CLDR tailoring syntax, in the
Common Locale Data Repository (CLDR).
Policies Regarding Collation
The UTC has defined detailed policies that it uses in the maintenance
of the DUCET for the Unicode Collation Algorithm.
The first set of policies covers constraints on how the
existing DUCET can be changed. Those can be found in
Change Management for the Unicode Collation Algorithm.
The second set of policies specifies criteria by which initial
collation weights are assigned to characters newly added to the Unicode
Standard. Those can be found in UCA Default Table Criteria for New Characters.