Unicode® Collation Algorithm
Conformance Tests

Version 16.0.0
2024-08-25

The following files provide conformance tests for the Unicode Collation Algorithm (UTS #10: Unicode Collation Algorithm).

CollationTest_SHIFTED.txt
CollationTest_NON_IGNORABLE.txt
CollationTest_SHIFTED_SHORT.txt
CollationTest_NON_IGNORABLE_SHORT.txt

These files are large, and thus packaged in zip format to save download time.

Note: These files test the sort order of an untailored DUCET table. If you are using an implementation of the CLDR Collation Algorithm with its tailored root collation data, for example ICU or a library that uses ICU for collation, then you need to test with files that reflect that sort order. The CLDR collation conformance test files have the same names (except for an added _CLDR infix) and structures as the ones here for the DUCET. You can find them in the CLDR GitHub repo in the folder “common/uca”, or in the CLDR data file download area, in the “cldr-common-*.zip” file, again in the folder “common/uca”. Select the files for the version of CLDR that is used in the implementation.

Format

There are four different files, which differ in two ways:

The SHIFTED and NON_IGNORABLE files correspond to the Shifted and Non-ignorable options for Variable Weighting.
The SHORT versions omit the comments, for more compact storage.

The format is illustrated by the following example:

0385 0021;  # (΅) GREEK DIALYTIKA TONOS  [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]

The part before the semicolon is the hex representation of a sequence of Unicode code points. After the hash mark is a comment. This comment is purely informational, and may change in the future. Currently it consists of the characters of the sequence in parentheses, the name of the first code point, and a representation of the sort key for the sequence.

The sort key representation is in square brackets. It uses a vertical bar for the ZERO separator. Between the bars are the primary, secondary, tertiary, and quaternary weights (if any), in hex.

Note: The sort key is purely informational. UCA does not require the production of any particular sort key, as long as the results of comparisons match.
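As an illustration of the format described above, the following is a minimal Python sketch for reading one line. The function names parse_test_line and parse_sort_key are hypothetical, chosen for this example, and are not part of the test files or of UTS #10. The sketch assumes that the sort key, when present, is the last bracketed group in the comment. Because the comment is informational and may change, a test program should not depend on the sort key part.

```python
def parse_test_line(line):
    """Return (code_points, comment) for a data line.

    Returns None for blank lines and for lines that hold only a comment.
    """
    # Everything after the first hash mark is the comment.
    data, _, comment = line.partition("#")
    data = data.strip()
    if not data:
        return None
    # The code point sequence is the part before the semicolon.
    hex_part = data.split(";", 1)[0]
    code_points = [int(h, 16) for h in hex_part.split()]
    return code_points, comment.strip()


def parse_sort_key(comment):
    """Return the weights of the bracketed sort key, one list per level.

    Returns None if the comment has no bracketed group. Assumption: the
    sort key is the last [...] group in the comment.
    """
    start = comment.rfind("[")
    end = comment.rfind("]")
    if start == -1 or end < start:
        return None
    # Levels are separated by vertical bars; weights are in hex.
    levels = comment[start + 1:end].split("|")
    return [[int(w, 16) for w in level.split()] for level in levels]
```

Applied to the example line above, parse_test_line yields the code points 0385 and 0021, and parse_sort_key yields three non-empty levels followed by an empty one.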

Testing

The files are designed so that each line orders as greater than or equal to the previous one, when using the UCA and the Default Unicode Collation Element Table. A test program can read in each line, compare it to the previous line, and signal an error if the order is not correct. The exact comparison that should be used is as follows:

1. Read the next line.
2. Parse the code point sequence up to the semicolon, and convert it into a Unicode string.
3. Compare that string with the string on the previous line, according to the UCA implementation, with strength = identical level (using S3.10).
4. If the previous string is greater than the current string, then stop with an error.
5. Continue to the next line (step 1).

If there are any errors, then the UCA implementation is not conformant.
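The test procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name first_order_violation is hypothetical, and compare stands for a comparator supplied by the UCA implementation under test. The comparator is assumed to take two strings and return a negative, zero, or positive number, comparing at the identical level.

```python
def first_order_violation(lines, compare):
    """Return the 1-based number of the first line that orders before the
    previous data line, or None if every line is in order.

    `compare(a, b)` is supplied by the caller (hypothetical signature):
    it returns a negative, zero, or positive number, and is expected to
    be the UCA comparison under test, at the identical level.
    """
    previous = None
    for number, line in enumerate(lines, start=1):
        # Drop the comment, then keep the part before the semicolon.
        data = line.split("#", 1)[0].split(";", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        # Convert the hex code points into a string.
        current = "".join(chr(int(h, 16)) for h in data.split())
        if previous is not None and compare(previous, current) > 0:
            return number  # the previous string is greater: an error
        previous = current
    return None
```

A caller would pass the lines of one of the test files together with the comparator of its own implementation. A return value of None means that no ordering error was found.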

These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance.
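For implementations that choose to filter out such lines, the following Python sketch shows one way to do so. The function names contains_surrogate and without_surrogate_lines are hypothetical. The range D800..DFFF is the range of surrogate code points.

```python
def contains_surrogate(line):
    """Return True if the code point sequence on the line includes a
    surrogate code point (D800..DFFF)."""
    # Ignore the comment, then keep the part before the semicolon.
    data = line.split("#", 1)[0].split(";", 1)[0]
    return any(0xD800 <= int(h, 16) <= 0xDFFF for h in data.split())


def without_surrogate_lines(lines):
    """Return the lines whose sequences contain no surrogate code points.

    Blank lines and comment-only lines are kept unchanged.
    """
    return [line for line in lines if not contains_surrogate(line)]
```

Filtering before the comparison loop keeps the ordering check itself unchanged, since the remaining lines are still in the same relative order.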

© 2002–2024 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.
