L2/06-254
Subject: UCA Options
From: Mark Davis
Date: 2006-07-28

In reviewing the proposed "Internet Application Protocol Collation Registry" (http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-12.txt), it became clear that the possible options in UCA are called out in the algorithm and document, but are not clearly organized and named. In particular, they are given attribute names and values in CLDR; those should also be used in UCA. For the CLDR attribute names and values, and what they mean, see the Table "Collation Settings" in http://unicode.org/reports/tr35/#Collation_Elements.

Here is a list of the options, with a fragment of the UCA text where each occurs, and the corresponding CLDR options.
A. Normalization
UCA Text: Conformant implementations may skip this step [S1.1] in certain circumstances
CLDR Attribute: normalization (UCA default = "on")
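As a rough illustration of what this setting toggles, here is a minimal Python sketch. The helper name `prepare_for_collation` and its boolean parameter are hypothetical, not part of UCA or CLDR; only `unicodedata.normalize` is a real library call.

```python
import unicodedata


def prepare_for_collation(text: str, normalization: bool = True) -> str:
    """Return the string that the collation element lookup would see.

    Hypothetical helper: with normalization "on" the input is converted
    to NFD (step S1.1); with it "off" the input is passed through as-is.
    """
    return unicodedata.normalize("NFD", text) if normalization else text


precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
```

With the setting on, both spellings reach the lookup step as the same code point sequence; with it off they differ, which is presumably why skipping the step is only allowed "in certain circumstances".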
B. Contractions
UCA Text: Conformant implementations may skip steps 2.1.1 through 2.1.3 if their repertoire of supported character sequences does not require this level of processing.
CLDR Attribute: none
Comment: This escape clause is old, and should be removed. If people skip steps 2.1.1..2.1.3, they will not recognize contractions properly.
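To make the consequence concrete, the following Python sketch splits a string into the units that would be looked up in the collation element table, with and without the discontiguous-contraction steps. It is a simplification under stated assumptions: the table is an invented toy set of strings, the function name `split_units` is hypothetical, and "blocked" is approximated as "an earlier skipped non-starter has an equal or higher combining class".

```python
import unicodedata


def split_units(text, table, discontiguous=True):
    """Split NFD text into lookup units (toy model of S2.1 to S2.1.3)."""
    chars = list(unicodedata.normalize("NFD", text))
    units = []
    while chars:
        # S2.1: longest contiguous prefix that has an entry in the table.
        n = 1
        for j in range(len(chars), 0, -1):
            if "".join(chars[:j]) in table:
                n = j
                break
        unit = "".join(chars[:n])
        del chars[:n]
        if discontiguous:
            # S2.1.1 to S2.1.3: try to absorb unblocked non-starters.
            k = 0
            max_skipped = 0  # highest combining class skipped so far
            while k < len(chars):
                ccc = unicodedata.combining(chars[k])
                if ccc == 0:
                    break  # a starter ends the run of non-starters
                if max_skipped < ccc and unit + chars[k] in table:
                    unit += chars[k]
                    del chars[k]
                else:
                    max_skipped = max(max_skipped, ccc)
                    k += 1
        units.append(unit)
    return units


# Invented table: "a" + COMBINING RING ABOVE is a contraction.
TOY_TABLE = {"a", "\u0323", "\u030a", "a\u030a"}
# a, COMBINING DOT BELOW (ccc 220), COMBINING RING ABOVE (ccc 230)
SAMPLE = "a\u0323\u030a"
```

For `SAMPLE`, the full procedure finds the contraction across the intervening dot below, while the shortcut with `discontiguous=False` treats the three characters separately and so misses it.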
C. Variable-Weight
UCA Text: S2.3 Process collation elements according to the variable-weight setting, as described in Section 3.2.2, Variable Weighting.
CLDR Attribute: alternate (UCA default = "shifted")
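The effect of the two common values of this attribute can be sketched in Python as follows. This is a simplified model, not the normative text: the `CE` type, the function name `apply_alternate`, and all weights are invented for illustration, and only the "shifted" and "non-ignorable" values are handled.

```python
from typing import List, NamedTuple, Tuple


class CE(NamedTuple):
    """Toy collation element; `variable` marks a variable element."""
    primary: int
    secondary: int
    tertiary: int
    variable: bool = False


def apply_alternate(ces: List[CE], alternate: str = "shifted") -> List[Tuple[int, ...]]:
    if alternate == "non-ignorable":
        # Variable elements keep their weights like any other element.
        return [(ce.primary, ce.secondary, ce.tertiary) for ce in ces]
    if alternate != "shifted":
        raise ValueError("unsupported value in this sketch: " + alternate)
    out = []
    after_variable = False
    for ce in ces:
        if ce.variable:
            # Levels 1 to 3 are zeroed; the old primary moves to level 4.
            out.append((0, 0, 0, ce.primary))
            after_variable = True
        elif ce.primary == 0 and (
            after_variable or (ce.secondary == 0 and ce.tertiary == 0)
        ):
            # Ignorable after a variable, or completely ignorable.
            out.append((0, 0, 0, 0))
        else:
            out.append((ce.primary, ce.secondary, ce.tertiary, 0xFFFF))
            after_variable = False
    return out


# Toy elements for "a", a hyphen (variable), a combining mark, and "b".
SAMPLE = [CE(0x10, 0x20, 2), CE(0x05, 0x20, 2, True), CE(0, 0x21, 2), CE(0x11, 0x20, 2)]
```

With "shifted", the hyphen no longer contributes at levels one through three, so it only matters as a fourth-level tie-breaker.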
D. Backwards
UCA Text: S3.3 If the collation element table is forwards at level L,.... S3.6 Else the collation table is backwards at level L, so....
CLDR Attribute: backwards (UCA default = "off")
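A small Python sketch of sort key formation shows what this option changes. The weights in `TOY_TABLE` are invented, the function name `sort_key` is hypothetical, and the sketch assumes the option applies to the secondary level only, as the CLDR attribute does.

```python
import unicodedata

# Toy collation elements (primary, secondary, tertiary); invented values.
TOY_TABLE = {
    "c": (0x10, 0x20, 0x02),
    "o": (0x11, 0x20, 0x02),
    "t": (0x12, 0x20, 0x02),
    "e": (0x13, 0x20, 0x02),
    "\u0302": (0, 0x21, 0x02),  # COMBINING CIRCUMFLEX ACCENT
    "\u0301": (0, 0x22, 0x02),  # COMBINING ACUTE ACCENT
}


def sort_key(text, backwards=False):
    """Build a three-level sort key; optionally reverse level 2."""
    ces = [TOY_TABLE[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator
        weights = [ce[level] for ce in ces if ce[level] != 0]
        if backwards and level == 1:
            weights.reverse()  # the "else backwards" branch at level 2
        key.extend(weights)
    return tuple(key)


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forwards = sorted(words, key=sort_key)
reversed_secondary = sorted(words, key=lambda w: sort_key(w, backwards=True))
```

With the option off, an accent earlier in the word wins; with it on, an accent later in the word wins, which gives the familiar French ordering cote, côte, coté, côté.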
E. Strength (Level)
UCA Text: An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array.
CLDR Attribute: strength (UCA default = "3")
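The strength setting can be pictured as truncating the sort key after a given level. The Python sketch below assumes toy collation elements as (primary, secondary, tertiary) tuples with invented weights; `key_at_strength` is a hypothetical name.

```python
def key_at_strength(ces, strength=3):
    """Build a sort key that uses only the first `strength` levels."""
    key = []
    for level in range(strength):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in ces if ce[level] != 0)
    return tuple(key)


# Toy elements that differ only at the tertiary level (e.g. case).
lower_a = [(0x10, 0x20, 0x02)]
upper_a = [(0x10, 0x20, 0x08)]
```

At strength 2 the two inputs compare equal; at the default strength 3 the tertiary difference orders them.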
F. Semi-Stable
UCA Text: S3.10 If a semi-stable sort is required, then after all the level weights have been added, append a copy of the NFD version of the original string.
CLDR Attribute/value: strength="identical"
Note: CLDR doesn't allow the semi-stable option except with all weight levels, so having it as a "higher" weight level works. That should also be followed in UCA.
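Treating the semi-stable option as one more level can be sketched in Python as below. The function name `with_identical_level` is hypothetical, and the sample level key is an invented value standing in for the output of the earlier steps.

```python
import unicodedata


def with_identical_level(level_key, original):
    """Append the NFD form of the original string as a final level."""
    nfd = unicodedata.normalize("NFD", original)
    # A zero separator, then the NFD code points as the last "weights".
    return tuple(level_key) + (0,) + tuple(ord(ch) for ch in nfd)


# Assume, for illustration, that these strings share the same level key.
SHARED_KEY = (0x10, 0x11, 0, 0x20, 0x20, 0, 0x02, 0x02)
```

Two strings that tie on every weight level but differ in code points now get distinct keys, while canonically equivalent strings still get the same key because the appended text is normalized first.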
G. Preprocessing
UCA Text: 5.1 Preprocessing....Such preprocessing is outside of the scope of this document.
CLDR Attribute: numeric (UCA default="off")
CLDR Attribute: caseLevel (UCA default="off")
CLDR Attribute: caseFirst (UCA default="off")
CLDR Attribute: hiraganaQuaternary (UCA default="off")
CLDR Attribute: variableTop (UCA default="off")
In UCA we probably just want to point to CLDR for examples of such options, rather than define them as UCA options.
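As an example of the kind of preprocessing meant here, the following Python sketch approximates the effect of numeric="on" by turning runs of ASCII digits into integers before comparison. It is only an illustration: `numeric_preprocess` is a hypothetical name, and the text chunks are compared by code point rather than by real collation weights.

```python
import re


def numeric_preprocess(text):
    """Split text into chunks so digit runs compare by numeric value."""
    parts = []
    # With a capturing group, odd-indexed chunks are the digit runs.
    for index, chunk in enumerate(re.split(r"([0-9]+)", text)):
        if not chunk:
            continue
        if index % 2 == 1:
            parts.append((0, int(chunk), ""))   # number chunk
        else:
            parts.append((1, 0, chunk))         # text chunk
    return tuple(parts)


names = ["item10", "item9", "item2"]
plain = sorted(names)
numeric = sorted(names, key=numeric_preprocess)
```

Plain ordering puts "item10" first because "1" sorts before "2"; the preprocessed ordering follows the numeric values.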
H. Matching
UCA Text: Section 8
Note an inconsistency in terminology: the text sometimes uses the term "Whole Grapheme Clusters Only" and sometimes "Whole Characters Only"; this needs fixing. The discussion should also call out the common matching operations "startsWith" and "endsWith". In particular, it should clarify that these tests return true if and only if there is a maximal match at the start (or end, respectively); choosing a medial or minimal match instead would affect only the additional (optional) positioning information that is returned.
CLDR Attributes: none
Both CLDR and UCA should have something like the following options defined. (I filed a CLDR bug at http://dev.icu-project.org/cgi-bin/locale-bugs?findid=1139.)
- match-boundaries: none, whole-character, whole-word
- match-style: minimal, medial, maximal
Note that these matching options are only relevant to matching: they don't have any effect on equality testing or ordering.
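The point about "startsWith" can be illustrated with a toy Python model of the match-style option. Everything here is an assumption made for the sketch: non-alphanumeric characters stand in for ignorables, case folding stands in for comparison at a low strength, the "medial" style is left out, and the names `MatchStyle` and `starts_with` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple


class MatchStyle(Enum):
    MINIMAL = "minimal"
    MAXIMAL = "maximal"   # "medial" is omitted from this sketch


def is_ignorable(ch: str) -> bool:
    return not ch.isalnum()   # toy stand-in for ignorable characters


def starts_with(text: str, pattern: str,
                style: MatchStyle = MatchStyle.MAXIMAL) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of a match at the start, or None."""
    wanted = [ch.casefold() for ch in pattern if not is_ignorable(ch)]
    if not wanted:
        return None   # this sketch does not define empty patterns
    matched, first, last = 0, -1, -1
    for i, ch in enumerate(text):
        if matched == len(wanted):
            break
        if is_ignorable(ch):
            continue
        if ch.casefold() != wanted[matched]:
            return None
        if first < 0:
            first = i
        last = i
        matched += 1
    if matched < len(wanted):
        return None
    if style is MatchStyle.MINIMAL:
        return (first, last + 1)   # exclude surrounding ignorables
    end = last + 1
    while end < len(text) and is_ignorable(text[end]):
        end += 1                   # include trailing ignorables
    return (0, end)                # include leading ignorables
```

In this model the yes/no answer is the same for both styles; only the returned span differs, which is the positioning information the options are meant to control.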