L2/06-254

Subject: UCA Options

From: Mark Davis

Date: 2006-07-28

In reviewing the proposed "Internet Application Protocol Collation Registry" (http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-12.txt), it became clear that the possible options in UCA are called out in the algorithm and document, but are not clearly organized and named. In particular, they are given attribute names and values in CLDR; those should also be used in UCA. For the CLDR attribute names and values, and what they mean, see the table "Collation Settings" in http://unicode.org/reports/tr35/#Collation_Elements. Here is a list of the options, each with a fragment of the UCA text where it occurs and the corresponding CLDR option.

A. Normalization

UCA Text: Conformant implementations may skip this step [S1.1] in certain circumstances

CLDR Attribute: normalization (UCA default = "on")
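
As a small illustration of what is at stake when step S1.1 is skipped, the following sketch uses Python's standard unicodedata module. Plain code point comparison stands in for a real collator here; that substitution is an assumption made only for illustration.

```python
import unicodedata

# Two canonically equivalent spellings of the same text.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfd(s: str) -> str:
    """Put the string into Normalization Form D (the effect of step S1.1)."""
    return unicodedata.normalize("NFD", s)


# With normalization "off", the two spellings look different to a
# code point based comparison.
raw_equal = precomposed == decomposed

# With normalization "on", both spellings reduce to the same sequence.
normalized_equal = nfd(precomposed) == nfd(decomposed)
```

Skipping the step is only safe when the input is already known to be in a suitable form, which is the kind of circumstance the UCA text alludes to.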

B. Contractions

UCA Text: Conformant implementations may skip steps 2.1.1 through 2.1.3 if their repertoire of supported character sequences does not require this level of processing.

CLDR Attribute: none

Comment: This escape clause is old, and should be removed. If people skip steps 2.1.1..2.1.3, they will not recognize contractions properly.

C. Variable-Weight

UCA Text: S2.3 Process collation elements according to the variable-weight setting, as described in Section 3.2.2, Variable Weighting.

CLDR Attribute: alternate (UCA default = "shifted")
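
The difference between the "shifted" and "non-ignorable" values can be sketched as follows. This is a simplified reading of the variable-weighting rules, not a complete implementation; the CE type, the function name apply_alternate, and all weights are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class CE(NamedTuple):
    """A hypothetical collation element with a 'variable' flag."""
    primary: int
    secondary: int
    tertiary: int
    variable: bool = False


def apply_alternate(ces: List[CE], alternate: str) -> List[Tuple[int, ...]]:
    """Simplified sketch of variable-weight processing for two settings."""
    if alternate == "non-ignorable":
        # Variable elements are treated like any other element.
        return [(ce.primary, ce.secondary, ce.tertiary) for ce in ces]
    if alternate != "shifted":
        raise ValueError("unsupported alternate value: " + alternate)

    out: List[Tuple[int, ...]] = []
    after_variable = False
    for ce in ces:
        if ce.variable:
            # Levels 1-3 are zeroed; the old primary moves to level 4.
            out.append((0, 0, 0, ce.primary))
            after_variable = True
        elif ce.primary == 0:
            if after_variable or (ce.secondary == 0 and ce.tertiary == 0):
                # Ignorable after a variable, or completely ignorable.
                out.append((0, 0, 0, 0))
            else:
                out.append((0, ce.secondary, ce.tertiary, 0xFFFF))
        else:
            out.append((ce.primary, ce.secondary, ce.tertiary, 0xFFFF))
            after_variable = False
    return out


# Hypothetical elements: a variable (e.g. a space), a combining mark, a letter.
sample = [CE(0x0209, 0x20, 0x02, True), CE(0, 0x32, 0x02), CE(0x15EF, 0x20, 0x02)]
shifted = apply_alternate(sample, "shifted")
non_ignorable = apply_alternate(sample, "non-ignorable")
```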

D. Backwards

UCA Text: S3.3 If the collation element table is forwards at level L,.... S3.6 Else the collation table is backwards at level L, so....

CLDR Attribute: backwards (UCA default = "off")
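
A toy sketch of the backwards setting at the secondary level, using the familiar French accent ordering example. The weights are invented for illustration (code points serve as weights), and toy_sort_key is a hypothetical helper, not a UCA-conformant key builder.

```python
import unicodedata
from typing import Tuple


def toy_sort_key(s: str, backwards_secondary: bool = False) -> Tuple[int, ...]:
    """Build a two-level toy sort key; optionally reverse the secondary level."""
    primaries = []
    secondaries = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # Combining mark: primary ignorable, toy secondary weight.
            secondaries.append(ord(ch))
        else:
            # Base letter: toy primary weight, common secondary weight.
            primaries.append(ord(ch))
            secondaries.append(0x20)
    if backwards_secondary:
        # Backwards at this level: append the weights in reverse order.
        secondaries.reverse()
    # 0 serves as the level separator.
    return tuple(primaries) + (0,) + tuple(secondaries)


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forwards = sorted(words, key=toy_sort_key)
backwards = sorted(words, key=lambda w: toy_sort_key(w, backwards_secondary=True))
```

With backwards="off" the first accent difference from the start of the string decides the order; with backwards="on" the last one does, so "côte" sorts before "coté".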

E. Strength (Level)

UCA Text: An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array.

CLDR Attribute: strength (UCA default = "3")
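
The effect of lowering the maximum level can be sketched with a minimal sort key builder. The collation elements below are hypothetical three-level weights chosen for illustration, and form_sort_key is an assumed name; the forwards-only loop follows the general shape of step 3 of the algorithm.

```python
from typing import List, Tuple

Element = Tuple[int, int, int]  # (primary, secondary, tertiary)


def form_sort_key(ces: List[Element], strength: int = 3) -> Tuple[int, ...]:
    """Append the non-zero weights of each level, up to the given strength."""
    key: List[int] = []
    for level in range(strength):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in ces if ce[level] != 0)
    return tuple(key)


# Hypothetical collation elements for "a", "A", and "a" with an acute accent.
a_lower = [(0x15EF, 0x20, 0x02)]
a_upper = [(0x15EF, 0x20, 0x08)]
a_acute = [(0x15EF, 0x20, 0x02), (0, 0x32, 0x02)]

# At strength 1 all three compare equal.
same_at_primary = (
    form_sort_key(a_lower, 1) == form_sort_key(a_upper, 1) == form_sort_key(a_acute, 1)
)
# At strength 2 the accent is distinguished, but case is not.
case_ignored_at_secondary = form_sort_key(a_lower, 2) == form_sort_key(a_upper, 2)
accent_seen_at_secondary = form_sort_key(a_lower, 2) < form_sort_key(a_acute, 2)
# At strength 3 case is distinguished as well.
case_seen_at_tertiary = form_sort_key(a_lower, 3) < form_sort_key(a_upper, 3)
```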

F. Semi-Stable

UCA Text: S3.10 If a semi-stable sort is required, then after all the level weights have been added, append a copy of the NFD version of the original string.

CLDR Attribute/value: strength="identical"

Note: CLDR doesn't allow the semi-stable option except with all weight levels, so having it as a "higher" weight level works. That should also be followed in UCA.
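
Treating the semi-stable option as a "higher" level can be sketched as below. The function name with_identical_level is hypothetical, and inserting a level separator before the appended NFD string is an assumption of this sketch, made so that the identical level behaves like the other levels. The three-level key passed in is assumed to have been computed elsewhere.

```python
import unicodedata
from typing import Sequence, Tuple


def with_identical_level(level_key: Sequence[int], original: str) -> Tuple[int, ...]:
    """Append the NFD form of the original string after all level weights."""
    nfd = unicodedata.normalize("NFD", original)
    # Assumption: a 0 separator precedes the identical level.
    return tuple(level_key) + (0,) + tuple(ord(ch) for ch in nfd)


# Hypothetical three-level key shared by two strings that differ only in a
# character assumed to be completely ignorable (SOFT HYPHEN).
shared_key = (0x15EF, 0x15F0, 0, 0x20, 0x20, 0, 0x02, 0x02)
key_plain = with_identical_level(shared_key, "ab")
key_hyphen = with_identical_level(shared_key, "a\u00adb")

# Canonically equivalent strings still get identical keys, because the
# appended string is normalized first.
key_precomposed = with_identical_level((0x15EF,), "\u00e9")
key_decomposed = with_identical_level((0x15EF,), "e\u0301")
```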

G. Preprocessing

UCA Text: 5.1 Preprocessing....Such preprocessing is outside of the scope of this document.

CLDR Attribute: numeric (UCA default="off")

CLDR Attribute: caseLevel (UCA default="off")

CLDR Attribute: caseFirst (UCA default="off")

CLDR Attribute: hiraganaQuaternary (UCA default="off")

CLDR Attribute: variableTop (UCA default="off")

In UCA we probably just want to point to these CLDR attributes as examples of options, but not make them options of UCA itself.

H. Matching

UCA Text: Section 8

Note a typo: the term "Whole Grapheme Clusters Only" is used in some places and "Whole Characters Only" in others; this needs fixing. Also, the discussion should call out the common matching operations "startsWith" and "endsWith". In particular, it should clarify that these tests return true if and only if there is a maximal match at the start (or at the end, respectively); the choice of medial or minimal matches would only affect additional (optional) returned positioning information.

CLDR Attributes: none

Both CLDR and UCA should have something like the following options defined. (I filed a CLDR bug at http://dev.icu-project.org/cgi-bin/locale-bugs?findid=1139.)

match-boundaries: none, whole-character, whole-word

match-style: minimal, medial, maximal

Note that these matching options are only relevant to matching: they don't have any effect on equality testing or ordering.
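
The point about "startsWith" can be sketched with a toy matcher in which the boolean result does not depend on match-style and only the returned span does. Everything here is an assumption for illustration: the IGNORABLE set, the starts_with signature, and the use of code point equality in place of collation element comparison. The "medial" style is omitted.

```python
from typing import Optional, Tuple

# Hypothetical set of completely ignorable characters.
IGNORABLE = {"\u00ad", "\u200b"}


def starts_with(text: str, prefix: str, style: str = "maximal") -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of a match at the start of text, or None."""
    if style not in ("minimal", "maximal"):
        raise ValueError("unsupported match-style: " + style)
    wanted = [ch for ch in prefix if ch not in IGNORABLE]
    if not wanted:
        return (0, 0)

    i = matched = 0
    first = last = -1
    while i < len(text) and matched < len(wanted):
        ch = text[i]
        if ch in IGNORABLE:
            i += 1
            continue
        if ch != wanted[matched]:
            return None  # first significant difference: no match at start
        if first < 0:
            first = i
        last = i
        matched += 1
        i += 1
    if matched < len(wanted):
        return None

    if style == "minimal":
        # Exclude ignorables at both edges of the match.
        return (first, last + 1)
    # Maximal: include leading and trailing ignorables.
    end = last + 1
    while end < len(text) and text[end] in IGNORABLE:
        end += 1
    return (0, end)


sample = "\u00adab\u00adc"
maximal_span = starts_with(sample, "ab", "maximal")
minimal_span = starts_with(sample, "ab", "minimal")
no_match = starts_with("xab", "ab")
```

In this sketch a caller that only needs a yes/no answer can test whether the result is None, regardless of the style chosen.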