To: Network Working Group
Re: draft-newman-i18n-comparator
Date: 2006-02-21
From: Unicode Technical Committee

The Unicode Technical Committee has reviewed the document http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt. While the UTC is in favor of the goal, there are a number of problems with the document. The main problems are outlined below. Once these are addressed, further review can continue.

Details

> 2.1 Definitions

Content

The document needs to include the definitions of the technical terms used in the document, including all those that may not be familiar to implementers, such as "trichotomous" and "collation identifiers". In particular, the notion of a substring is prima facie quite simple, but there are complications that require a clear definition. The text in the document does not make clear that there may be more than one match for a substring in a string, and that the matches can overlap. It says "the starting offset", for example, when there may be several.

Moreover, language-sensitive matches have additional complications that need to be called out. For more information, see http://www.unicode.org/reports/tr10/#Searching
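
The point about multiple and overlapping matches can be illustrated with a minimal sketch. It uses plain code point matching only (language-sensitive matching adds further cases not shown here), and the function name all_match_offsets is hypothetical, not something taken from the draft.

```python
def all_match_offsets(haystack: str, needle: str) -> list[int]:
    """Return every starting offset at which needle occurs, overlaps included."""
    if not needle:
        return []  # Assumption: an empty needle yields no matches in this sketch.
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one code point, not by len(needle), so overlaps are found.
        start = haystack.find(needle, start + 1)
    return offsets


print(all_match_offsets("banana", "ana"))  # [1, 3]
```

Here "ana" matches at offsets 1 and 3, and the two matches share a character, so "the starting offset" is not well defined without a further rule.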

Format

If there is a "Definitions" section, readers have a reasonable expectation that that section should contain all the required definitions. However, a number of definitions are scattered within the text. One of two approaches should be taken:

  1. Move all the definitions into this section.
  2. Remove the definitions section, but clearly call out in the text the definition of each term on its own line.

Mixing these two styles is needlessly confusing for readers.

> 2.4 Sort Keys

The use of the term "collation canonicalization" to refer to sort keys is very misleading. The term "canonicalization" implies that the results are still text in some fashion, whereas a sort key is simply a string of octets generated from a given string by a specific comparator, whereby the binary comparison (ordering) of two sort keys is guaranteed to match *that* comparator's compare function for the original strings. The octets may have no readily discernible relation to the original text. For example, the ICU sort keys generated for the following strings are:

  cote     2c 44 4e 30 01 08 01 08 00
  côté     2c 44 4e 30 01 85 93 85 8d 01 0a 00
  Αραβικά  5c 20 52 20 22 36 3a 20 01 80 8d 01 8f 0b 00

See http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=el&d_=en&x=col for other examples.
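
The contract described above (binary comparison of two sort keys agrees with the comparator's own compare function) can be sketched with a deliberately trivial comparator. This is not ICU and does not reproduce the octets shown above; the names compare, sort_key, and keys_agree are hypothetical, and the toy comparator is a simple caseless code point comparison chosen only for illustration.

```python
from itertools import product


def compare(a: str, b: str) -> int:
    """Toy caseless comparator: -1, 0, or 1 by case-folded code point order."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


def sort_key(s: str) -> bytes:
    """Sort key for the toy comparator: octets meant only for binary comparison."""
    return s.casefold().encode("utf-8")


def keys_agree(strings: list[str]) -> bool:
    """Check that binary key order matches compare() for every pair of strings."""
    for a, b in product(strings, repeat=2):
        ka, kb = sort_key(a), sort_key(b)
        if compare(a, b) != (ka > kb) - (ka < kb):
            return False
    return True


print(keys_agree(["cote", "Cote", "côté", "Αραβικά"]))  # True
```

The keys are only meaningful relative to the comparator that produced them; keys from two different comparators cannot be compared with each other.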

> 3.2

This specifies that clients that support disconnected operation should not use wildcards while clients that provide collation operations only when connected to the server may use wildcards.

It appears that these restrictions may not really be needed, in which case they should be deleted from the draft. Otherwise, it would be really helpful if the rationale behind the restrictions were provided in the draft.

The EBNF syntax shown in section 3.2 says that collation-wild must not exceed 255 characters in total, while section 3.1 specifies that the collation name must not exceed 254 characters.

Having the same maximum length for both the collation name and the wildcard string would be desirable for actual implementations.

> 4.2.1 Equality

It needs to be made clear that the return values are not physically the strings "match", etc., but enumerated values such as equal and not_equal. The document could describe a notation used for them, such as single quotes, since italic is not available in RFCs. Similarly, the results of the ordering function should be specified as an enumeration with three values: less, equal, greater. The mapping of actual API return values in implementations to these enumerated values can be outside the scope of this document. For example, the mapping might take -1 onto less in one implementation, or anything negative onto less in another implementation.
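
As a minimal sketch of the distinction between enumerated results and API return values, the following maps a C-style integer result onto the three enumerated values. The class name Ordering and the function from_int are hypothetical, and the "anything negative is less" convention is just one of the possible mappings mentioned above.

```python
from enum import Enum


class Ordering(Enum):
    """Abstract results of an ordering function, independent of any API."""
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


def from_int(result: int) -> Ordering:
    """Map an integer result (any negative, zero, any positive) to the enum."""
    if result < 0:
        return Ordering.LESS
    if result > 0:
        return Ordering.GREATER
    return Ordering.EQUAL


print(from_int(-42), from_int(0), from_int(1))
```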

One extremely important point is that for a given comparator, the equality function must be synchronized with the ordering function. That is, it must return 'equal' if and only if the ordering function returns 'equal'. Otherwise any coordinated usage of the functions will fail. This also implies that either 'error' is allowed for both functions or for neither.
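
One way to guarantee this synchronization, sketched here under the assumption of a toy caseless comparator, is to define the equality function in terms of the ordering function. The function names are hypothetical and the single-quoted result values follow the notation suggested above.

```python
def ordering(a: str, b: str) -> str:
    """Toy caseless ordering function returning 'less', 'equal', or 'greater'."""
    fa, fb = a.casefold(), b.casefold()
    if fa < fb:
        return "less"
    if fa > fb:
        return "greater"
    return "equal"


def equality(a: str, b: str) -> str:
    """Equality defined through ordering, so the two cannot disagree."""
    return "equal" if ordering(a, b) == "equal" else "not_equal"


print(equality("Mark", "mark"), ordering("Mark", "mark"))  # equal equal
```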

The term 'error' is also problematic, since what is really at issue is a question of domain. For all those strings in the domain, either 'equal' or 'not_equal' should be returned from the equality function. For any string not in the domain, 'undefined' should be returned. That avoids coherency problems. Then the requirements are clear:

There is a typo on the fourth line of the second paragraph of section 4.2: "... For example, an collation" should be changed to "... For example, a collation".

> 4.2.2 Substring

Prefix and suffix matching are not fully spelled out. The operations and their results must be clarified. And as noted before, it is very important to precisely define the substring operations, especially the starting offset and ending offset. It also must be clarified whether what is being asked for is the first possible matching location in the string, the last, or the nth one.

> 4.3.3 Ordering

> It MUST be transitive and trichotomous.

As above, these should be defined. The exposition in this section would be simpler if you also defined "reversible", whereby f(a,b) = less iff f(b,a) = greater. Then the statement would be:

It MUST be transitive, trichotomous, and reversible.
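
The three properties can be checked by brute force over a finite sample, as in the sketch below. Since the draft does not define "trichotomous", the reading used here (every pair of strings yields exactly one of the three values) is an assumption, as are the names check_properties and caseless_order. "Reversible" follows the definition given above.

```python
from itertools import product


def check_properties(f, sample):
    """Brute-force three properties of ordering function f over a finite sample."""
    pairs = list(product(sample, repeat=2))
    # Assumed reading of "trichotomous": every pair gets one of three values.
    trichotomous = all(f(a, b) in ("less", "equal", "greater") for a, b in pairs)
    # "Reversible" as defined above: f(a,b) = less iff f(b,a) = greater.
    reversible = all((f(a, b) == "less") == (f(b, a) == "greater") for a, b in pairs)
    # Transitivity is checked for chains of 'less' and chains of 'equal'.
    transitive = all(
        f(a, c) == rel
        for rel in ("less", "equal")
        for a, b, c in product(sample, repeat=3)
        if f(a, b) == rel and f(b, c) == rel
    )
    return {"transitive": transitive, "trichotomous": trichotomous,
            "reversible": reversible}


def caseless_order(a, b):
    """Toy ordering function: case-folded code point order."""
    fa, fb = a.casefold(), b.casefold()
    return "less" if fa < fb else "greater" if fa > fb else "equal"


print(check_properties(caseless_order, ["Mark", "mark", "alpha", "Zed", ""]))
```

A check over a sample can only refute a property, not prove it for all strings, but it is a cheap test for a registered comparator.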

>When the collation is used with a
   "-" prefix, the result of the ordering function of the collation MUST
   be reversed.

=> When the collation is used with a
   "-" prefix, the result of the ordering function of the collation when applied to two strings A and B MUST
   be the same as the result with a "+" prefix applied to B and A.
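
Under the proposed wording, the "-" form is simply the "+" form with its operands swapped. A minimal sketch, with hypothetical names and a toy code point ordering standing in for a real collation:

```python
def plus_order(a: str, b: str) -> str:
    """Toy '+' ordering function: plain code point order."""
    return "less" if a < b else "greater" if a > b else "equal"


def reversed_ordering(f):
    """Return g with g(a, b) == f(b, a), matching the proposed '-' wording."""
    def g(a, b):
        return f(b, a)
    return g


minus_order = reversed_ordering(plus_order)
print(plus_order("a", "b"), minus_order("a", "b"))  # less greater
```

Defining the reversal by swapping operands, rather than by rewriting result values, means any additional result value is passed through unchanged.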

An 'undefined' value can be allowed if, as per equality above, it means that at least one of the operands is outside of the domain. The function then imposes a total order on all strings in the domain; moreover, a wrapper can easily convert the function to a total order over all strings by putting all items outside the domain either below or above the ones in the domain -- or even excluding them, at its choice.
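
The wrapper mentioned above might look like the following sketch. The toy ordering function treats ASCII-only strings as its domain purely for illustration; the names ascii_order and total_order, the outside_first flag, and the choice to order out-of-domain strings among themselves by code point are all assumptions.

```python
from functools import cmp_to_key


def ascii_order(a: str, b: str) -> str:
    """Toy ordering whose domain is ASCII-only strings; otherwise 'undefined'."""
    if not (a.isascii() and b.isascii()):
        return "undefined"
    return "less" if a < b else "greater" if a > b else "equal"


def total_order(f, in_domain, outside_first=False):
    """Wrap f into an integer comparator that is total over all strings."""
    sign = {"less": -1, "equal": 0, "greater": 1}

    def cmp(a, b):
        da, db = in_domain(a), in_domain(b)
        if da and db:
            return sign[f(a, b)]
        if da != db:
            # Exactly one operand is outside the domain: group it below or above.
            a_first = (not da) if outside_first else da
            return -1 if a_first else 1
        # Both outside the domain: fall back to code point order (arbitrary).
        return (a > b) - (a < b)

    return cmp


words = ["pear", "née", "apple", "żuk"]
key = cmp_to_key(total_order(ascii_order, str.isascii))
print(sorted(words, key=key))  # ['apple', 'pear', 'née', 'żuk']
```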

 > In general, collations SHOULD NOT return "0" unless the two strings are identical.

=> The ordering function MUST return 'equal' if and only if the equality function returns 'equal'

[Note: it is very important to avoid the confusion between "identical" and "equal". According to a caseless compare, "Mark" and "mark" are equal; however, the strings are not identical.]

[Either 'ordering function' or 'comparison function' should be used consistently, not sometimes 'collations'].

> 4.3.  Internal Canonicalization Algorithm

This section is difficult to understand. It appears that the goal is that any registration must specify sufficient detail, both data and algorithm, so as to enable someone to reproduce the results. But it is not at all clear that that is the goal. And that would make the registration require, in some cases, a huge accompanying document. To duplicate the results of CLDR collators, for example, would require the UCA specification, plus the LDML specification, plus all the relevant data in the CLDR repository.

> 4.4.  Use of Lookup Tables

It is not at all clear what is meant by "customizable lookup tables".

> 4.5.  Multi-Value Attributes

This is very unclear. It describes attributes as applying only to equality, since it refers only to "match" vs "no-match" (and omits "error").

This is a very important feature that needs to be spelled out in detail, and clearly reflected in the template for registration. In particular, the template should have provision for multiple attributes, with the ability to specify the acceptable operands for that attribute. (See below). The specification of the operands could be either a list of values, or a regular expression (with the former preferred). Suggested regular expression syntax would be Perl or XML Schema.
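
To make the suggestion concrete, a registration entry could carry its attributes in a form like the sketch below, where each attribute lists its acceptable operands either as explicit values or as a regular expression. Everything here is hypothetical: the entry name, the attribute names and values, the pattern, and the dictionary layout are illustrative and are not taken from the draft or from any existing registry.

```python
import re

# Hypothetical registration entry; names and values are illustrative only.
REGISTRATION = {
    "name": "example-collation",
    "attributes": {
        # Preferred form: an explicit list of acceptable operands.
        "strength": {"values": ["primary", "secondary", "tertiary"]},
        # Alternative form: a regular expression over the operand.
        "variable-top": {"pattern": r"[0-9a-f]{4,6}"},
    },
}


def is_valid(entry: dict, attribute: str, operand: str) -> bool:
    """Check an attribute operand against the registered specification."""
    spec = entry["attributes"].get(attribute)
    if spec is None:
        return False  # Attribute is not registered for this collation.
    if "values" in spec:
        return operand in spec["values"]
    return re.fullmatch(spec["pattern"], operand) is not None


print(is_valid(REGISTRATION, "strength", "secondary"))  # True
print(is_valid(REGISTRATION, "variable-top", "zz"))     # False
```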

> 5.1 Character Encoding

   The protocol specification has to make sure that it is clear on which
   characters (rather than just octets) the collations are used.  This
   can be done by specifying the protocol itself in terms of characters
   (e.g. in the case of a query language), by specifying a single
   character encoding for the protocol (e.g.  UTF-8 [3]), or by
   carefully describing the relevant issues of character encoding
   labeling and conversion.  In the later case, details to consider
   include how to handle unknown charsets, any charsets which are
   mandatory-to-implement, any issues with byte-order that might apply,
   and any transfer encodings which need to be supported.

If a collation is able to advertise itself as being able to handle, say, SJIS and UTF-8, then there should be a required description of a protocol for indicating that, for communicating which encodings are handled, and for specifying how error conditions are handled (such as a charset outside of those it can handle). Otherwise, it is difficult to understand how this paragraph would be applied in practice.

> 5.3

Section 5.3 specifies:

The protocol MUST specify how comparisons behave in the absence of explicit collation negotiation or when a collation of "*" is requested. The protocol MAY specify that the default collation used in such circumstances is sensitive to server configuration.

and section 3.2 specifies:

... If the wildcard string matches multiple collations, the server SHOULD select the collation with the broadest scope (preferably international scope), the most recent table versions and the greatest number of supported operations. A single wildcard character ("*") refers to the application protocol collation behavior that would occur if no explicit negotiation were used.

These appear inconsistent.

> 7.5.  Example Initial Registry Summary

The sample registry would suffer a combinatorial explosion if parameters are not handled differently. For example, with CLDR collations, there can be hundreds of locales, six different strength settings; four different case-first settings; three different alternate settings, backwards settings, normalization settings, case level settings, hiragana settings, and numeric settings; plus a variable-top setting which has a string as an operand. Registering the combinations that people are allowed to use would be untenable.

http://www.unicode.org/draft/reports/tr35/tr35.html#Setting_Options

Instead, as remarked above, the allowable attribute values need to be associated with the registered name in a machine-readable form.
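
To put a rough number on the combinatorial explosion described in this section, the setting counts can simply be multiplied. The sketch assumes that "three different" applies to each of the six settings listed after it, uses 100 as a stand-in for "hundreds of locales", and ignores variable-top entirely since its operand is an arbitrary string.

```python
from math import prod

# Setting counts as read from the text; applying "three" to each of the last
# six settings is an assumption about how the sentence should be read.
SETTING_COUNTS = {
    "strength": 6,
    "case-first": 4,
    "alternate": 3,
    "backwards": 3,
    "normalization": 3,
    "case-level": 3,
    "hiragana": 3,
    "numeric": 3,
}
LOCALES = 100  # Assumption: a round stand-in for "hundreds of locales".

per_locale = prod(SETTING_COUNTS.values())
print(per_locale)            # 17496
print(per_locale * LOCALES)  # 1749600
```

Even under these conservative assumptions the registry would need well over a million entries.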

> 11.  Security Considerations

This is insufficient. It should at least point to the problems described in UCA and in http://www.unicode.org/reports/tr36/tr36-4.html (note that that document has been approved by the UTC and will be posted as an approved version soon).

General

One of the real problems with the IANA character registry is that the entries are underspecified. It quite often occurs that two vendors implement the same IANA charset conversion in different ways, leading to significant interoperability problems and text corruption. See, for example, http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.

We have the real concern that this registry could lead down the same path.

> collation, it has to say so

There are places where the text should be clarified, as to whether a MUST or SHOULD is implied; this is just an example.

> "comparator" vs "collator"

Either one term or the other should be used consistently.

> Unicode 3.2

Unicode 3.2 is obsolete; the reference versions for the Collation Registry should be Unicode 5.0 and UCA 5.0, since those will be approved and published by the time the Internet Application Protocol Collation Registry has completed its review and been approved.

Because of the use of NamePrep, it is probably the case that Unicode 3.2 also needs to be included, but strongly recommended for usage only by protocols or systems dependent on NamePrep. Note that as of UCA 4.0 and beyond, the version number of UCA is guaranteed to be identical with the version number of Unicode that it is defined for.

> Versioning

This is tricky, and should be clarified. In many instances, it is sufficient to use an unversioned collator, such as simply "UCA". In other cases, there are requirements to use a specific version, or a version of at least X. This needs to be described.
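
The three cases (unversioned, exact version, at least version X) could be expressed along the lines of the following sketch. The requirement syntax ("" for any version, "=4.1" for exact, ">=4.0" for at least) and the function name satisfies are hypothetical and are offered only to show the distinctions that need describing.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted version such as '4.1' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def satisfies(available: str, requirement: str) -> bool:
    """Check an available collator version against a hypothetical requirement."""
    have = parse_version(available)
    if requirement == "":
        return True  # Unversioned request: any version is acceptable.
    if requirement.startswith(">="):
        return have >= parse_version(requirement[2:])
    if requirement.startswith("="):
        return have == parse_version(requirement[1:])
    raise ValueError(f"unrecognized requirement: {requirement!r}")


print(satisfies("5.0", ""))       # True
print(satisfies("5.0", ">=4.0"))  # True
print(satisfies("5.0", "=3.2"))   # False
```

Versions are compared as integer tuples rather than as strings, so that a two-digit component such as "10.0" would sort after "9.0".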