Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 18:01:17 EDT

Next message: Michael Everson: "RE: Decimal separator with more than one character?"

Previous message: John Cowan: "Re: Computing default UCA collation tables"
Maybe in reply to: Philippe Verdy: "Computing default UCA collation tables"
Next in thread: Philippe Verdy: "Re: Computing default UCA collation tables"
Reply: Philippe Verdy: "Re: Computing default UCA collation tables"

Philippe Verdy asked:

> I still wonder why some entries are commented out in the file
> and substituted by another definition on the next line

The reason is that the principal reviewers (in the UTC for
the Unicode Collation Algorithm, and in SC22/WG20 for ISO/IEC 14651)
require certain orderings in the default table for particular
instances. The requirements for the final ordering drive back
into requirements that certain characters be "marked up" in
the input file, so that the automatic weight generation by
sifter will produce the results expected. Let me give you
another example:

<quote>
0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413
0413;CYRILLIC CAPITAL LETTER GHE;Lu;;;;;0433;

# Add a user-defined diacritic, to force treatment as a variant of ghe
0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490
0490;CYRILLIC CAPITAL LETTER GHE WITH UPTURN;Lu;<sort> 0413 F8F1;;;;0491;
</quote>

The requirement that U+0491 CYRILLIC SMALL LETTER GHE WITH UPTURN
sort as a secondary variant of U+0433 CYRILLIC SMALL LETTER GHE
was established by Russian sorting conventions and then was formally
conveyed to SC22/WG20 as a requirement for the Common Template
Table of ISO/IEC 14651. To accommodate that formal requirement,
the input for the sifter has to be marked up with an ad hoc
"decomposition" for U+0491, treating the "upturn" as if it
were a diacritic, even though no such diacritic is encoded
in the Unicode Standard as a separate combining mark, and even
though U+0491 has no decomposition in UnicodeData.txt.
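
As an illustration of what "secondary variant" means for the resulting
order, here is a minimal Python sketch. The weight values, the WEIGHTS
table, and the sort_key function are all hypothetical and are not taken
from the sifter or from allkeys.txt; the sketch only assumes the usual
multilevel idea that all primary weights are compared before any
secondary weights.

```python
# Hypothetical (primary, secondary) weights; NOT the real DUCET values.
# GHE WITH UPTURN is treated as GHE plus a phantom "upturn" diacritic,
# so it shares GHE's primary weight and differs only at the second level.
GHE = (0x1000, 0x20)
UPTURN = (0x0000, 0x85)  # primary weight 0: ignorable at the first level
DE = (0x1001, 0x20)

WEIGHTS = {
    "\u0433": [GHE],          # CYRILLIC SMALL LETTER GHE
    "\u0491": [GHE, UPTURN],  # CYRILLIC SMALL LETTER GHE WITH UPTURN
    "\u0434": [DE],           # CYRILLIC SMALL LETTER DE
}

def sort_key(text):
    """Build a two-level key: non-zero primaries, then non-zero secondaries."""
    elements = [ce for ch in text for ce in WEIGHTS[ch]]
    primaries = tuple(p for p, _ in elements if p)
    secondaries = tuple(s for _, s in elements if s)
    return (primaries, secondaries)

words = ["\u0491\u0434", "\u0434", "\u0433\u0434", "\u0433"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

With these made-up weights, the string beginning with U+0491 sorts
immediately after the otherwise identical string beginning with U+0433,
and before anything that differs at the primary level.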

The "<sort>" tag for these decompositions is an ad hoc addition,
understood only by the sifter program, to accomplish the weighting
as desired. The use of a user-defined character, U+F8F1, to
represent the "phantom" diacritic, is also an internal
convention for the sifter.
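
The marked-up lines quoted above can be read with a few lines of Python.
This is only a sketch of how such a line might be split into fields,
assuming the semicolon-separated layout visible in the quote (code point,
name, general category, decomposition); parse_line and is_private_use are
hypothetical helper names, and nothing here reflects how the sifter
itself is implemented.

```python
def parse_line(line):
    """Split one semicolon-separated entry and decode its decomposition field."""
    fields = line.strip().split(";")
    decomp = fields[3]
    tag, mapping = None, []
    if decomp:
        parts = decomp.split()
        # An optional leading tag such as <sort> precedes the code points.
        if parts[0].startswith("<") and parts[0].endswith(">"):
            tag = parts[0][1:-1]
            parts = parts[1:]
        mapping = [int(p, 16) for p in parts]
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "tag": tag,
        "mapping": mapping,
    }

def is_private_use(code_point):
    """True for code points in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= code_point <= 0xF8FF

entry = parse_line(
    "0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490"
)
plain = parse_line("0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413")
print(entry["tag"], [f"{cp:04X}" for cp in entry["mapping"]])
```

For the U+0491 entry this yields the tag "sort" and the mapping
0433 F8F1, where F8F1 falls in the Private Use Area; the unmarked U+0433
entry has no tag and an empty mapping.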

> (does "sifter" have known limitations that are manually edited
> after generation from the undescribed source file containing
> those simple hidden decompositions?).

No.

>
> The quoted format you describe by a small fragment closely
> resembles the one in the UCD. Maybe you don't want to disclose
> it because it uses some internal decomposition to private characters,

Correct, as shown above. It also has a number of other private
conventions, such as markup to force weighting of certain
combinations as contractions, and so on.

> but why then I did not find any reference to the other
> ISO/IEC 14651 standard you refer to in the description of DUCET ?
> Maybe Unicode does not define this default UCA collation order
> itself, and does not have the authorization from ISO/IEC to
> reproduce their ongoing work. This may explain why Unicode.org
> only publishes a derived file...

Dream on. The table for ISO/IEC 14651 is produced by the *same*
program, the sifter, and is provided by me directly *to* the editor of
ISO/IEC 14651, for incorporation in that standard.

Both the UTC and SC22/WG20 provide requirements for and feedback
on the content of the default tables for the two, coordinated
standards. Those two committees both have their say on details
of the default ordering, and then, through the input file,
its markup, and the sifter, I generate the two tables to their
specification. That process, by the way, is acknowledged
and agreed to by both committees, as a way to guarantee
that the default weighting for both
tables is synchronized, even though they use entirely different
formats to express the weighting.

>
> Well I must admit that we can postprocess the "allkeys.txt",
> but this is hardly possible without "reverse engineering" it,
> i.e. analyzing how it is structured. I hope that you are not
> saying that such "reverse engineering" of the table is not legal

Nope. Feel free to reverse engineer away to your heart's content.

Just don't expect to be congratulated for "discovering" things
about the table that are well-known to the maintainers of the
two standards and which are reflected in the input to the sifter
and in the weighting algorithms used by the sifter itself.
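
For anyone who does want to postprocess allkeys.txt, a first step is
simply reading its entries. The sketch below assumes the published line
layout (hex code points, a semicolon, one or more bracketed collation
elements that start with "." or "*" for variable elements, then a "#"
comment). The function name parse_allkeys_line is hypothetical, and the
weights in the sample line are invented for illustration, not copied
from the real table.

```python
import re

# One collation element: "[", a "." or "*" marker, dot-separated hex weights, "]".
CE_PATTERN = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")

def parse_allkeys_line(line):
    """Return (code_points, collation_elements), or None for non-entry lines."""
    line = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not line or line.startswith("@"):   # blank lines and @version-style headers
        return None
    chars, _, elements = line.partition(";")
    code_points = tuple(int(cp, 16) for cp in chars.split())
    ces = []
    for marker, weights in CE_PATTERN.findall(elements):
        is_variable = marker == "*"
        ces.append((is_variable, tuple(int(w, 16) for w in weights.split("."))))
    return code_points, ces

# Sample line with made-up weights, shaped like an expansion into two elements.
sample = ("0491 ; [.0FD0.0020.0002.0433][.0000.0135.0002.0491] "
          "# CYRILLIC SMALL LETTER GHE WITH UPTURN")
parsed = parse_allkeys_line(sample)
print(parsed)
```

Entries whose first collation element shares a primary weight with
another character, followed by an element with a zero primary weight,
are the visible trace of the kind of markup described earlier in this
message.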

> (because of an "implicit" restriction by ISO/IEC 14651 which
> was not referred to in the UCA document), because the UCA reference
> has many given hints to allow implementors to produce a
> compressed form of the table published by Unicode according
> to its royalty-free usage terms.
>
> My intent was not to formulate criticisms about the UCA
> algorithm itself, but about the way TR10 describes the DUCET
> table (possibly because its wording is ambiguous and does not
> seem to specify clearly that DUCET should be normative, given
> that the whole text of UCA clearly speaks about a more general
> algorithm, with variable weights that can be easily changed in
> many places, and provides a lot of tuning parameters for
> implementations as well as for language-specific tailoring,

Correct. And the intent is that implementers are free to tailor
the table to get the results they need for particular languages.
And they can also implement the various shortcuts and tricks
indicated, to keep the generated keys more compact, and so on.

> confirmed also in the fact that the text of TR10 does not match
> with the DUCET table content).

Some of which doesn't matter at all. But I agree that you
turned up a confusing mismatch in Section 7.3, where the collation
element values should be updated. That should be corrected in
the v10 Proposed Update to the UCA.

>
> So I really read the description of DUCET as ONE possible
> implementation of the UCA algorithm, and not as THE reference.

Perhaps the language of UCA needs to be updated as well, to make
that clearer.

> Also I have read posts in this newsgroup about a candidate
> v10 for an updated version of the existing UCA TR10 v9
> reference document. I thought that after posting this revision,
> you expected comments about it,

We do...

> and that's why I about it, i.e.
> the way I had read it.

but instead of sending long, rambling analyses to the unicode
list, including many mistaken assertions of fact about the
standard, you can contact the authors of the document directly
(our email addresses are in the header of the document) for
clarifications of intent about the document, and then
provide (succinct) feedback through the Unicode reporting form:

http://www.unicode.org/reporting.html

noting your feedback as a "Technical Report or Tech Note issue",
so that the feedback can be properly archived, routed, and
attended to by the UTC and the authors of the document.

--Ken


This archive was generated by hypermail 2.1.5 : Tue May 20 2003 - 18:53:04 EDT