From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 16:11:04 EDT
Following up on Philippe Verdy's responses to Mark Davis:
> That's why I wondered how the "allkeys.txt" was really produced,
As Mark Davis indicated, it is generated by a program (which I
maintain) called "sifter". It *is* automatically generated
and not manually edited.
> because it uses some weights ordering that is not documented in
> the UCA collation rules specification
*None* of the weights in the allkeys.txt table are documented
in the UCA collation rules specification per se. All of the
weights for the Default Unicode Collation Element Table are
defined in allkeys.txt, which the UCA specification incorporates
by reference. Specific *examples* cited in the text of UCA
to illustrate the use of the weights are exemplary only -- not
normative in value -- and the authors of UTS #10, UCA (that's
Mark Davis and myself) deliberately don't try to update all
those example numbers each time allkeys.txt is updated, since
there is no point, and since such manual editing probably would
introduce errors and inconsistencies into the examples.
> (the only thing that is
> normative, the "allkeys.txt" being just informative and a
> correct implementation of the specified rules).
As Mark indicated, this statement is just flat wrong. A claim
of conformance to UTS #10 includes a requirement to abide
by conformance clause C4. That requires specifying a particular
version of the UTS, and that, in turn, via Clause 3.2, points
to a particular, associated version of allkeys.txt. And that
table, in turn, is required to meet the requirements of
conformance clause C1.
So while the UCA allows any tailoring you desire to meet particular
language requirements, it is still quite clear that the data
table itself is a normative part of the standard.
Philippe also seems to have missed the fact that the allkeys.txt
table is maintained in conjunction with and in synchrony with the
Common Template Table of the ISO international string ordering
standard, ISO/IEC 14651. That table is also generated by the
"sifter" program, and the CTT is clearly labeled normative in
ISO/IEC 14651.
> Yes my message was long, but I wanted to show the many points
> coming from the analysis of the "allkeys.txt" proposed as an
> informative reference,
It is not an "informative reference", but a normative part of
the standard.
> and wondered how to simply create a conforming collation,
> without importing the full text file (which is not only very
> large for an actual implementation,
One can preprocess it, as Mark indicated. But one cannot ignore
it and be compliant with the standard.
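
As an illustration of such preprocessing, here is a minimal Python
sketch that reads one allkeys.txt data line into code points and
collation elements. It assumes the line layout
"code points ; [.w1.w2.w3.w4] # comment", with "*" in place of the
leading "." marking a variable element. The function name
parse_allkeys_line and the sample weights are hypothetical and are
not taken from any particular version of the table.

```python
import re

# One collation element, e.g. "[.0A15.0020.0008.0041]" or "[*0209.0020.0002.0020]".
# "*" marks a variable element, "." a non-variable one (assumed layout).
ELEMENT_RE = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_allkeys_line(line):
    """Return (code_points, elements) for a data line, or None otherwise.

    elements is a list of (is_variable, weights) pairs, where weights is a
    tuple of integers, one per level present in the line.
    """
    line = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not line or line.startswith("@"):   # blank line or "@..." directive
        return None
    chars, _, rest = line.partition(";")
    code_points = tuple(int(cp, 16) for cp in chars.split())
    elements = [
        (marker == "*", tuple(int(w, 16) for w in weights.split(".")))
        for marker, weights in ELEMENT_RE.findall(rest)
    ]
    return code_points, elements


# Illustrative line only; the weights are made up for this sketch.
sample = "0041 ; [.0A15.0020.0008.0041] # LATIN CAPITAL LETTER A"
parsed = parse_allkeys_line(sample)
```

The parsed entries can then be stored in whatever structure an
implementation prefers (a dict, a trie, a packed binary table); the
weights themselves still come from the normative table.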
> but also incomplete face to Unicode 4,
This is known and being addressed for the next revision.
> and implements some custom tailorings that are NOT described
> in the UCA reference,
This reflects a fundamental misunderstanding of the role of the
allkeys.txt table, and is also simply wrong.
> still incomplete and probably contains a few incoherencies,
Incomplete, yes. But as Mark indicated, the file is regularly
tested for a large number of consistency issues, each time it
is updated.
> proving the fact that this file was edited manually,
It was not. The "proof" is fallacious.
What Philippe has demonstrated is a fact well known to those
who develop, maintain, or review the UCA standard and allkeys.txt:
the primary order definition and a number of other required
quirks in ordering depend on a specific input data file, and
cannot be derived automatically from UnicodeData.txt or any
of the other UCD data files.
If Philippe had inquired about its derivation, instead of
trumpeting his discoveries from reverse engineering, it
might have been possible to short-circuit a lot of the
FUD involved in the questions he has raised.
> and may contain errors or other omissions).
This is certainly possible.
>
> However, analyzing how the table was produced allows to
> create a simpler "meta"-description of its content, where this
> file could be generated from a much simpler file (or set of files),
Ta da! It *is* generated from a simpler set of files.
> so that such large table could be more easily maintained
> (even if there are some manual tailoring for specific scripts,
Tailoring is a process of changing *from* the default table.
It does not describe the definition of the primary (and other
particular) orders that go into the generation of the default
table itself.
> So despite I think that this table MAY be useful for some
> applications, I still think that it is not usable in the
> way it is presented.
Demonstrably false, since it *is* used as presented, by ICU
and by other implementers of UCA.
>
> Also my previous message clearly demonstrated that this
> collation table uses some sort of "collation decomposition"
> which includes some collation elements that can be thought
> as "variants" or "letter modifiers" for which there is no
> corresponding encoding in the normative UCD with an
> associated normative NFD or NFKD decomposition.
Again, this demonstration was a "discovery" of things that
are well known about the input file used for generating
allkeys.txt. Here's an example piece of the input data:
<quote>
# To make the spacing ypogegrammeni work best, it should be
# equated to the regular iota, rather than to the combining
# mark.
037A;GREEK YPOGEGRAMMENI;Lm;<sort> 03B9;;;;;
# 037A;GREEK YPOGEGRAMMENI;Lm;<compat> 0020 0345;;;;;
</quote>
Such modifications of compatibility decompositions (or the
addition of decompositions for which none exist in
UnicodeData.txt) are a required and reviewed part of creating
the input which the sifter then manipulates to generate
allkeys.txt (and the CTT table for ISO/IEC 14651).
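
To make the shape of that input concrete, here is a small Python
sketch that reads a line in the format quoted above. It is based only
on the two quoted lines: the field order (code point, name, general
category, decomposition) and the handling of the "<sort>" tag are
assumptions drawn from that excerpt, and parse_input_line is a
hypothetical helper, not part of the sifter.

```python
def parse_input_line(line):
    """Parse one semicolon-separated input line; return None for comments.

    Assumed field order, inferred from the quoted excerpt:
    code point ; name ; general category ; decomposition ; ...
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None                      # blank or commented-out entry
    fields = stripped.split(";")
    tag, targets = None, []
    for token in fields[3].split():
        if token.startswith("<") and token.endswith(">"):
            tag = token[1:-1]            # e.g. "sort" or "compat"
        else:
            targets.append(int(token, 16))
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "tag": tag,
        "targets": targets,
    }


entry = parse_input_line("037A;GREEK YPOGEGRAMMENI;Lm;<sort> 03B9;;;;;")
# entry["tag"] == "sort" and entry["targets"] == [0x03B9]
```

In the excerpt, the active line maps U+037A to U+03B9 for sorting,
while the commented-out line preserves the original compatibility
decomposition (0020 0345) that the override replaces.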
>
> The current presentation of this table (with 4 collation
> weights per collation element) does not ease its implementation,
> and a simpler presentation with a unique weight (selected in a
> range that clearly indicates to which collation level it
> belongs to) would have been much more useful and much simpler
> to implement as well.
As Mark and I have both stated, anyone is free to preprocess
the allkeys.txt table into whatever form they choose for
their implementation. However, the current format of the table
is the result of consensus decision by the Unicode Technical
Committee, and is unlikely to be changed, since that would
destabilize it for implementers -- including those who have
tools to preprocess the current format into whatever format
they prefer to use.
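
As a rough illustration of that freedom, the Python sketch below
flattens multi-weight collation elements into single integers whose
numeric range identifies the level, which is roughly the kind of
presentation asked for in the quoted text. The encoding
(level << 16) | weight, the assumption that weights fit in 16 bits,
and the names flatten and level_of are all hypothetical choices made
for this example; this is not a format defined by the UCA or the UTC.

```python
def flatten(elements, bits=16):
    """Encode each nonzero weight as (level << bits) | weight.

    elements is a sequence of weight tuples, one tuple per collation
    element, with levels numbered from 1. Zero (ignorable) weights are
    dropped. Assumes every weight fits in `bits` bits.
    """
    out = []
    for weights in elements:
        for level, weight in enumerate(weights, start=1):
            if weight:
                out.append((level << bits) | weight)
    return out


def level_of(encoded, bits=16):
    """Recover the collation level from a flattened weight."""
    return encoded >> bits


# Made-up weights for illustration: one ordinary element and one
# element whose primary weight is zero.
flat = flatten([(0x0A15, 0x0020, 0x0008, 0x0041),
                (0x0000, 0x0021, 0x0002, 0x0301)])
```

A derived form like this is a private implementation detail; the
published table keeps its existing format, and the derived data would
be regenerated from each new version of allkeys.txt.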
--Ken