re: Ordering of scripts in DUCET?

From: verdy_p (verdy_p@wanadoo.fr)
Date: Tue Dec 02 2008 - 17:01:45 CST


> From: "Harold S. Henry" <harold@talerian.com>
> To: unicode@unicode.org
> Cc:
> Subject: Ordering of scripts in DUCET?
>
> Is there a documented logic or specification that governs the ordering of
> scripts in DUCET? In other words, is there a way to predict where future
> scripts will be inserted into the current primary collation order for
> letters?

There is at least a documented structure by type of entry:
- ignored
- ignorable
- diacritics
- symbols and punctuation
- numbers
- letters and alphabets.
But I have not seen a clear statement about the order of letters (or of numbers) according to the script to which
they belong. From what I've seen, scripts appear to be ordered by the first Unicode/ISO 10646 character block
in which they occur, so you could "predict" the placement of future scripts as being more or
less what is displayed as a preview in the Unicode Road Map.
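
As a rough sketch of that guess (it is only my observation, not a documented rule, and the real DUCET may deviate from it), one can sort scripts by the start of their first block. The dictionary and the function name below are just illustrations; the block starts are the ones from the Unicode block list.

```python
# First code point of the first block allocated to each script.
FIRST_BLOCK_START = {
    "Hebrew": 0x0590,
    "Latin": 0x0000,     # Basic Latin
    "Arabic": 0x0600,
    "Cyrillic": 0x0400,
    "Greek": 0x0370,     # Greek and Coptic
    "Armenian": 0x0530,
}


def guess_script_order(first_block_start):
    """Order scripts by where their first block starts (heuristic only)."""
    return sorted(first_block_start, key=first_block_start.get)


guessed = guess_script_order(FIRST_BLOCK_START)
print(guessed)
```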

But there may be other reasons why this order would not be kept: it is more important to keep the scripts in the
DUCET ordered in a way that is consistent with at least one of the major languages that use each script. So
the effective order is based on what is expected for collating this primary language, in order to minimize the
number of tailoring rules needed to support that language in its primary collation order (which other languages
will simply borrow by default, because they do not regulate the order of these other scripts).

I think the goal is effectively to keep tailored collations as simple and small as possible, so that they can
be created with very few entries specific to a locale. The smaller these collation tailorings are, the better it
is for everyone (but there will still remain some locales, for more "minor" languages or collation conventions, in
which a more complex tailoring may be needed).

There are at least two good reasons for this:
* the DUCET may be pre-encoded in a text parser using very efficient representations, but supporting
locale-specific collations has a cost in terms of implementation (code complexity, memory use and
performance), because creating and loading a tailored collation (notably dynamically at run-time) requires a LOT of
operations to prepare the various lookup tables, or a LOT of operations may be needed when
working through specific and slow "hooks" within the DUCET implementation code.
* supporting tailored collations means that you must be able to inject new collation key elements into the DUCET,
without having to renumber it if possible (renumbering may be impossible to perform in a reasonable time if
collation keys are computed and then stored externally, such as in a database); however, the "gaps" available in the
DUCET for minimal tailoring may be insufficient to support complex tailoring, in which case renumbering keys
becomes unavoidable.
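
To make the second point concrete, here is a minimal sketch of gap-based tailoring. It assumes a single level of integer primary weights; the weights, the spacing of 4 and the function name insert_between are hypothetical and are not taken from the DUCET.

```python
def insert_between(weights, before, after, new_item):
    """Give new_item a primary weight strictly between two existing items.

    Raises ValueError when the gap is exhausted, which is the case where
    the whole table would have to be renumbered.
    """
    lo, hi = weights[before], weights[after]
    if hi - lo < 2:
        raise ValueError("no gap left between %r and %r" % (before, after))
    weights[new_item] = (lo + hi) // 2


# Hypothetical primary weights, spaced by 4 to leave room for tailoring.
weights = {"a": 100, "b": 104, "c": 108}

insert_between(weights, "a", "b", "ä")   # takes 102
insert_between(weights, "a", "ä", "â")   # takes 101

try:
    insert_between(weights, "a", "â", "à")   # 100 and 101: nothing in between
    gap_exhausted = False
except ValueError:
    gap_exhausted = True

ordered = sorted(weights, key=weights.get)
```

Two insertions fit in the gap without touching any existing weight, so keys already stored elsewhere stay valid. The third one does not fit, and at that point the only way out in this sketch is to renumber.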

That's why the DUCET must (should?) be prepared in a way that has been tested against several of the tailored
collations that may be needed to support collation orders other than the primary one that was selected as
the base for building it.

It is interesting that the CLDR project now collects tailored collations for various locales, not just the
primary languages. This makes it possible to test future versions of the DUCET so as to reduce the
complexity of the code needed for some locales, and it may also allow the DUCET to insert larger gaps in some parts
of the table, making the typical tailorings easier to implement without forcing
DUCET users to rebuild (and renumber) their own equivalent version of the DUCET.

Note: the DUCET should be stable, to avoid changing the collation order in existing applications working in
existing locales. But the stability is not in terms of the exact numeric value of the collation key elements; what
is stable (or should be stable) is only the relative numeric order of collation key elements at some collation
level, the level at which non-null key elements are given, and the number of key elements displayed.
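
A small sketch of what this kind of stability means in practice, assuming a simplified one-level key made of primary weights only; both tables and the name sort_key are hypothetical.

```python
def sort_key(text, table):
    """Map each character to its primary weight (simplified one-level key)."""
    return [table[ch] for ch in text]


# Two hypothetical versions of a weight table: the numbers differ,
# but the relative order a < b < c is the same in both.
table_v1 = {"a": 10, "b": 20, "c": 30}
table_v2 = {"a": 100, "b": 250, "c": 251}

words = ["cab", "abc", "bca", "ab"]

sorted_v1 = sorted(words, key=lambda w: sort_key(w, table_v1))
sorted_v2 = sorted(words, key=lambda w: sort_key(w, table_v2))
```

Both versions sort the words identically, although the keys themselves are different. So an application that only compares strings sees no change, while one that stored keys computed with the first table cannot mix them with keys computed from the second.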

If some existing collation elements are bogus, however, they may eventually be changed to correct severe
defects, but not if the defect is very minor and occurs only in very limited cases: applications that want
a correction for some keys can still use their own simple tailoring rules on top of the DUCET to override it.

And in fact, this could also be documented somewhere, e.g. in the CLDR repository, to avoid breaking existing
applications where an unexpected change of collation order could generate severe issues (like in relational
database SQL sub-selections by range used in financial or statistics applications).

Conversely, critical applications that need exact sub-selections should never depend directly on
collation order, not even in a specific locale; they should really revise their data schema to create their own
relational classification tables and ids (or other metadata systems for their collections of full-text or XML documents)...
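
For illustration only, a sketch of such a schema using Python's sqlite3 module with an in-memory database. The table names, column names and sample rows are hypothetical; the point is that the selection goes through an explicit category id rather than a range over collated strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE account (
        name TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES category(id)
    );
    INSERT INTO category VALUES (1, 'A-M'), (2, 'N-Z');
    INSERT INTO account VALUES ('Émile', 1), ('Zoe', 2), ('adam', 1);
""")

# Select by the explicit classification id; no string range such as
# "name BETWEEN 'A' AND 'M'" is involved, so a change of collation
# order cannot change which rows are returned.
rows = conn.execute(
    "SELECT name FROM account WHERE category_id = ? ORDER BY rowid", (1,)
).fetchall()
names = [row[0] for row in rows]
conn.close()
```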

Philippe.


