Re: DUCET and supplementary foldings (was: Looking for transcription or transliteration standards latin->arabic)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 13 2004 - 01:23:36 CDT

    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    > I have a certain sympathy for the idea of designing UCA so that the
    > untailored *default* works for such kind of multilingual usage. However,
    > the other use of the DUCET is to be the most convenient base for applying
    > all tailorings. I have a certain sympathy for the position that claims
    > that there are important, but perhaps specialized or not economically
    > powerful classes of users that will not likely have access to a tailored
    > UCA for their language or writing system.
    >
    > If that is really the case, i.e. appreciable numbers of smaller languages
    > would be able to survive without tailoring, then the alternative to fixing
    > the DUCET could be a separate publication of a common base tailoring for
    > multilingual data access. (A base tailoring would be applied before
    > further tailoring for a specific language).

    I very much appreciate this analysis. The DUCET effectively has two
    intended uses whose goals conflict. If it is used as a base collation from
    which a language-specific collation can be built with just a few rules, then
    the other common use, multilanguage searching, is not easy to build on top
    of it.

    Maybe we could think about designing a new standard collation tailoring
    table that could be used as an alternative to the DUCET, but targeting
    multilanguage searches.

    Such a tailoring would include more folding than the DUCET, moving the
    folded differences to a higher weight level. It could be given a name
    (MUCET, for Multilanguage Unicode Collation Elements Table?) and be
    supported alongside the DUCET.
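
    To make "putting the differences at a higher weight level" concrete, here
    is a minimal sketch. The table and its weights are invented for
    illustration and are not taken from the DUCET; the names TOY_TABLE and
    sort_key are hypothetical.

```python
# Toy collation element table (hypothetical weights, not taken from the DUCET).
# Each character maps to (primary, secondary, tertiary) weights.
TOY_TABLE = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),  # case difference pushed down to level 3
    "á": (1, 1, 0),  # accent difference pushed down to level 2
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def sort_key(text, max_level=3):
    """Build a key: all primary weights, then all secondary, and so on."""
    key = []
    for level in range(max_level):
        key.extend(TOY_TABLE[ch][level] for ch in text)
        key.append(-1)  # level separator, lower than any weight in the table
    return tuple(key)
```

    With such a table, a search that compares only level 1 treats "ab" and
    "áB" as equal, while a full three-level comparison still tells them apart.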

    The DUCET is now quite stable and there's no need to change it: it is well
    known and certainly used in many applications that depend on it (RDBMS
    engines notably). But a MUCET would certainly be useful, including for users
    who would no longer need to search for multiple words in a multilanguage
    database or simply on the web. In addition, nothing prevents sorting the
    matching entries by relevance, using the DUCET as a secondary collation
    order.
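
    The two-step idea (match with a heavily folded key, then order the hits
    with a finer comparison) could look roughly like the sketch below. It is
    only an approximation: the fold uses NFKD plus removal of combining marks
    and case folding as a stand-in for a MUCET, and the final ordering uses
    "exact match first, then code point order" as a stand-in for a real
    DUCET-based comparison. The function names are hypothetical.

```python
import unicodedata


def fold(text):
    """Aggressive search fold: NFKD, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query, entries):
    """Match on the folded form, then order the hits with a finer comparison."""
    target = fold(query)
    hits = [entry for entry in entries if fold(entry) == target]
    # Stand-in for a DUCET-based secondary order (assumption):
    # exact matches first, then plain code point order.
    return sorted(hits, key=lambda entry: (entry != query, entry))
```

    For example, searching for "resume" in a list that also contains "Résumé"
    and "RESUME" returns all three, with the exact spelling ranked first.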

    After all, a collation elements table works exactly like a custom
    decomposition table that creates additional strings whose encoding is not
    portable, because it depends on weight values. Using custom decompositions
    is often much simpler than implementing a multilevel collation, since it can
    reuse existing algorithms implemented for the NFD and NFKD decompositions.
    In such a view, some extra decompositions are needed, using non-standard
    Unicode characters for some elements (for example, when decomposing an AE
    letter into a ligature with an extra custom control that has a higher
    collation level, to be used only for full collation order but ignorable for
    searches limited to level 1 or 2).
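
    One possible reading of the AE example, as a rough sketch: the letter is
    expanded to "AE" followed by a marker that only full ordering looks at.
    The marker (a private-use code point, U+E000) and the CUSTOM table are
    assumptions made for illustration; neither is defined by Unicode or UCA.

```python
import unicodedata

# Hypothetical marker: a private-use code point standing in for the
# "extra custom control" that only matters at the highest collation level.
LIGATURE_MARK = "\uE000"

# Custom decompositions that NFD and NFKD do not provide (illustrative only).
CUSTOM = {
    "\u00C6": "AE" + LIGATURE_MARK,  # LATIN CAPITAL LETTER AE
    "\u00E6": "ae" + LIGATURE_MARK,  # LATIN SMALL LETTER AE
}


def decompose(text):
    """Apply the custom mappings, then the standard NFD decomposition."""
    expanded = "".join(CUSTOM.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", expanded)


def search_form(text):
    """Form used for level-limited searches: the marker is ignored."""
    return decompose(text).replace(LIGATURE_MARK, "")
```

    Here search_form("Æther") and search_form("AEther") are identical, while
    decompose() keeps the marker so that a full ordering can still separate
    the two spellings.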


