Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 18 2003 - 06:31:32 EST

  • Next message: Theodore H. Smith: "Re: Ternary search trees for Unicode dictionaries"

    From: "Mark Davis" <mark.davis@jtcsv.com>
    > We tend to use tries, which have very good performance characteristics.
    See
    > "bits of unicode" on my site: www.macchiato.com.

    You'll find many references of books, and papers available in PDF format on:
    http://citeseer.nj.nec.com/147026.html

    Various aspects are studied in the list of references, including performance
    and data compression methods.

    Unicode is not an issue in this approach, but most of the referenced
    algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,
    but SCSU-8 could be used as well, provided that it does not break too much
    the redundancy of substrings for efficient detection and storage of
    non-prefix common substrings.

    An issue with SCSU-8 or even UTF-8 is that it still allows performing
    compares within the search/browse algorithm on something not better than
    just the internal binary ordering of encoded sequences (which is really
    distinct from the normal collation order you'd find in a classic
    dictionnary).

    So yes, the choice of an encoding appropriate to the language you're
    collating in the dictionnary is relevant and important prior to map the
    encoded strings in any tree structure.



    This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 07:24:16 EST