Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 18 2003 - 06:31:32 EST

Next message: Theodore H. Smith: "Re: Ternary search trees for Unicode dictionaries"

Previous message: Philippe Verdy: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Mark Davis: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Reply: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Mark Davis" <mark.davis@jtcsv.com>
> We tend to use tries, which have very good performance characteristics.
See
> "bits of unicode" on my site: www.macchiato.com.

You'll find many references of books, and papers available in PDF format on:
http://citeseer.nj.nec.com/147026.html

Various aspects are studied in the list of references, including performance
and data compression methods.

Unicode is not an issue in this approach, but most of the referenced
algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,
but SCSU-8 could be used as well, provided that it does not break too much
the redundancy of substrings for efficient detection and storage of
non-prefix common substrings.

An issue with SCSU-8 or even UTF-8 is that it still allows performing
compares within the search/browse algorithm on something not better than
just the internal binary ordering of encoded sequences (which is really
distinct from the normal collation order you'd find in a classic
dictionnary).

So yes, the choice of an encoding appropriate to the language you're
collating in the dictionnary is relevant and important prior to map the
encoded strings in any tree structure.

Next message: Theodore H. Smith: "Re: Ternary search trees for Unicode dictionaries"
Previous message: Philippe Verdy: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Mark Davis: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Reply: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 07:24:16 EST