From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 18 2003 - 06:31:32 EST
From: "Mark Davis" <mark.davis@jtcsv.com>
> We tend to use tries, which have very good performance characteristics.
See
> "bits of unicode" on my site: www.macchiato.com.
You'll find many references of books, and papers available in PDF format on:
http://citeseer.nj.nec.com/147026.html
Various aspects are studied in the list of references, including performance
and data compression methods.
Unicode is not an issue in this approach, but most of the referenced
algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,
but SCSU-8 could be used as well, provided that it does not break too much
the redundancy of substrings for efficient detection and storage of
non-prefix common substrings.
An issue with SCSU-8 or even UTF-8 is that it still allows performing
compares within the search/browse algorithm on something not better than
just the internal binary ordering of encoded sequences (which is really
distinct from the normal collation order you'd find in a classic
dictionnary).
So yes, the choice of an encoding appropriate to the language you're
collating in the dictionnary is relevant and important prior to map the
encoded strings in any tree structure.
This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 07:24:16 EST