From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Nov 18 2003 - 12:25:03 EST
Philippe Verdy wrote:
> Unicode is not an issue in this approach, but most of the referenced
> algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,
Either UTF-8 or UTF-16 would probably work, depending on the rest of the application. UTF-16 may be
better because it fits with many Unicode libraries and APIs that work with 16-bit Unicode strings.
> but SCSU-8 could be used as well, provided that it does not break too much
> the redundancy of substrings for efficient detection and storage of
> non-prefix common substrings.
There is no such thing as SCSU-8.
SCSU is probably not an option because it is not a deterministic encoding. A small change in the
encoder will result in a very different dictionary structure, and/or a failure to find a word. In
order to use SCSU, you would have to write, control and stabilize your own encoder.
BOCU-1 would work if you always start looking up words from their beginning, not subwords from the
middle of a word.
UTF-16 is probably best.
> An issue with SCSU-8 or even UTF-8 is that it still allows performing
> compares within the search/browse algorithm on something not better than
> just the internal binary ordering of encoded sequences (which is really
> distinct from the normal collation order you'd find in a classic
> dictionnary).
If you need linguistically correct comparisons according to some sort order, then you could use the
collation elements, or parts of them, for a word instead of the characters.
http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html#Collation_Element_Iterator
markus
This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 13:29:58 EST