Re: Unicode dictionary coding? UTF8, UTF32, etc

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 15 2003 - 04:35:15 EST

Next message: Peter Kirk: "Re: How can I input any Unicode character if I know its hexadecimal code?"

Previous message: Rick Cameron: "RE: How can I input any Unicode character if I know its hexadecim al code?"
In reply to: John Cowan: "Re: Unicode dictionary coding? UTF8, UTF32, etc"
Next in thread: Markus Scherer: "Re: Unicode dictionary coding? UTF8, UTF32, etc"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "John Cowan" <jcowan@reutershealth.com>
> Theodore H. Smith scripsit:
> > Can someone give me some advice? If I was to write a dictionary class
> > for Unicode, would I be better off writing it using a b-tree, or
> > hash-bin system? Or maybe an array of pointers to arrays system?
>
> Google for "ternary search trees". It's a very interesting
> technique which provides a combination of trie and binary tree.

OK for that structure storage, but it will not answer to the prefered
storage format.

I think that if you are building a dictionnary in some language, you have to
find a variable-length Huffman-like encoding appropriate to the character
subset used by that language, in order to minimize storage space for your
dictionnary.

This effectively will create you own encoding scheme for Unicode, if you
design your it so that the most widely used letters are given the shortest
byte sequence, and the more rerely used Unicode character are given longer
length if needed.

So the input of your dictionnary will be reencoded before storing in the
trie-like structure (which will detect and compact common prefixes).

The other question to ask whever you need your dictionnary to be not only
looked up rapidly (with binary search), but also read sequentially: the
encoding format, to keep the sort order, may not be optimized as well as if
only random lookup was performed, so that sequential read remains possible.
Some data structures are inappropriate for efficient random read, such as
hashcode-lookup tables, which would need to store in the leaf node
additional pointers to keep the sequence order.

If you need to keep your dictionnary in collation order, then it may be a
good idea to encode your Unicode string by the sequence of its collation
weight at its last collation level (in a reversible way), instead of
sequence of code points, prior to storing it in the dictionnary. This has an
additional benefit: you can efficiently lookup in your dictionnary by
searching for matches at level 1 or 2 only, and get several strings that
collate together with the searched string. So you can give to the user the
capability of auto-correcting missing or bad accents.

To do it, each of your dictionnary entries will contain all collation
weights at level 1, then all weights at successive levels until the string
is not ambiguous.
The DUCET may help you, but I do think that it's up to your dictionnary to
define its own collation order, and its encoding, and it should store
somewhere in a header structure the association table between code points
and sequences of collation weights used in the main trie structure.

Next message: Peter Kirk: "Re: How can I input any Unicode character if I know its hexadecimal code?"
Previous message: Rick Cameron: "RE: How can I input any Unicode character if I know its hexadecim al code?"
In reply to: John Cowan: "Re: Unicode dictionary coding? UTF8, UTF32, etc"
Next in thread: Markus Scherer: "Re: Unicode dictionary coding? UTF8, UTF32, etc"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Nov 15 2003 - 05:09:39 EST