Re: Unicode dictionary coding? UTF8, UTF32, etc

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 15 2003 - 04:35:15 EST

  • Next message: Peter Kirk: "Re: How can I input any Unicode character if I know its hexadecimal code?"

    From: "John Cowan" <jcowan@reutershealth.com>
    > Theodore H. Smith scripsit:
    > > Can someone give me some advice? If I was to write a dictionary class
    > > for Unicode, would I be better off writing it using a b-tree, or
    > > hash-bin system? Or maybe an array of pointers to arrays system?
    >
    > Google for "ternary search trees". It's a very interesting
    > technique which provides a combination of trie and binary tree.

    OK for that structure storage, but it will not answer to the prefered
    storage format.

    I think that if you are building a dictionnary in some language, you have to
    find a variable-length Huffman-like encoding appropriate to the character
    subset used by that language, in order to minimize storage space for your
    dictionnary.

    This effectively will create you own encoding scheme for Unicode, if you
    design your it so that the most widely used letters are given the shortest
    byte sequence, and the more rerely used Unicode character are given longer
    length if needed.

    So the input of your dictionnary will be reencoded before storing in the
    trie-like structure (which will detect and compact common prefixes).

    The other question to ask whever you need your dictionnary to be not only
    looked up rapidly (with binary search), but also read sequentially: the
    encoding format, to keep the sort order, may not be optimized as well as if
    only random lookup was performed, so that sequential read remains possible.
    Some data structures are inappropriate for efficient random read, such as
    hashcode-lookup tables, which would need to store in the leaf node
    additional pointers to keep the sequence order.

    If you need to keep your dictionnary in collation order, then it may be a
    good idea to encode your Unicode string by the sequence of its collation
    weight at its last collation level (in a reversible way), instead of
    sequence of code points, prior to storing it in the dictionnary. This has an
    additional benefit: you can efficiently lookup in your dictionnary by
    searching for matches at level 1 or 2 only, and get several strings that
    collate together with the searched string. So you can give to the user the
    capability of auto-correcting missing or bad accents.

    To do it, each of your dictionnary entries will contain all collation
    weights at level 1, then all weights at successive levels until the string
    is not ambiguous.
    The DUCET may help you, but I do think that it's up to your dictionnary to
    define its own collation order, and its encoding, and it should store
    somewhere in a header structure the association table between code points
    and sequences of collation weights used in the main trie structure.



    This archive was generated by hypermail 2.1.5 : Sat Nov 15 2003 - 05:09:39 EST