Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 18 2003 - 23:19:08 EST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > It's true that using SCSU in a Trie requires a stable compressor. I
    > think it's a design issue of SCSU: the standard specifies no standard
    > compressor that produces a predictable compressed string which can
    > then be compared. SCSU would have benefited from having a "reference
    > implementation" whose output could be compared by exact binary
    > matching. In SCSU, you can't compare two compressed strings without
    > first decompressing them to some UTF.

    It's true, that is how SCSU was designed. There is no promise that any
    two encoders will necessarily use the same strategy, or even the same
    windows, to compress a given chunk of text. There is a reference
    encoder at <ftp://www.unicode.org/Public/PROGRAMS/SCSU/>, written in the
    language Microsoft now calls "Brand J," but conformant encoders are not
    required to produce the same output.
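
    To make that concrete, here is a toy Python sketch. It decodes only a
    tiny subset of SCSU (the initial dynamic window 0 and the SQU tag, not
    the full state machine) and is not any real implementation, but it is
    enough to show that two distinct byte sequences can both be valid SCSU
    encodings of the same text, which is why compressed strings cannot be
    compared byte-for-byte without decompressing first:

```python
# Toy decoder for a small subset of SCSU: ASCII pass-through, the
# initial dynamic window 0 (offset 0x0080), and the SQU tag (0x0E),
# which quotes the next two bytes as one UTF-16BE code unit.
# A real decoder handles the full tag set and window state machine.

SQU = 0x0E                # quote one Unicode code unit
WINDOW0_OFFSET = 0x0080   # initial active dynamic window (Latin-1 range)

def decode_scsu_subset(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SQU:
            out.append(chr((data[i + 1] << 8) | data[i + 2]))
            i += 3
        elif 0x20 <= b <= 0x7F:
            out.append(chr(b))                            # ASCII as-is
            i += 1
        elif b >= 0x80:
            out.append(chr(WINDOW0_OFFSET + (b - 0x80)))  # active window
            i += 1
        else:
            raise ValueError(f"tag 0x{b:02X} not handled in this subset")
    return "".join(out)

# Two different, equally valid SCSU encodings of "ä" (U+00E4):
compact = bytes([0xE4])              # via the active dynamic window
quoted  = bytes([SQU, 0x00, 0xE4])   # via SQU, quoting the code unit

assert decode_scsu_subset(compact) == decode_scsu_subset(quoted) == "\u00e4"
```

    An encoder is free to choose either form (and many others), so a trie
    keyed on SCSU bytes only works if every key is produced by one fixed,
    deterministic encoder.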

    > But this does not mean that SCSU cannot be used in a Trie to store a
    > lexical dictionary. Without it, of course, you can use BOCU, but it
    > will compress strings less well.

    That's not an absolute; it depends on the input text. In my experience,
    SCSU usually does perform somewhat better than BOCU-1, but for some
    scripts (e.g. Korean) the opposite often seems to be true.
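
    Since the thread's subject is ternary search trees, a minimal sketch
    may be useful. The node layout here is illustrative (not any published
    implementation), and the keys are plain Python strings of code points
    rather than compressed bytes; whatever units the keys use, lookup
    compares them unit by unit, so equal words must always map to
    identical key sequences. That is exactly the stability requirement
    under discussion:

```python
# Minimal ternary search tree keyed on code points (non-empty keys).
# Each node holds one unit; left/right children order units that sort
# below/above it, and the middle child continues the key.

class TSTNode:
    __slots__ = ("cp", "lo", "eq", "hi", "is_word")
    def __init__(self, cp):
        self.cp = cp
        self.lo = self.eq = self.hi = None
        self.is_word = False

def tst_insert(node, word, i=0):
    cp = word[i]
    if node is None:
        node = TSTNode(cp)
    if cp < node.cp:
        node.lo = tst_insert(node.lo, word, i)
    elif cp > node.cp:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    while node is not None:
        cp = word[i]
        if cp < node.cp:
            node = node.lo
        elif cp > node.cp:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq
            i += 1
        else:
            return node.is_word
    return False

root = None
for w in ("chat", "château", "chien"):   # non-ASCII keys work unchanged
    root = tst_insert(root, w)

assert tst_contains(root, "château")
assert not tst_contains(root, "cha")     # prefix alone is not a word
```

    Storing compressed bytes instead of code points shrinks each node's
    key unit but changes nothing in the lookup logic.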

    > There's however another option: use a re-encoder that takes the LZW
    > lookup dictionary initially built from some UTF like UTF-32, and
    > applies a statistical Huffman or arithmetic encoding to each coded
    > character...

    You may be interested in a paper I've written on Unicode compression,
    due out in a week or so. Basically it seems that virtually all existing
    Huffman encoders (and probably arithmetic encoders as well) assume 8-bit
    input, and there is a concern that retargeting such tools to work with
    16- or 32-bit Unicode text may have a significant effect on speed. IMHO
    the compression benefits would be big-time, though.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 00:15:03 EST