Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 18 2003 - 23:19:08 EST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > It's true that using SCSU in a Trie requires a stable compressor. I
    > think it's a design issue of SCSU: the standard specifies no standard
    > compressor that produces a predictable compressed string which can
    > then be compared. SCSU would have benefited from having a "reference
    > implementation" whose output could be compared by exact binary
    > matching. In SCSU, you can't compare two compressed strings without
    > first decompressing them to some UTF.

    It's true, that is how SCSU was designed. There is no promise that any
    two encoders will necessarily use the same strategy, or even the same
    windows, to compress a given chunk of text. There is a reference
    encoder at <ftp://www.unicode.org/Public/PROGRAMS/SCSU/>, written in the
    language Microsoft now calls "Brand J," but conformant encoders are not
    required to produce the same output.
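
    To make that concrete, here is a toy Python sketch. It decodes only a
    tiny subset of SCSU (the initial dynamic window 0 and the SQU tag, not
    the full state machine) and is not any real implementation, but it is
    enough to show that two distinct byte sequences can both be valid SCSU
    encodings of the same text, which is why compressed strings cannot be
    compared byte-for-byte without decompressing first:

```python
# Toy decoder for a small subset of SCSU: ASCII pass-through, the
# initial dynamic window 0 (offset 0x0080), and the SQU tag (0x0E),
# which quotes the next two bytes as one UTF-16BE code unit.
# A real decoder handles the full tag set and window state machine.

SQU = 0x0E                # quote one Unicode code unit
WINDOW0_OFFSET = 0x0080   # initial active dynamic window (Latin-1 range)

def decode_scsu_subset(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SQU:
            out.append(chr((data[i + 1] << 8) | data[i + 2]))
            i += 3
        elif 0x20 <= b <= 0x7F:
            out.append(chr(b))                            # ASCII as-is
            i += 1
        elif b >= 0x80:
            out.append(chr(WINDOW0_OFFSET + (b - 0x80)))  # active window
            i += 1
        else:
            raise ValueError(f"tag 0x{b:02X} not handled in this subset")
    return "".join(out)

# Two different, equally valid SCSU encodings of "ä" (U+00E4):
compact = bytes([0xE4])              # via the active dynamic window
quoted  = bytes([SQU, 0x00, 0xE4])   # via SQU, quoting the code unit

assert decode_scsu_subset(compact) == decode_scsu_subset(quoted) == "\u00e4"
```

    An encoder is free to choose either form (and many others), so a trie
    keyed on SCSU bytes only works if every key is produced by one fixed,
    deterministic encoder.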

    > But this does not mean that SCSU cannot be used in a Trie to store a
    > lexical dictionary. Without it, of course, you can use BOCU, but it
    > will compress strings less well.

    That's not an absolute; it depends on the input text. In my experience,
    SCSU usually does perform somewhat better than BOCU-1, but for some
    scripts (e.g. Korean) the opposite often seems to be true.
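
    Since the thread's subject is ternary search trees, a minimal sketch
    may be useful. The node layout here is illustrative (not any published
    implementation), and the keys are plain Python strings of code points
    rather than compressed bytes; whatever units the keys use, lookup
    compares them unit by unit, so equal words must always map to
    identical key sequences. That is exactly the stability requirement
    under discussion:

```python
# Minimal ternary search tree keyed on code points (non-empty keys).
# Each node holds one unit; left/right children order units that sort
# below/above it, and the middle child continues the key.

class TSTNode:
    __slots__ = ("cp", "lo", "eq", "hi", "is_word")
    def __init__(self, cp):
        self.cp = cp
        self.lo = self.eq = self.hi = None
        self.is_word = False

def tst_insert(node, word, i=0):
    cp = word[i]
    if node is None:
        node = TSTNode(cp)
    if cp < node.cp:
        node.lo = tst_insert(node.lo, word, i)
    elif cp > node.cp:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    while node is not None:
        cp = word[i]
        if cp < node.cp:
            node = node.lo
        elif cp > node.cp:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq
            i += 1
        else:
            return node.is_word
    return False

root = None
for w in ("chat", "château", "chien"):   # non-ASCII keys work unchanged
    root = tst_insert(root, w)

assert tst_contains(root, "château")
assert not tst_contains(root, "cha")     # prefix alone is not a word
```

    Storing compressed bytes instead of code points shrinks each node's
    key unit but changes nothing in the lookup logic.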

    > There's however another option: use a re-encoder that takes the LZW
    > lookup dictionary initially built from some UTF like UTF-32, and
    > applies a statistical Huffman or arithmetic encoding to each coded
    > character...

    You may be interested in a paper I've written on Unicode compression,
    due out in a week or so. Basically it seems that virtually all existing
    Huffman encoders (and probably arithmetic encoders as well) assume 8-bit
    input, and there is a concern that retargeting such tools to work with
    16- or 32-bit Unicode text may have a significant effect on speed. IMHO
    the compression benefits would be big-time, though.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 00:15:03 EST