RE: Compression and Unicode [was: Name Compression]

From: Marco.Cimarosti@icl.com
Date: Fri May 12 2000 - 06:10:28 EDT


Asmus Freytag wrote:
> I wonder whether he [Torsten] has measured the difference
> in size from the effect of dedicating one bit to the
> cause of SPACE vs HYPHEN. [...]
> Hard to know how many words really show up in both forms.

OK, I did the counting, hoping it is worth something. There are 76 words
that show up in both forms, followed by " " in some names and by "-" in
others (see list in l_hs.txt). I also counted the words that can either be
followed by "-" or occur at the end of a name: 68 cases, many of which are
shared with the previous list (see list in l_he.txt).
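
For readers who want to reproduce this kind of count, here is a minimal
sketch in Python. It is not the script used for the figures above; the
function names and the tiny hand-picked sample are assumptions made for
illustration, and the real input would be the full list of character names.

```python
import re
from collections import defaultdict

def separators_after_words(names):
    """Map each word to the set of things seen right after it:
    ' ', '-', or the marker 'END' for end of name."""
    seen = defaultdict(set)
    for name in names:
        # Splitting on a capturing group keeps the separators:
        # words land on even indexes, separators on odd ones.
        tokens = re.split(r"([ -])", name)
        for i in range(0, len(tokens), 2):
            word = tokens[i]
            if not word:
                continue  # empty piece between two adjacent separators
            follower = tokens[i + 1] if i + 1 < len(tokens) else "END"
            seen[word].add(follower)
    return seen

def words_with_both(seen, a, b):
    """Words that were followed by both a and b somewhere in the data."""
    return sorted(w for w, s in seen.items() if a in s and b in s)

# Illustrative sample only, not the full name list.
sample = [
    "LEFT SQUARE BRACKET",
    "LEFT-TO-RIGHT MARK",
    "DOUBLE VERTICAL LINE",
    "DOUBLE-STRUCK CAPITAL C",
    "HYPHEN-MINUS",
    "NON-BREAKING HYPHEN",
]

seen = separators_after_words(sample)
space_or_hyphen = words_with_both(seen, " ", "-")   # the l_hs.txt kind of list
hyphen_or_end = words_with_both(seen, "-", "END")   # the l_he.txt kind of list
```

On this sample, LEFT and DOUBLE fall in the first list and HYPHEN in the
second.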

I noticed that Torsten's scheme assumes that " " and "-" are mutually
exclusive separators, but this is not true for a handful of Tibetan
characters that have sequences like " -" or "- " (see list in l_xx.txt). How
are these cases handled?
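
A quick way to find such names is to search for a space next to a hyphen.
The sketch below is only an illustration: the helper name is made up, and
the three sample names stand in for the full name list.

```python
import re

# A space immediately followed by a hyphen, or a hyphen immediately
# followed by a space.
MIXED_SEPARATORS = re.compile(r" -|- ")

def names_with_mixed_separators(names):
    """Return the names in which ' ' and '-' occur side by side."""
    return [n for n in names if MIXED_SEPARATORS.search(n)]

# Illustrative sample only.
sample_names = [
    "TIBETAN LETTER KA",
    "TIBETAN LETTER -A",
    "TIBETAN MARK BKA- SHOG YIG MGO",
]

mixed = names_with_mixed_separators(sample_names)
```

Any scheme that stores exactly one separator per word boundary would need a
special case for the names this returns.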

> His word table could be further compressed.

Do you refer to the fact that many words are identical to the trailing part
of longer words?

Unfortunately, this cannot be exploited if the names are encoded as
sequences of indexes to *words*, as in Torsten's scheme.

To take advantage of this feature of the data, the word indexes would have
to be replaced by *character* indexes pointing to the beginning of each word
in the table.

This, however, would screw up Torsten's approach of assigning the shortest
possible ids to the most common words in order to save space.
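
The trade-off can be illustrated with a small Python sketch of a word table
addressed by character offsets. This is not Torsten's scheme or anyone's
actual implementation: the function names, the NUL terminator, and the
sample words are all assumptions chosen for the example.

```python
def build_suffix_shared_table(words):
    """Store each word once, NUL-terminated; a word that is the tail of an
    already stored word reuses that word's characters.
    Returns (table, offsets), where offsets maps word -> character offset."""
    chunks = []
    offsets = {}
    suffix_at = {}  # every suffix of every stored word -> its offset
    pos = 0
    # Longest words first, so any possible host is stored before its suffixes.
    for word in sorted(set(words), key=lambda w: (-len(w), w)):
        if word in suffix_at:
            offsets[word] = suffix_at[word]  # shared, no new characters
            continue
        offsets[word] = pos
        for i in range(len(word)):
            suffix_at.setdefault(word[i:], pos + i)
        chunks.append(word + "\0")
        pos += len(word) + 1
    return "".join(chunks), offsets

def read_word(table, offset):
    """Read the word starting at a character offset, up to the terminator."""
    return table[offset:table.index("\0", offset)]

# Illustrative sample only.
words = ["SMALL", "ALL", "NARROW", "ARROW", "LETTER"]
table, offsets = build_suffix_shared_table(words)
unshared_size = sum(len(w) + 1 for w in words)
```

Here the table shrinks from 30 to 20 characters, because ALL and ARROW are
stored inside SMALL and NARROW. The price is that each reference is now a
character offset into the table rather than a small word number, so frequent
words can no longer simply be given the shortest ids.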

> PS:) coders will be coders, they like to invent new coding schemes

Yes, why use the same old standard wheels when you can have such a good
time reinventing your own model?

_ Marco








