Re: Name Compression. Comparison and Tweaks

From: Pierpaolo Bernardi (bernardp@cli.di.unipi.it)
Date: Fri May 19 2000 - 12:38:15 EDT


On Fri, 12 May 2000, Kenneth Whistler wrote:

> Approach 3: Pierpaolo
>
> In a Lisp environment, just put each unique word in the names into
> a table, and then treat each name as a vector of vectors.

A vector of indexes (16 bits each).

> Claimed compression: not designated. However, I don't think this
> approach would be a very competitive one when it comes to compression.

Agreed. I was not trying to be ultrasmart, just trying to avoid the
obvious waste. Maybe I'll implement one of the other proposed schemes,
if I get my hands on that portion of the code again.

My implementation has the advantage that the whole thing (preparing the
tables, printing the encoded names, etc.) is around ten lines of code.
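Roughly, a sketch of the idea in Common Lisp might look like the
following (the names here, and the space-only splitting, are just for
illustration, not the actual code):

  (defvar *words* (make-hash-table :test #'equal))   ; word string -> index
  (defvar *word-table*                                ; index -> word string
    (make-array 0 :adjustable t :fill-pointer 0))

  (defun word-index (word)
    ;; return the index of WORD, adding it to the tables if it is new
    (or (gethash word *words*)
        (setf (gethash word *words*)
              (vector-push-extend word *word-table*))))

  (defun split-name (name)
    ;; split NAME on spaces; treating HYPHEN-MINUS as a delimiter too
    ;; would also mean remembering which separator to put back
    (loop with start = 0
          for pos = (position #\Space name :start start)
          collect (subseq name start pos)
          while pos
          do (setf start (1+ pos))))

  (defun encode-name (name)
    ;; a name becomes a vector of 16-bit indexes into *WORD-TABLE*
    (map '(vector (unsigned-byte 16)) #'word-index (split-name name)))

  (defun decode-name (encoded)
    ;; rebuild the name from the indexes (spaces only, see above)
    (format nil "~{~A~^ ~}"
            (map 'list (lambda (i) (aref *word-table* i)) encoded)))

So (decode-name (encode-name "LATIN SMALL LETTER A")) gives back
"LATIN SMALL LETTER A", and only the indexes and the table of unique
words need to be kept around.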

> My analysis of the Unicode 3.0.0 names list shows the names alone
> as comprising 278,346 bytes (counting a one-byte delimiter between
> each) for 10,538 names (omitting the control codes and the ranges
> of characters with algorithmically derivable names). If SPACE and
> HYPHEN-MINUS are taken to be word delimiters for the purposes of name
> analysis, the list has 43,430 word tokens in it.
>
> Without doing some other fancy work, representation of a list of
> 43,430 word tokens by pointing to a table of unique word tokens
> is going to take 43,430 pointers, and at 4 bytes each for a typical
> 32-bit implementation, that is 173,720 bytes, without counting the
> table of unique words itself.

I'm not using pointers but 16-bit indexes.
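To put numbers on it: 43,430 word tokens x 2 bytes = 86,860 bytes for
the index vectors, versus the 173,720 bytes for 32-bit pointers; the
table of unique words comes on top of that in both cases.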

Have a nice weekend.

  Pierpaolo


