Re: Name Compression. Comparison and Tweaks

From: Torsten Mohrin (mohrin@sharmahd.com)
Date: Sat May 13 2000 - 11:57:37 EDT


Kenneth Whistler wrote:

>Approach 2: Torsten Mohrin
>[...]
>Claimed compression: 262,000 bytes ==> 59,000 bytes for the recoded
>names + 29,000 bytes for the code->word lookup table. Total size,
>around 88,000 bytes.
>[...]
>My analysis of the Unicode 3.0.0 names list shows the names alone
>as comprising 278,346 bytes (counting a one-byte delimiter between
>each) for 10,538 names (omitting the control codes and the ranges
>of characters with algorithmically derivable names).

I just want to clarify the reasons for this difference:

1. The names of C0 and C1 controls (from Unicode 1.0) are also
encoded.

2. Names of CJK COMPATIBILITY IDEOGRAPHs (F900 -> FA2D) are derived
algorithmically, saving 9966 bytes.

3. Names of BRAILLE PATTERNs are also derived algorithmically, saving
6656 bytes. The compiled code for this algorithm takes less than 1 KB.
A net saving of about 5.5 KB is not much, but I couldn't resist :)
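
To make the byte counts in points 2 and 3 easy to check, here is a
minimal sketch in Python. It is not the compiled code mentioned above.
The function names are made up for illustration, and it assumes the
one-byte delimiter per name used in the quoted analysis, plus the
standard Braille layout where bit n of the offset from 2800 stands for
dot n+1.

```python
def cjk_compat_name(cp):
    """Derive the name of a CJK compatibility ideograph in F900..FA2D."""
    if not 0xF900 <= cp <= 0xFA2D:
        raise ValueError("code point outside F900..FA2D")
    return "CJK COMPATIBILITY IDEOGRAPH-%04X" % cp


def braille_name(cp):
    """Derive the name of a Braille pattern in 2800..28FF."""
    if not 0x2800 <= cp <= 0x28FF:
        raise ValueError("code point outside 2800..28FF")
    bits = cp - 0x2800
    if bits == 0:
        return "BRAILLE PATTERN BLANK"
    # Assumption: bit 0 is dot 1, bit 1 is dot 2, ... bit 7 is dot 8.
    dots = "".join(str(d + 1) for d in range(8) if bits & (1 << d))
    return "BRAILLE PATTERN DOTS-" + dots


def stored_size(names):
    """Bytes needed to store the names verbatim, one delimiter byte each."""
    return sum(len(name) + 1 for name in names)


# Bytes that no longer need to be stored once the names are derived.
cjk_bytes = stored_size(cjk_compat_name(cp) for cp in range(0xF900, 0xFA2E))
braille_bytes = stored_size(braille_name(cp) for cp in range(0x2800, 0x2900))

if __name__ == "__main__":
    print(cjk_bytes, braille_bytes)
```

Under these assumptions the two totals come out to the 9966 and 6656
bytes given above.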

--Torsten


