From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 06:45:53 CDT
On 21 Sep 2006, at 08:13, Asmus Freytag wrote:
>  If you assume a large alphabet, then your compression gets worse,
> even if the actual number of elements is small.
So why would that be? In one compression method, one simply performs a
frequency analysis of the characters used and encodes based on that,
so table entries are needed only for the characters actually used.
One way to compress characters is to do a frequency analysis and sort
the characters by frequency, which gives a map from code points to
code points. Then apply to that a variable-width character encoding
that assigns smaller widths to smaller non-negative integers, such as
UTF-8. Since the characters actually used are remapped to the smallest
code points, this compression method cannot do worse than plain UTF-8.
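
To make this concrete, here is a small Python sketch of the idea (the
names and the example text are mine, and a real encoder would also
have to transmit the rank table so that the receiver can invert the
map):

from collections import Counter

def rank_map(text):
    # Rank the code points by frequency: 0 = most frequent.
    freq = Counter(text)
    ranked = sorted(freq, key=freq.get, reverse=True)
    return {ch: rank for rank, ch in enumerate(ranked)}

def compress(text):
    ranks = rank_map(text)
    # Treat each rank as a code point and emit its UTF-8 byte sequence,
    # so the most frequent characters get the shortest sequences.
    # (Ranks falling in the surrogate range U+D800..U+DFFF would have to
    # be skipped in a real implementation; ignored in this sketch.)
    return ranks, b"".join(chr(ranks[ch]).encode("utf-8") for ch in text)

table, data = compress("härlig är jorden")
print(len(data), "bytes, versus",
      len("härlig är jorden".encode("utf-8")), "bytes as plain UTF-8")

Of course the rank table itself has to be stored or transmitted, which
is where the size of the alphabet actually used enters the picture.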
   Hans Aberg