From: Hans Aberg (haberg@math.su.se)
Date: Mon Sep 18 2006 - 05:00:40 CDT
On 17 Sep 2006, at 23:41, Mark Davis wrote:
> > Technical bias arises in encoding schemes for text such as
> > Unicode UTF-8, which causes text in a non-roman script to require
> > two to three times more space than comparable text in a roman script
> Character frequency. One can't just compare the amount that a
> particular character will grow or shrink; you have to look at the
> frequency of usage of characters in the language.
It seems to me that one should employ what might be called a character
compression method, i.e., a compression method that compresses the
character numbers (code points) rather than the encoded binary data,
as this is probably more efficient in view of how compression
algorithms work (i.e., by finding statistical regularities and using a
variable-size encoding, with smaller sizes for the more frequent
combinations).
Then, of course, the compressed size of a file with Unicode text is
independent of the encoding (UTF-N, N = 7, 8, 16, 32, etc.) used.
The encoding itself can then be chosen on other criteria alone.
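
A minimal sketch of the idea in Python, under stated assumptions: the
sample text, the helper names code_points and ideal_size_bits, and the
use of an order-0 entropy estimate in place of a real compressor are
all illustrative and not part of any existing compression format. The
same text is stored in three encodings, decoded back to code points,
and the size an ideal entropy coder would need for those code points
is estimated.

```python
import math
from collections import Counter


def code_points(data: bytes, encoding: str) -> list[int]:
    """Decode encoded text and return its sequence of code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def ideal_size_bits(symbols) -> float:
    """Order-0 Shannon estimate, in bits, for coding the symbol sequence.

    This is a stand-in for a real entropy coder (an assumption of this
    sketch): each symbol costs -log2(its relative frequency) bits.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(n * math.log2(n / total) for n in counts.values())


# Hypothetical sample: a short non-roman text, repeated.
text = "Привет, мир! " * 20

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = text.encode(enc)                        # size depends on the encoding
    bits = ideal_size_bits(code_points(raw, enc)) # size does not
    print(f"{enc:10s} raw bytes: {len(raw):5d}  estimated bits: {bits:.1f}")
```

The raw byte counts differ between the three encodings, while the
estimate computed from the code points is the same for all of them,
since each encoding decodes to the same code point sequence. A
practical method would presumably model longer combinations of
characters rather than single code points.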
Perhaps Unicode should take up the initiative, persuading
implementers of common compression formats to implement such
character compression methods.
Hans Aberg