Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Mon Sep 18 2006 - 05:00:40 CDT

Next message: Chris Harvey: "Re: FW: Technology leads to cool fonts in Native language"

Previous message: Richard Wordingham: "Re: Unicode & space in programming & l10n"
In reply to: Mark Davis: "Re: Unicode & space in programming & l10n"
Next in thread: Doug Ewell: "Re: Unicode & space in programming & l10n"

On 17 Sep 2006, at 23:41, Mark Davis wrote:

> > Technical bias arises in encoding schemes for text such as
> > Unicode UTF-8, which causes text in a non-roman script to require
> > two to three times more space than comparable text in a roman script

> Character frequency. One can't just compare the amount that a
> particular character will grow or shrink; you have to look at the
> frequency of usage of characters in the language.

It seems to me that one should employ what might be called a character
compression method, i.e., a method that compresses the character
numbers (code points) rather than the encoded binary data, as this is
probably more efficient in view of how compression algorithms work
(i.e., by finding statistical regularities and using a variable-size
encoding, with smaller sizes for the more frequent combinations).
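
As a rough illustration of the frequency argument, the sketch below
estimates how many bits an ideal variable-size code keyed on code points
would need, using the order-0 Shannon entropy. The function name and the
sample string are assumptions made for this example only; a real
compressor would also exploit context between characters.

```python
import math
from collections import Counter


def code_point_entropy_bits(text: str) -> float:
    """Order-0 Shannon estimate, in bits, for coding `text` by code point.

    Each distinct code point is treated as one symbol, so the result does
    not depend on how the text was encoded on disk.
    """
    counts = Counter(text)  # iterating a str yields code points
    total = len(text)
    return -sum(n * math.log2(n / total) for n in counts.values())


# Illustrative sample (hypothetical, not taken from the thread).
sample = "Ελληνικά " * 20
print(len(sample.encode("utf-8")) * 8, "bits as raw UTF-8")
print(round(code_point_entropy_bits(sample)), "bits at the order-0 estimate")
```

The estimate depends only on how often each character occurs, not on
how many bytes a given encoding spends on it.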

Then, of course, the compressed size of a file with Unicode text is
independent of the encoding (UTF-N, N = 7, 8, 16, 32, etc.) used.
The encoding itself can then be chosen on the other criteria alone.
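
A minimal sketch of that independence, under stated assumptions: the
input is first decoded to code points, and only then compressed. Here
zlib over a fixed-width serialization stands in for a true code point
model, and the function name compress_code_points is hypothetical.

```python
import struct
import zlib


def compress_code_points(data: bytes, encoding: str) -> bytes:
    """Decode `data` to code points, then compress the code point stream.

    The compressor never sees the original byte encoding, only the
    sequence of code point numbers.
    """
    text = data.decode(encoding)
    code_points = [ord(ch) for ch in text]
    # Canonical serialization: one little-endian 32-bit integer per code point.
    canonical = struct.pack(f"<{len(code_points)}I", *code_points)
    return zlib.compress(canonical, 9)


# Illustrative sample (hypothetical, not taken from the thread).
sample = "Привет, мир! " * 50
for encoding in ("utf-8", "utf-16", "utf-32"):
    raw = sample.encode(encoding)
    packed = compress_code_points(raw, encoding)
    print(encoding, len(raw), "bytes raw,", len(packed), "bytes compressed")
```

The raw sizes differ by encoding, while the compressed output is
identical in all three cases, because each input decodes to the same
code point sequence.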

Perhaps Unicode should take up the initiative, persuading
implementers of common compression formats to implement such
character compression methods.

Hans Aberg


This archive was generated by hypermail 2.1.5 : Mon Sep 18 2006 - 05:07:02 CDT