Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 06:36:16 CDT

Next message: Hans Aberg: "Re: Unicode & space in programming & l10n"

Previous message: Hans Aberg: "Re: Unicode & space in programming & l10n"
In reply to: Doug Ewell: "Re: Unicode & space in programming & l10n"
Next in thread: Philippe Verdy: "Re: Unicode & space in programming & l10n"

On 21 Sep 2006, at 07:01, Doug Ewell wrote:

> Different compression methods work in different ways. Certainly, a
> compression method that is specifically designed for Unicode text
> can take advantage of the unique properties of Unicode text, as
> compared to, say, photographic images.

I guess that is the simple point: the more structure one can
recognize, the better a compression method can do.

> I've often suspected that a Huffman or arithmetic encoder that
> encoded Unicode code points directly would perform better than a
> byte-based one working with UTF-8 code units. I haven't done the
> math to prove it, though.

And specifically, recognizing common words in natural languages is
something that can be done when working with Unicode code points, and
it is perhaps harder to do with a byte-oriented compression method.
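
As a rough illustration of the suspicion quoted above, one can compare
the total size an ideal order-0 entropy coder (the limit that Huffman
and arithmetic coders approach) would need when the symbols are code
points versus UTF-8 bytes. This is only a sketch: the sample string and
the helper name order0_bits are made up for the example, and the cost
of transmitting the frequency table, which grows with the alphabet, is
ignored.

```python
import math
from collections import Counter


def order0_bits(symbols):
    """Total bits an ideal order-0 entropy coder needs for `symbols`.

    Ignores the cost of transmitting the frequency table itself.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


# Hypothetical sample; any text with non-ASCII characters will do.
sample = "Привет, мир! Привет всем."

code_point_bits = order0_bits(sample)             # symbols: code points
utf8_bits = order0_bits(sample.encode("utf-8"))   # symbols: UTF-8 bytes

print(f"code points: {code_point_bits:.1f} bits")
print(f"UTF-8 bytes: {utf8_bits:.1f} bits")
```

For pure ASCII text the two figures coincide, since each code point is
then exactly one byte; the gap only opens up for multibyte characters.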

But it also hinges on how advanced the pattern recognition of a
byte-oriented compression method is: a character code point pattern
translates into a fixed byte pattern in any given character encoding,
so it might in principle be possible for the byte-oriented compression
method to recognize it. But it then needs to be able to recognize
multibyte patterns, not only single bytes.
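
A small sketch of that translation, assuming UTF-8 as the encoding and
using zlib (DEFLATE) as one example of a byte-oriented method that
matches repeated multibyte strings. The sample word and text are made
up for the illustration.

```python
import zlib

# Hypothetical sample: a Swedish word containing a non-ASCII letter.
word = "språk"
text = " ".join([word, "och", word, "eller", word])

encoded = text.encode("utf-8")
pattern = word.encode("utf-8")  # 5 code points become 6 bytes ("å" takes 2)

# UTF-8 is self-synchronizing, so byte-level matches of a whole encoded
# word line up with the code point matches.
cp_hits = text.count(word)
byte_hits = encoded.count(pattern)

# DEFLATE replaces repeated byte strings with back-references, so it can
# pick up the multibyte pattern without knowing anything about Unicode.
repeated = encoded * 50
compressed = zlib.compress(repeated)

print(cp_hits, byte_hits, len(pattern), len(repeated), len(compressed))
```

The byte-oriented method finds the word here only because it matches
strings of bytes; a coder that models single bytes in isolation would
see the two bytes of "å" as unrelated symbols.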

Hans Aberg

This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 06:38:05 CDT