From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 24 2003 - 13:06:23 EST
On 24/11/2003 07:52, Mark E. Shoulson wrote:
> On 11/24/03 01:26, Doug Ewell wrote:
>
>> So the question becomes: Is it legitimate for a Unicode compression
>> engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
>> another (canonically equivalent) normalization form to improve its
>> compressibility?
>>
> OK, this *is* a fascinating question. ...
...
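To make the question concrete: the two canonically equivalent normalisation forms of the same Hangul text differ considerably in length, and usually in compressed size as well. A rough Python sketch (with zlib standing in for SCSU or BOCU-1, and an arbitrary sample string):

    import unicodedata, zlib

    text = "한국어 텍스트"   # arbitrary Hangul sample
    nfc = unicodedata.normalize("NFC", text).encode("utf-8")  # precomposed syllables
    nfd = unicodedata.normalize("NFD", text).encode("utf-8")  # conjoining jamo
    print(len(nfc), len(nfd))   # NFD is much longer in UTF-8
    print(len(zlib.compress(nfc)), len(zlib.compress(nfd)))

So there is a real gain on offer; the question is whether a compressor is entitled to take it.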
It seems to me that there is some kind of mixing of levels here. At one
level, we have a text which consists of a string of Unicode characters,
and it is this string to which normalisation or denormalisation (in
fact, any transformation preserving canonical equivalence) can be
applied at will. At a lower level, we have a sequence of code units in a
Unicode encoding form. And at a still lower level we have a sequence of
bytes which, at this level, have no known interpretation. It is surely
at this lowest level that lossless compression should operate. Now such
a compression scheme may receive information from a higher level that
the byte stream represents Unicode text in a particular encoding form,
and may make use of that information as a hint. But it should take this
as nothing more than a hint, not necessarily reliable, and must preserve
the byte stream exactly through compression and decompression.
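The distinction can be seen in a few lines. A compressor which took the encoding hint as licence to renormalise would produce output that is canonically equivalent to its input but not byte-identical, and so would not be lossless at the byte level. A minimal Python sketch of such a hypothetical scheme (zlib again standing in for the real codec):

    import unicodedata, zlib

    def renormalising_compress(data: bytes) -> bytes:
        # Hypothetical: treats the hint "this is UTF-8" as permission
        # to renormalise the text before compressing it.
        text = unicodedata.normalize("NFC", data.decode("utf-8"))
        return zlib.compress(text.encode("utf-8"))

    nfd = unicodedata.normalize("NFD", "é").encode("utf-8")   # b'e\xcc\x81'
    out = zlib.decompress(renormalising_compress(nfd))
    print(out == nfd)   # False: equivalent text, but the bytes differ

A scheme which wanted the modelling benefit of normalisation could still be lossless, but only by also recording whatever is needed to reconstruct the original byte sequence exactly.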
If conformance clause C10 is taken to be operable at all levels, this
makes a nonsense of the concept of normalisation stability within
databases etc. If a low-level process is permitted to make any
canonically equivalent transformation, then there can be no guarantee
that data stored in a particular normalisation form is retrievable in
that same form, because a low-level compression or other process may
have transformed the data on the disk or tape, or on its way to or from
it.
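For example, a store which indexes its keys by their exact bytes silently loses lookups if anything in the storage path recomposes the data (Python sketch; the dictionary merely stands in for a database index):

    import unicodedata

    stored = unicodedata.normalize("NFD", "café").encode("utf-8")
    index = {stored: "row 42"}

    # Suppose a low-level process recomposed the bytes in transit:
    retrieved = unicodedata.normalize("NFC",
                                      stored.decode("utf-8")).encode("utf-8")

    print(stored == retrieved)   # False: byte identity lost
    print(retrieved in index)    # False: the lookup now misses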
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/