From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 24 2003 - 13:06:23 EST
On 24/11/2003 07:52, Mark E. Shoulson wrote:
> On 11/24/03 01:26, Doug Ewell wrote:
>
>> So the question becomes: Is it legitimate for a Unicode compression
>> engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
>> another (canonically equivalent) normalization form to improve its
>> compressibility?
>>
> OK, this *is* a fascinating question. ...
...
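To make the question concrete: the two canonically equivalent normalisation forms of the same Hangul text differ considerably in length, and usually in compressed size as well. A rough Python sketch (with zlib standing in for SCSU or BOCU-1, and an arbitrary sample string):

    import unicodedata, zlib

    text = "한국어 텍스트"   # arbitrary Hangul sample
    nfc = unicodedata.normalize("NFC", text).encode("utf-8")  # precomposed syllables
    nfd = unicodedata.normalize("NFD", text).encode("utf-8")  # conjoining jamo
    print(len(nfc), len(nfd))   # NFD is much longer in UTF-8
    print(len(zlib.compress(nfc)), len(zlib.compress(nfd)))

So there is a real gain on offer; the question is whether a compressor is entitled to take it.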
It seems to me that there is some kind of mixing of levels here. At one
level, we have a text which consists of a string of Unicode characters,
and it is this string to which normalisation or denormalisation (in
fact, any transformation preserving canonical equivalence) can be
applied at will. At a lower level, we have a sequence of code units in a
Unicode encoding form. And at a still lower level we have a sequence of
bytes which, at this level, have no known interpretation. It is surely
at this lowest level that lossless compression should operate. Now such
a compression scheme may receive information from a higher level that
the byte stream represents Unicode text in a particular encoding form,
and may make use of that information as a hint. But it should take this
as nothing more than a hint, not necessarily reliable, and must preserve
the byte stream exactly through compression and decompression.
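The distinction can be seen in a few lines. A compressor which took the encoding hint as licence to renormalise would produce output that is canonically equivalent to its input but not byte-identical, and so would not be lossless at the byte level. A minimal Python sketch of such a hypothetical scheme (zlib again standing in for the real codec):

    import unicodedata, zlib

    def renormalising_compress(data: bytes) -> bytes:
        # Hypothetical: treats the hint "this is UTF-8" as permission
        # to renormalise the text before compressing it.
        text = unicodedata.normalize("NFC", data.decode("utf-8"))
        return zlib.compress(text.encode("utf-8"))

    nfd = unicodedata.normalize("NFD", "é").encode("utf-8")   # b'e\xcc\x81'
    out = zlib.decompress(renormalising_compress(nfd))
    print(out == nfd)   # False: equivalent text, but the bytes differ

A scheme which wanted the modelling benefit of normalisation could still be lossless, but only by also recording whatever is needed to reconstruct the original byte sequence exactly.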
If conformance clause C10 is taken to be operable at all levels, this
makes a nonsense of the concept of normalisation stability within
databases etc. If a low-level process is permitted to make any
canonically equivalent transformation, then there can be no guarantee
that data stored in a particular normalisation form is retrievable in
that same form, because a low-level compression or other process may
have transformed the data on the disk or tape, or on its way to or from
it.
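For example, a store which indexes its keys by their exact bytes silently loses lookups if anything in the storage path recomposes the data (Python sketch; the dictionary merely stands in for a database index):

    import unicodedata

    stored = unicodedata.normalize("NFD", "café").encode("utf-8")
    index = {stored: "row 42"}

    # Suppose a low-level process recomposed the bytes in transit:
    retrieved = unicodedata.normalize("NFC",
                                      stored.decode("utf-8")).encode("utf-8")

    print(stored == retrieved)   # False: byte identity lost
    print(retrieved in index)    # False: the lookup now misses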
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/