RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 25 2003 - 17:35:15 EST

Next message: Philippe Verdy: "RE: Compression through normalization"

Previous message: Philippe Verdy: "RE: numeric properties of Nl characters in the UCD"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"
Reply: Doug Ewell: "Re: Compression through normalization"

Doug Ewell writes:
> * Philippe Verdy and Jill Ramonsky say YES, a compressor can
> normalize, because it knows it is operating on Unicode character data
> and can take advantage of Unicode properties.

I say YES only for compressors that are meant to work on Unicode text
(this applies to BOCU-1 and SCSU, which are not intended to compress anything
other than Unicode text), but of course NO for general-purpose compressors
(like deflate in zip files).
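
To illustrate the distinction, here is a minimal sketch in Python, using only
the standard library. The function names are hypothetical, NFC is an arbitrary
choice of normalization form, and zlib merely stands in for deflate; this is
not how SCSU or BOCU-1 themselves work.

```python
import unicodedata
import zlib


def compress_unicode_text(text: str) -> bytes:
    """Hypothetical Unicode-aware compressor: normalize to NFC, then deflate.

    The decoded result is canonically equivalent to the input, but it is
    not necessarily the same sequence of code points.
    """
    normalized = unicodedata.normalize("NFC", text)
    return zlib.compress(normalized.encode("utf-8"))


def decompress_unicode_text(blob: bytes) -> str:
    """Inverse of compress_unicode_text, up to canonical equivalence."""
    return zlib.decompress(blob).decode("utf-8")


def compress_bytes(data: bytes) -> bytes:
    """General-purpose compressor: it knows nothing about the content,
    so it must give back exactly the bytes it was given."""
    return zlib.compress(data)


# "ete" with two combining acute accents (decomposed form).
decomposed = "e\u0301te\u0301"

# Unicode-aware path: the text comes back precomposed.
round_tripped = decompress_unicode_text(compress_unicode_text(decomposed))

# General-purpose path: the bytes come back unchanged.
raw = decomposed.encode("utf-8")
raw_round_tripped = zlib.decompress(compress_bytes(raw))
```

In this sketch the first path changes the code points while keeping canonical
equivalence; the second path is byte-for-byte lossless.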

I will say NO for encoding forms, which are normally built to be directly
parsable code point by code point, in either direction and from random
locations in strings. So a UTF encoding scheme is not supposed to change the
normalization form.
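
As a rough illustration of that property, the following Python sketch uses
UTF-8 as the example (the helper name is hypothetical, and the input is
assumed to be well-formed UTF-8). It resynchronizes from an arbitrary byte
offset and checks that encoding and decoding leave the code points, and
therefore the normalization form, untouched.

```python
def code_point_start(data: bytes, offset: int) -> int:
    """Back up from an arbitrary byte offset to the first byte of the
    UTF-8 sequence that contains it (assumes well-formed UTF-8)."""
    # Continuation bytes all have the bit pattern 10xxxxxx.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


# A decomposed e-acute, a precomposed e-acute, and a supplementary character.
text = "e\u0301 \u00e9 \U0001D11E"
data = text.encode("utf-8")

# Offset 9 falls in the middle of the 4-byte sequence that starts at offset 7.
start_of_clef = code_point_start(data, 9)

# Decoding returns the original code points: neither form of e-acute
# has been converted into the other.
decoded = data.decode("utf-8")
```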

> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
> compressor really knows about is the byte stream, so it must be
> preserved byte-for-byte.

But SCSU and BOCU-1 do not operate at the byte-stream level: they are not
valid on random streams of bytes, and are defined only in terms of streams of
code units... That's why I would not call SCSU and BOCU-1 true compressors,
but rather encoding schemes (CES in the ISO 10646 terminology).

In fact, the output of the BOCU-1 and SCSU encoding schemes can be a file
with its own charset (i.e. CCS+CES in the ISO terminology), which can
therefore carry its own label for MIME usage or in XML charset declarations.
This is not a limitation, as true compressors can still be applied on top of
this encoding scheme if needed, or transparently within transport layers (such
as the "Content-Transfer-Encoding:" in MIME and HTTP applications).
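
The layering can be sketched as follows in Python. This is only an
assumption-laden illustration: the standard library has no SCSU or BOCU-1
codec, so UTF-8 stands in for the charset layer, and gzip stands in for
whatever general-purpose compressor the transport might apply.

```python
import gzip

text = "Compression through normalization: e\u0301te\u0301 " * 20

# Layer 1: the charset (CCS+CES) turns characters into bytes.
# UTF-8 is a stand-in here for an encoding scheme such as SCSU or BOCU-1.
encoded = text.encode("utf-8")

# Layer 2: a true compressor works on those bytes without knowing
# what they mean, so it must restore them exactly.
transported = gzip.compress(encoded)
received = gzip.decompress(transported)

# The receiver undoes the layers in reverse order.
recovered_text = received.decode("utf-8")
```

The two layers are independent: the outer one never needs to know which
charset label the inner one carries.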

> * I'm still not sure, but I'm leaning toward NO.



This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 18:31:44 EST