Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 25 2003 - 18:13:34 EST

Next message: Doug Ewell: "Re: What is a process?"

Previous message: Philippe Verdy: "RE: Compression through normalization"
In reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Philippe Verdy: "RE: Compression through normalization"
Reply: Philippe Verdy: "RE: Compression through normalization"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I say YES only for compressors that are supposed to work on Unicode
> text (this applies to BOCU-1 and SCSU which are not intended to
> compress anything else than Unicode text), but NO of course for
> general purpose compressors (like deflate in zip files.)

Of course.

> I will say NO for encoding forms that are normally built to be
> directly parsable code point by code point in any direction and from
> random locations in strings. So a UTF encoding scheme is not supposed
> to change the normalization form.

Of course not. Or so I would imagine, anyway. After all, if a process
(see Peter Kirk's question) that compresses Unicode text can silently
change the normalization form, then why not a process that stores and
retrieves Unicode text using, say, UTF-8? But that sounds wrong to me,
although it's what C10 says.

>> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
>> compressor really knows about is the byte stream, so it must be
>> preserved byte-for-byte.
>
> But SCSU and BOCU-1 do not operate in the byte stream level, as their
> use is invalid on random streams of bytes, but only defined in terms
> of streams of code units...

That's right. I tend to agree with the NO camp not because SCSU and
BOCU-1 are going to be applied to arbitrary binary data, but because the
*format* in which text is stored isn't normally expected to change the
contents.

Converting Unicode text from UTF-16LE to UTF-16BE, or UTF-16 to UTF-8,
changes the bits. Everyone can see that. But the *code points*
represented by those bits are not changed. If the UTF-16BE sequence <00
61 03 01> (U+0061 U+0301) were converted to the UTF-8 sequence <C3 A1>
(U+00E1), that would be a change not only in the bits, but in the code
points as well. This is where the question lies.
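
To make the example concrete, here is a minimal Python sketch. It assumes
the standard unicodedata module as a stand-in for whatever normalizer a
process might apply, and the variable names are mine, for illustration
only. Plain transcoding of <00 61 03 01> to UTF-8 gives <61 CC 81>, which
decodes back to the same two characters; <C3 A1> appears only when an NFC
step is added on top of the transcoding.

```python
import unicodedata

# The UTF-16BE bytes from the example: <00 61 03 01>
utf16be = bytes([0x00, 0x61, 0x03, 0x01])
text = utf16be.decode("utf-16-be")

# The decoded characters: U+0061 (a) followed by U+0301 (combining acute)
code_points = [f"U+{ord(c):04X}" for c in text]

# Plain transcoding to UTF-8: the bytes change, the characters do not
utf8_plain = text.encode("utf-8")

# Transcoding plus a silent NFC normalization: the characters change too
nfc_text = unicodedata.normalize("NFC", text)
utf8_nfc = nfc_text.encode("utf-8")

print(code_points)                      # ['U+0061', 'U+0301']
print(utf8_plain.hex(" "))              # 61 cc 81
print(utf8_nfc.hex(" "))                # c3 a1
print(utf8_plain.decode("utf-8") == text)   # True: round trip is exact
print(utf8_nfc.decode("utf-8") == text)     # False: canonically equivalent only
```

The last two lines are the whole question in miniature: both results are
canonically equivalent to the input, but only the first gives back the
original sequence when decoded.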

> That's why I won't say that SCSU and BOCU-1 are really compressors,
> but rather really encoding schemes (CES in the ISO10646 terminology).

They are transfer encoding syntaxes (TES). And I believe this
terminology is from Unicode, not 10646, though I could be wrong.

I would say encoders for SCSU and BOCU-1 are compressors. They're just
not general-purpose compressors.

> In fact the result of BOCU-1 and SCSU encoding schemes can create a
> file which has its own charset (i.e. CCS+CES in the ISO terminology),
> and thus can also have its own label for MIME usage or in XML charset
> declarations. This is not a limitation, as true compressors can still
> be used if needed from this encoding scheme, or transparently within
> transport layers (such as the "Content-Transfer-Encoding:" in MIME and
> HTTP applications).

Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
GP compression scheme. Atkin and Stansifer's paper from last year is
all about that, and I spend a few pages on it in my paper as well. You
can also re-Zip a Zip file, though, so I don't know what that proves
about the compression formats.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 19:01:15 EST