From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 25 2003 - 18:13:34 EST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> I say YES only for compressors that are supposed to work on Unicode
> text (this applies to BOCU-1 and SCSU, which are not intended to
> compress anything other than Unicode text), but NO of course for
> general-purpose compressors (like deflate in zip files).
Of course.
> I will say NO for encoding forms that are normally built to be
> directly parsable code point by code point in any direction and from
> random locations in strings. So a UTF encoding scheme is not supposed
> to change the normalization form.
Of course not. Or so I would imagine, anyway. After all, if a process
(see Peter Kirk's question) that compresses Unicode text can silently
change the normalization form, then why not a process that stores and
retrieves Unicode text using, say, UTF-8? But that sounds wrong to me,
although it's what C10 says.
>> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
>> compressor really knows about is the byte stream, so it must be
>> preserved byte-for-byte.
>
> But SCSU and BOCU-1 do not operate at the byte-stream level, as their
> use is invalid on random streams of bytes; they are only defined in
> terms of streams of code units...
That's right. I tend to agree with the NO camp not because SCSU and
BOCU-1 are going to be applied to arbitrary binary data, but because the
*format* in which text is stored isn't normally expected to change the
contents.
Converting Unicode text from UTF-16LE to UTF-16BE, or UTF-16 to UTF-8,
changes the bits. Everyone can see that. But the *code points*
represented by those bits are not changed. If the UTF-16BE sequence <00
61 03 01> (U+0061 U+0301) were converted to the UTF-8 sequence <C3 A1>
(U+00E1), that would be a change not only in the bits, but in the code
points as well. This is where the question lies.
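
To make the distinction concrete, here is a minimal Python sketch (my own illustration, using only the standard library's codecs and `unicodedata` module; the variable names are arbitrary). It shows that a plain UTF-16BE to UTF-8 conversion keeps U+0061 U+0301 intact, and that <C3 A1> only appears if a normalization step (NFC here) is applied as well.

```python
import unicodedata

# The UTF-16BE byte sequence <00 61 03 01> from the example above.
utf16be = bytes([0x00, 0x61, 0x03, 0x01])

# Decoding recovers the code points; no normalization is involved.
text = utf16be.decode("utf-16-be")
code_points = [f"U+{ord(c):04X}" for c in text]

# A pure change of encoding form: the bits change, the code points do not.
utf8_plain = text.encode("utf-8")

# Only an explicit normalization step composes the pair into U+00E1.
nfc = unicodedata.normalize("NFC", text)
utf8_nfc = nfc.encode("utf-8")

print(code_points)              # code points carried by the UTF-16BE bytes
print(utf8_plain.hex(" "))      # UTF-8 bytes without normalization
print(utf8_nfc.hex(" "))        # UTF-8 bytes after NFC
```

Decoding `utf8_plain` gives back exactly the original two code points, whereas decoding `utf8_nfc` gives a single, canonically equivalent code point.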
> That's why I won't say that SCSU and BOCU-1 are really compressors,
> but rather encoding schemes (CES in the ISO 10646 terminology).
They are transfer encoding syntaxes (TES). And I believe this
terminology is from Unicode, not 10646, though I could be wrong.
I would say encoders for SCSU and BOCU-1 are compressors. They're just
not general-purpose compressors.
> In fact the result of the BOCU-1 and SCSU encoding schemes can be a
> file which has its own charset (i.e. CCS+CES in the ISO terminology),
> and thus can also have its own label for MIME usage or in XML charset
> declarations. This is not a limitation, as true compressors can still
> be applied on top of this encoding scheme if needed, or transparently
> within transport layers (such as the "Content-Transfer-Encoding:" in
> MIME and HTTP applications).
Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
general-purpose (GP) compression scheme. Atkin and Stansifer's paper from last year is
all about that, and I spend a few pages on it in my paper as well. You
can also re-Zip a Zip file, though, so I don't know what that proves
about the compression formats.
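
The layering is easy to sketch in Python. Note the assumptions: the standard library has no SCSU or BOCU-1 codec, so UTF-8 stands in for the first stage, and `zlib` (deflate) plays the GP compressor; the sample string is made up. The only point illustrated is that stacked stages round-trip byte-for-byte.

```python
import zlib

# Made-up sample text; UTF-8 is a stand-in for an SCSU or BOCU-1 stage.
text = "Unicode text, possibly already SCSU- or BOCU-1-encoded. " * 50
raw = text.encode("utf-8")

# Apply a general-purpose compressor, then apply it again to its own output.
once = zlib.compress(raw)
twice = zlib.compress(once)

# Undo the layers in reverse order; the original bytes come back unchanged.
restored = zlib.decompress(zlib.decompress(twice))

print(len(raw), len(once), len(twice))
print(restored == raw)
```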
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/