From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 25 2003 - 17:35:15 EST
Doug Ewell writes:
> * Philippe Verdy and Jill Ramonsky say YES, a compressor can
> normalize, because it knows it is operating on Unicode character data
> and can take advantage of Unicode properties.
I say YES only for compressors that are supposed to work on Unicode text
(this applies to BOCU-1 and SCSU, which are not intended to compress anything
other than Unicode text), but NO of course for general-purpose compressors
(like deflate in zip files).
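As a rough illustration of why the two cases differ, here is a minimal Python sketch. It is only an assumption-laden example: zlib stands in for the general-purpose compressor, the sample string and the helper name deflate_roundtrip are hypothetical, and nothing here implements SCSU or BOCU-1. It shows that two canonically equivalent strings have different byte sequences, and that a byte-oriented compressor has to give back exactly the bytes it was handed.

```python
import unicodedata
import zlib


def deflate_roundtrip(data: bytes) -> bytes:
    """Compress then decompress with zlib (deflate); the bytes must come back unchanged."""
    return zlib.decompress(zlib.compress(data))


# Hypothetical sample: "é" as a single precomposed character (NFC form) ...
composed = "caf\u00e9"
# ... and as "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
decomposed = unicodedata.normalize("NFD", composed)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# The two strings are canonically equivalent as Unicode text ...
same_text = unicodedata.normalize("NFC", decomposed) == composed
# ... but they are different byte streams, which is all deflate can see.
same_bytes = composed_bytes == decomposed_bytes

# A general-purpose compressor must restore the input byte for byte.
restored = deflate_roundtrip(decomposed_bytes)
```

A compressor that only ever sees bytes cannot know that the two inputs mean the same text, so it has no basis for swapping one for the other.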
I will say NO for encoding forms, which are normally built to be directly
parsable code point by code point, in either direction and from random
locations in strings. So a UTF encoding scheme is not supposed to change the
normalization form.
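A small Python sketch of that expectation, using the standard codecs (the helper name utf_roundtrip and the sample string are hypothetical): encoding to a UTF and decoding again should return exactly the same sequence of code points, so a decomposed string stays decomposed.

```python
import unicodedata


def utf_roundtrip(text: str, encoding: str) -> str:
    """Encode text with the given UTF codec and decode it back."""
    return text.encode(encoding).decode(encoding)


# Hypothetical sample in NFD: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

# Collect the code points seen after a round trip through each UTF.
results = {
    enc: [ord(ch) for ch in utf_roundtrip(decomposed, enc)]
    for enc in ("utf-8", "utf-16-le", "utf-32-be")
}

# The original code point sequence, for comparison.
original = [ord(ch) for ch in decomposed]
```

Every entry in results should equal original, combining accent included: the UTF neither composes nor decomposes anything.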
> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
> compressor really knows about is the byte stream, so it must be
> preserved byte-for-byte.
But SCSU and BOCU-1 do not operate at the byte-stream level: their use is
invalid on arbitrary streams of bytes, and they are defined only in terms of
streams of code units... That's why I would not call SCSU and BOCU-1
compressors in the strict sense, but rather encoding schemes (CES in the
ISO 10646 terminology).
In fact, the output of the BOCU-1 and SCSU encoding schemes is a file that
has its own charset (i.e. CCS+CES in the ISO terminology), and that can
therefore have its own label for MIME usage or in XML encoding declarations.
This is not a limitation, as true compressors can still be applied on top of
this encoding scheme if needed, or transparently within transport layers
(such as "Content-Transfer-Encoding:" in MIME, or "Content-Encoding:" in HTTP
applications).
> * I'm still not sure, but I'm leaning toward NO.
This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 18:31:44 EST