RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 25 2003 - 18:10:53 EST

Next message: Doug Ewell: "Re: Compression through normalization"

Previous message: Philippe Verdy: "RE: Compression through normalization"
In reply to: Mark Davis: "Re: Compression through normalization"
Next in thread: Philippe Verdy: "RE: Compression through normalization"

Mark Davis writes:
> I would say that a compressor can normalize, if (a) when decompressing it
> produces NFC, and (b) it advertises that it normalizes.

Why condition (a)? NFD could be used as well, or even another
normalization in which combining characters are sorted differently, or partly
recomposed, or even recomposed while ignoring the composition exclusions, as
long as the result is a canonical equivalent.
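
To make "canonical equivalent" concrete, here is a minimal Python sketch
using the standard unicodedata module. The helper name
canonically_equivalent and the sample strings are illustrative assumptions,
not part of any standard API; the test simply compares NFD forms.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Treat two strings as canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed vs. decomposed "e with acute": different code points.
composed = "\u00e9"
decomposed = "e\u0301"

# Same base letter, combining marks stored in a different order
# (U+0323 has combining class 220, U+0301 has combining class 230).
marks_a = "a\u0323\u0301"
marks_b = "a\u0301\u0323"

# U+0958 is listed in the composition exclusions: NFC never produces it,
# yet it is canonically equivalent to the sequence U+0915 U+093C.
excluded = "\u0958"
excluded_nfc = unicodedata.normalize("NFC", excluded)

print(composed == decomposed)                          # binary comparison
print(canonically_equivalent(composed, decomposed))    # canonical comparison
print(canonically_equivalent(marks_a, marks_b))
print(canonically_equivalent(excluded, excluded_nfc))
```

None of these pairs is binary-identical, and U+0958 on its own is in no
standard normalization form, but each pair denotes the same text.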

Whatever the compressor produces, there's no way to specify the
normalization form in the result: there's no standard to indicate it in the
output stream.

The relevant standard is the use of a MIME or IANA charset, which just
specifies a pair consisting of a CCS (coded character set; for us, the
code points assigned in Unicode/ISO/IEC 10646) and a CES (character
encoding scheme). There is no standard convention for advertising the
normalization form.

This implies that no transport protocol can assume any normalization form
of Unicode text, even when the encoding is specified as UTF-*, UCS*, BOCU*
or SCSU. Normalization becomes a normal step in all interchanges, including
for compression purposes. Unicode already says that all normalization forms
are canonically equivalent and must be treated equally.

I see no justification for accepting some VALID Unicode text and rejecting
some other VALID text when both texts are canonically equivalent. The
interaction of C9 and C10 implies that any process that claims to respect
canonical equivalence must normalize its input, or be SURE that the input
is already normalized the way it expects. There's no other way to be SURE
of that if the two processes are not part of the same local system and do
not share the same normalization library in their implementation at ALL
times.
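
As a rough sketch of "normalize the input rather than trust it", the Python
fragment below normalizes every string at the process boundary. The names
EXPECTED_FORM, receive, table and lookup are hypothetical, and NFC is only
an assumed choice; any form would do as long as it is applied consistently.

```python
import unicodedata

EXPECTED_FORM = "NFC"  # assumption: the form this receiver works in internally

def receive(text: str) -> str:
    """Normalize incoming text instead of rejecting unnormalized input."""
    return unicodedata.normalize(EXPECTED_FORM, text)

# Hypothetical lookup table whose keys are stored in EXPECTED_FORM.
table = {receive("caf\u00e9"): "entry-1"}

def lookup(key: str):
    """Find an entry regardless of which equivalent form the sender used."""
    return table.get(receive(key))

print(lookup("caf\u00e9"))        # sender used the precomposed letter
print(lookup("cafe\u0301"))       # sender used a combining acute
print(table.get("cafe\u0301"))    # raw key, no normalization at the boundary
```

The last call shows what happens when the boundary step is skipped: a
canonically equivalent key misses an entry that is actually present.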

If there's a delay between those two processes and the system is upgraded,
you'll experience problems, unless the intermediate results from the first
process are renormalized with the newer implementation before any use of
the second process. If the intermediate result is, for example, an RDBMS
database, the database needs to be checked and cleaned up with the new
normalization to allow correct access to tables through binary sorted
indices with the upgraded RDBMS engine. In practice, this means rebuilding
the indices, unless the database also stores somewhere which normalization
form is used in its indices, and the engine performs the necessary
normalization on the fly to match storage requirements...
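
The bookkeeping described above could look roughly like the following
Python sketch. It is not how any particular RDBMS works: index_metadata,
needs_rebuild and index_key are hypothetical helpers, and treating every
change of the Unicode data version as a reason to rebuild is a deliberately
conservative assumption.

```python
import unicodedata

def index_metadata(form: str = "NFC") -> dict:
    """Record which normalization form and Unicode data built the index."""
    return {"form": form, "unicode_version": unicodedata.unidata_version}

def needs_rebuild(meta: dict, form: str = "NFC") -> bool:
    """Conservative check: rebuild if the form or the Unicode data changed."""
    return (meta["form"] != form
            or meta["unicode_version"] != unicodedata.unidata_version)

def index_key(text: str, meta: dict) -> bytes:
    """Normalize on the fly to the form the index stores, then encode."""
    return unicodedata.normalize(meta["form"], text).encode("utf-8")

meta = index_metadata("NFD")
print(needs_rebuild(meta, "NFD"))   # same form, same library version
print(needs_rebuild(meta, "NFC"))   # engine now expects another form
print(index_key("\u00e9", meta) == index_key("e\u0301", meta))
```

Storing the metadata next to the index lets the engine either normalize
keys on the fly, as in index_key, or detect that a rebuild is due.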

For me, a process that accepts some text but not some other canonically
equivalent text is NOT conforming to the claim that it respects canonical
equivalence, and so it is only a partial implementation of Unicode.



This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 18:53:54 EST