From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 01 2003 - 07:25:39 EST
jon@hackcraft.net wrote:
> Further, a Unicode-aware algorithm would expect a choseong character to
> be followed by a jungseong and a jongseong to follow a jungseong, and
> could deliver essentially the same compression benefits that
> normalising to NFC provides, but without making an irreversible change
> (i.e. it could tokenise the jamo sequences rather than normalising and
> then tokenising).
Isn't that equivalent to what bzip2 does, only without any knowledge of
the Unicode composition rules: simply by discovering that the jamos are
structured within their syllables, and by creating codes on the fly to
represent their compositions?
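
To make that concrete: the only "Unicode knowledge" bzip2 lacks here is
the standard arithmetic mapping between conjoining jamo and precomposed
syllables. A minimal sketch in Python (the constants are the standard
Hangul composition values; the function names are mine):

# Reversible jamo <-> syllable mapping a Unicode-aware compressor
# could exploit directly, instead of rediscovering it statistically.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(l, v, t=None):
    """Choseong l + jungseong v (+ optional jongseong t) -> one code point."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

def decompose_syllable(s):
    """The exact inverse: recover the jamo sequence losslessly."""
    s_index = ord(s) - S_BASE
    l = chr(L_BASE + s_index // (V_COUNT * T_COUNT))
    v = chr(V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT)
    t_index = s_index % T_COUNT
    return (l, v) if t_index == 0 else (l, v, chr(T_BASE + t_index))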
A 2% difference can be explained by the fact that bzip2 must still
discover the new "clusters": it has to encode them first in their
decomposed form before it can use shorter codes for the composed forms
in the rest of the text.
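
The figure is easy to check for any given file; a rough sketch, assuming
a UTF-8 Korean sample in the placeholder file "sample.txt":

import bz2
import unicodedata

# Compare bzip2 sizes of the decomposed (conjoining jamo) and
# precomposed (syllable) forms of the same text.
text = open("sample.txt", encoding="utf-8").read()
nfd = unicodedata.normalize("NFD", text).encode("utf-8")
nfc = unicodedata.normalize("NFC", text).encode("utf-8")
print("NFD compressed:", len(bz2.compress(nfd)), "bytes")
print("NFC compressed:", len(bz2.compress(nfc)), "bytes")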
> > Whether a "silent" normalization to NFC can be a legitimate part of
> > Unicode compression remains in question. I notice the list is still
> > split as to whether this process "changes" the text (because checksums
> > will differ) or not (because C10 says processes must consider the text
> > to be equivalent).
And what about a compressor that would identify the source as Unicode
and convert it first to NFC, but also apply the composed forms that are
normally excluded from NFC? This seems marginal, but some languages,
such as pointed Hebrew and Arabic, would get better compression results
when these canonically equivalent compositions are taken into account.
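
Such a front-end could even derive the excluded compositions from the
character database itself. A naive sketch (a scan plus plain string
substitution, not a conforming normalizer; the names are mine):

import sys
import unicodedata

def build_excluded_compositions():
    """Map canonical decompositions to the precomposed characters that
    NFC refuses to produce (the composition exclusions)."""
    table = {}
    for cp in range(0x80, sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith('<'):
            continue                               # no canonical decomposition
        seq = unicodedata.normalize('NFD', ch)     # full canonical decomposition
        if len(seq) < 2:
            continue                               # ignore singleton mappings
        if unicodedata.normalize('NFC', seq) != ch:
            table[seq] = ch                        # composition excluded from NFC
    return table

def compose_beyond_nfc(text, table):
    """NFC first, then naively substitute the excluded compositions too."""
    text = unicodedata.normalize('NFC', text)
    for seq in sorted(table, key=len, reverse=True):
        text = text.replace(seq, table[seq])
    return text

The result is still canonically equivalent to the input, so it only
sharpens the question quoted above about whether such a substitution
"changes" the text.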