From: jon@hackcraft.net
Date: Mon Dec 01 2003 - 08:12:40 EST
Quoting Philippe Verdy <verdy_p@wanadoo.fr>:
> jon@hackcraft.net wrote:
> > Further, a Unicode-aware algorithm would expect a choseong character
> > to be followed by a jungseong, and a jongseong to follow a jungseong,
> > and could deliver essentially the same compression benefits that
> > normalising to NFC provides, but without making an irreversible change
> > (i.e. it could tokenise the jamo sequences rather than normalising and
> > then tokenising).
>
> Isn't that equivalent to what bzip2 does, but without knowledge of
> Unicode composition rules, simply by discovering that jamos are
> structured within their syllables, and creating, on the fly, code
> positions to represent their composition?
I imagine so.
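To make the earlier point concrete, here's a rough Python sketch of the
reversible tokenisation I had in mind. It uses the standard Hangul
composition arithmetic, but emits tokens in a private-use range I've
picked arbitrarily for the sketch (a real compressor would use its own
internal symbol space, and would need an escape for any source text that
already used those code points):

L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
TOKEN_BASE = 0xF0000  # hypothetical plane-15 private-use tokens

def tokenise(text):
    # Replace each conjoining L+V(+T) jamo run with a single token;
    # everything else passes through. Unlike NFC, this is reversible,
    # because the tokens never collide with precomposed syllables
    # already present in the source.
    out, i = [], 0
    while i < len(text):
        c = ord(text[i])
        if (L_BASE <= c < L_BASE + L_COUNT and i + 1 < len(text)
                and V_BASE <= ord(text[i + 1]) < V_BASE + V_COUNT):
            v = ord(text[i + 1]) - V_BASE
            t, step = 0, 2
            if (i + 2 < len(text)
                    and T_BASE < ord(text[i + 2]) < T_BASE + T_COUNT):
                t, step = ord(text[i + 2]) - T_BASE, 3
            syl = ((c - L_BASE) * V_COUNT + v) * T_COUNT + t
            out.append(chr(TOKEN_BASE + syl))
            i += step
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

def detokenise(text):
    # Exact inverse of tokenise(): expand each token back into jamo.
    out = []
    for ch in text:
        c = ord(ch)
        if TOKEN_BASE <= c < TOKEN_BASE + L_COUNT * V_COUNT * T_COUNT:
            lv, t = divmod(c - TOKEN_BASE, T_COUNT)
            l, v = divmod(lv, V_COUNT)
            out.append(chr(L_BASE + l) + chr(V_BASE + v))
            if t:
                out.append(chr(T_BASE + t))
        else:
            out.append(ch)
    return ''.join(out)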
> A 2% difference can be explained by the fact that bzip2 must still
> discover the new "clusters" by encoding them first in their decomposed
> form before using codes to represent the composed forms for the rest of
> the text.
Yes. Do we care about that 2%? Can we improve upon it?
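One way to put a number on it is to normalise the same text both ways and
compare bzip2's output sizes. A quick Python sketch of the method (the
sample string is made up, and repetitive text like this compresses
unrealistically well, so only the method carries over to real Korean
prose):

import bz2, unicodedata

sample = '\uD55C\uAD6D\uC5B4 ' * 2000  # toy stand-in for real text
nfd = unicodedata.normalize('NFD', sample).encode('utf-8')
nfc = unicodedata.normalize('NFC', sample).encode('utf-8')
print(len(bz2.compress(nfd)), len(bz2.compress(nfc)))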
> > > Whether a "silent" normalization to NFC can be a legitimate part of
> > > Unicode compression remains in question. I notice the list is still
> > > split as to whether this process "changes" the text (because checksums
> > > will differ) or not (because C10 says processes must consider the text
> > > to be equivalent).
>
> And what about a compressor that would identify the source as being
> Unicode, and would convert it first to NFC, but including composed
> forms for the compositions normally excluded from NFC? This seems
> marginal, but some languages, such as pointed Hebrew and Arabic, would
> compress better when these canonically equivalent compositions are
> taken into account.
Agreed: if we are to rely upon the equivalence of sequences, there is no
need to exclude such compositions.
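The exclusions are easy to demonstrate. U+FB31 (HEBREW LETTER BET WITH
DAGESH) is on the composition-exclusion list, so NFC leaves BET + DAGESH
decomposed even though the two forms are canonically equivalent; a quick
Python check, which also shows the checksum point from earlier in the
thread:

import hashlib, unicodedata

# BET + DAGESH is canonically equivalent to U+FB31, but U+FB31 is a
# composition exclusion, so NFC leaves the pair decomposed.
pair = '\u05D1\u05BC'
print(unicodedata.normalize('NFC', pair) == pair)       # True
print(unicodedata.normalize('NFD', '\uFB31') == pair)   # True

# A compressor trusting C10 could still substitute the excluded
# composed form internally: the sequences stay canonically
# equivalent even though naive checksums diverge.
print(hashlib.md5(pair.encode('utf-8')).hexdigest())
print(hashlib.md5('\uFB31'.encode('utf-8')).hexdigest())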