RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 01 2003 - 07:25:39 EST


    jon@hackcraft.net wrote:
    > Further, a Unicode-aware algorithm would expect a choseong character to
    > be followed by a jungseong, and a jongseong to follow a jungseong, and
    > could essentially provide the same compression benefit that
    > normalising to NFC provides, but without making an irreversible change
    > (i.e. it could tokenise the Jamo sequences rather than normalising and
    > then tokenising).

    Isn't that equivalent to what bzip2 does, only without any knowledge of
    the Unicode composition rules: it simply discovers that jamos are
    structured within their syllables, and creates code positions on the fly
    to represent their compositions?
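
    For contrast with that structure-blind approach, here is a minimal
    sketch of the Unicode-aware tokenisation jon describes. It assumes only
    the modern conjoining jamo ranges and the standard Hangul syllable
    arithmetic; the names tokenize and detokenize and the token layout are
    hypothetical, not taken from any existing compressor. Precomposed
    syllables already present in the input are left as plain characters, so
    the original code point sequence can be restored exactly.

```python
import unicodedata

# Standard constants for modern conjoining Hangul jamo.
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def tokenize(text):
    """Group each L V [T] jamo run into one token; keep everything else as-is."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        l = ord(text[i]) - L_BASE
        if 0 <= l < L_COUNT and i + 1 < n:
            v = ord(text[i + 1]) - V_BASE
            if 0 <= v < V_COUNT:
                t = 0  # 0 means "no trailing consonant"
                if i + 2 < n:
                    cand = ord(text[i + 2]) - T_BASE
                    if 1 <= cand < T_COUNT:
                        t = cand
                # Same index the precomposed syllable would have, but tagged
                # "jamo" so it is never confused with a real precomposed one.
                tokens.append(("jamo", (l * V_COUNT + v) * T_COUNT + t))
                i += 3 if t else 2
                continue
        tokens.append(("char", text[i]))
        i += 1
    return tokens


def detokenize(tokens):
    """Exact inverse of tokenize: jamo tokens expand back to jamo sequences."""
    out = []
    for kind, value in tokens:
        if kind == "char":
            out.append(value)
        else:
            lv, t = divmod(value, T_COUNT)
            l, v = divmod(lv, V_COUNT)
            out.append(chr(L_BASE + l) + chr(V_BASE + v))
            if t:
                out.append(chr(T_BASE + t))
    return "".join(out)


# Two syllables (U+D55C, U+AE00) in decomposed form become two tokens.
decomposed = unicodedata.normalize("NFD", "\uD55C\uAE00")
tokens = tokenize(decomposed)
```

    Because the token stream distinguishes jamo-derived syllables from
    precomposed ones, detokenize(tokenize(s)) == s holds even for text that
    mixes both forms, which is the reversibility that normalising first
    would give up.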

    A 2% difference can be explained by the fact that bzip2 must still
    discover the new "clusters" by encoding them first in their decomposed
    form before using codes to represent the composed forms for the rest of
    the text.
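
    One way to check this kind of difference on your own data is to compress
    the same text in both normalization forms. The sketch below uses
    Python's standard bz2 and unicodedata modules; the function name
    bzip2_sizes and the sample string are made up for illustration, and a
    short repeated phrase says nothing about the ratio on real Korean text.

```python
import bz2
import unicodedata


def bzip2_sizes(text):
    """Return {form: (utf8_bytes, bzip2_bytes)} for the NFC and NFD forms."""
    sizes = {}
    for form in ("NFC", "NFD"):
        encoded = unicodedata.normalize(form, text).encode("utf-8")
        sizes[form] = (len(encoded), len(bz2.compress(encoded)))
    return sizes


# Hypothetical sample: three Hangul syllables and a space, repeated.
sample = "\uD55C\uAD6D\uC5B4 " * 200
sizes = bzip2_sizes(sample)

for form, (raw, packed) in sizes.items():
    print(f"{form}: {raw} bytes raw, {packed} bytes after bzip2")
```

    The raw NFD size is always much larger than the NFC size for Hangul;
    the interesting number is how much of that gap survives compression.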

    > > Whether a "silent" normalization to NFC can be a legitimate part of
    > > Unicode compression remains in question. I notice the list is still
    > > split as to whether this process "changes" the text (because checksums
    > > will differ) or not (because C10 says processes must consider the text
    > > to be equivalent).

    And what about a compressor that would identify the source as being
    Unicode, and would convert it first to NFC, but including composed forms
    for the compositions normally excluded from NFC? This seems marginal but
    some languages would have better compression results when taking these
    canonically equivalent compositions into account, such as pointed Hebrew
    and Arabic.
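
    As a rough sketch of that idea for pointed Hebrew: normalize to NFC,
    then also apply a few pairings that NFC refuses to form because they
    are on the composition exclusion list. The table EXTRA and the function
    compose_with_exclusions are hypothetical and cover only three Hebrew
    presentation forms; a real compressor would derive the full table from
    the Unicode data files. The result is no longer NFC, but it stays
    canonically equivalent to the input.

```python
import unicodedata

# Hypothetical, deliberately tiny table of excluded compositions:
# base letter + point -> precomposed presentation form.
EXTRA = {
    "\u05E9\u05C1": "\uFB2A",  # shin + shin dot
    "\u05E9\u05C2": "\uFB2B",  # shin + sin dot
    "\u05D1\u05BC": "\uFB31",  # bet + dagesh
}


def compose_with_exclusions(text):
    """NFC, then greedily replace adjacent pairs listed in EXTRA."""
    nfc = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(nfc):
        composed = EXTRA.get(nfc[i:i + 2])
        if composed is not None:
            out.append(composed)
            i += 2
        else:
            out.append(nfc[i])
            i += 1
    return "".join(out)


# Shin with shin dot, followed by lamed, vav, final mem.
word = "\u05E9\u05C1\u05DC\u05D5\u05DD"
packed = compose_with_exclusions(word)
```

    Here packed is one code point shorter than word, and
    unicodedata.normalize("NFD", packed) equals
    unicodedata.normalize("NFD", word), so a decompressor that normalizes
    its output recovers the same canonical text.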






    This archive was generated by hypermail 2.1.5 : Mon Dec 01 2003 - 08:00:12 EST