RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 05 2003 - 17:01:02 EST


    Mark Davis writes:
    > Doug Ewell writes:
    > > OK. So it's Mark, not me, who is unilaterally extending C10.
    >
    > Where on earth do you get that? I did say that, in practice, NFC should be
    > produced, but that is simply a practical guideline, independent of C10.

    I also think that the NFC form is not required for the result of the
    decompression to respect clause C10. So if your intent is to create a
    compressor/decompressor that respects canonical equivalence, NFC is not
    required.
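
    As a minimal sketch of that point (the sample strings and the helper name
    are illustrative assumptions, not part of any real compressor), the check
    below compares two strings by their NFD forms. The "decompressed" output
    is not in NFC, yet it is still canonically equivalent to the input:

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hypothetical round trip: the compressor saw precomposed text, and the
# decompressor happened to emit a decomposed sequence instead.
original = "caf\u00e9"        # 'é' as U+00E9
decompressed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

same_code_points = original == decompressed
equivalent = canonically_equivalent(original, decompressed)
output_is_nfc = unicodedata.normalize("NFC", decompressed) == decompressed
```

    Here the code points differ and the output is not NFC, but the
    equivalence test still passes, which is all such a round trip needs to
    preserve.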

    Of course clause C10 cannot be fully respected for charset mappings;
    non-Unicode Korean charsets are one example where canonical equivalence
    cannot be guaranteed, and where in fact Unicode canonical equivalence
    is a pollution: mappings to/from a non-Unicode charset do not need to
    respect canonical equivalence when that charset has its own canonical
    equivalence rules.

    It's just a shame that what was considered equivalent in the Korean
    standards is considered canonically distinct (and even compatibility
    distinct) in Unicode. This means that the exact same abstract Korean text
    can have two distinct representations in Unicode, and there's no way to
    match these Unicode representations together. It also means that when
    mapping Korean charsets to Unicode, care must be taken, before making the
    mapping, that compound jamos are used wherever possible.
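
    To illustrate (a sketch only: the pair table below is a hypothetical,
    partial list of doubled initial consonants, and prefer_compound is an
    invented helper, not a function of any real converter), a sequence of two
    simple jamos and the corresponding compound jamo stay distinct under all
    four Unicode normalization forms, so a mapping step has to pick the
    compound form itself:

```python
import unicodedata

# Hypothetical, partial table: two identical simple initial consonants
# mapped to the single compound (doubled) initial consonant.
COMPOUND_JAMOS = {
    "\u1100\u1100": "\u1101",  # KIYEOK + KIYEOK -> SSANGKIYEOK
    "\u1103\u1103": "\u1104",  # TIKEUT + TIKEUT -> SSANGTIKEUT
    "\u1107\u1107": "\u1108",  # PIEUP  + PIEUP  -> SSANGPIEUP
    "\u1109\u1109": "\u110a",  # SIOS   + SIOS   -> SSANGSIOS
    "\u110c\u110c": "\u110d",  # CIEUC  + CIEUC  -> SSANGCIEUC
}


def prefer_compound(text: str) -> str:
    """Replace each listed jamo pair with its compound jamo."""
    for pair, compound in COMPOUND_JAMOS.items():
        text = text.replace(pair, compound)
    return text


def distinct_in_all_forms(a: str, b: str) -> bool:
    """True if no Unicode normalization form makes the two strings equal."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(
        unicodedata.normalize(f, a) != unicodedata.normalize(f, b)
        for f in forms
    )


two_simple = "\u1100\u1100\u1161"   # KIYEOK, KIYEOK, vowel A
one_compound = "\u1101\u1161"       # SSANGKIYEOK, vowel A
unmatched = distinct_in_all_forms(two_simple, one_compound)
mapped = prefer_compound(two_simple)
```

    After prefer_compound, the two spellings coincide; before it, neither
    canonical nor compatibility normalization brings them together.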

    If the text is now stored and handled entirely in Unicode, without
    returning to the KSC standard, you won't have any tool other than the UCA
    to collate strings (but collation does not produce strings, just collation
    weights, and there's currently no tool to reverse a list of weights back
    to a Unicode string)...

    ... unless the table of UCA collation weights is built as if it was a
    bidirectional mapping to a legacy charset, which would then become
    reversible and usable to perform various Unicode algorithms including case
    folding, or many other similar foldings defined in UTR...
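
    A toy sketch of the difference (the weight table is entirely made up and
    has nothing to do with the real UCA tables): a lossy key, like an
    ordinary collation key, cannot be turned back into the string, while a
    one-to-one weight table behaves like a charset mapping and can be
    inverted:

```python
# Hypothetical weight table: exactly one weight per character, so the
# mapping is one-to-one and can be inverted like a charset mapping.
WEIGHTS = {"a": 1, "A": 2, "b": 3, "B": 4}
CHARS = {weight: char for char, weight in WEIGHTS.items()}


def encode(text: str) -> list[int]:
    """Map each character to its unique weight."""
    return [WEIGHTS[c] for c in text]


def decode(weights: list[int]) -> str:
    """Invert encode(); possible only because the table is one-to-one."""
    return "".join(CHARS[w] for w in weights)


def lossy_key(text: str) -> list[int]:
    """Case-insensitive key: 'a' and 'A' share a weight, so it is not
    reversible. This stands in for an ordinary collation key."""
    return [(WEIGHTS[c] + 1) // 2 for c in text]


round_trip = decode(encode("aBbA"))
collide = lossy_key("ab") == lossy_key("AB")
```

    With the one-to-one table the round trip returns the original string,
    whereas the lossy key gives "ab" and "AB" the same weights and so cannot
    say which one it came from.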

    If someone ventures to define such a collation charset and map it to
    Unicode, then he will effectively create as many charsets as there are
    collation orders tailored for a particular language.





