Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Dec 12 2003 - 01:13:13 EST


    Kenneth Whistler <kenw at sybase dot com> wrote:

    > It is perfectly conformant with the Unicode Standard to assert
    > that <U+00E9> "é" and <U+0065, U+0301> "é" are different
    > Unicode strings. They *are* different Unicode strings. They
    > contain different encoded characters, and they have different
    > lengths.
    > ...
    > What canonical equivalence is about is making non-distinctions
    > in the *interpretation* of equivalent sequences. No Unicode-
    > conformant process should assume that another process will
    > systematically distinguish a meaningful interpretation
    > difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
    > they both represent the *same* abstract character, namely
    > an e-acute.

    Just to wrap up the discussion we had last week on compression:

    For me at least, this settles it. Compression engines generally operate
    at a level where strings of encoded characters, not their
    interpretation, are at issue. Differences between strings that differ only
    in normalization form are not relevant for interpretation, but may be very
    relevant for other factors, such as string length and checksums.
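
    A minimal sketch of that distinction, assuming Python 3 and its standard
    unicodedata and zlib modules (an illustrative choice, not something from
    the thread; CRC-32 stands in for "a checksum"). It compares the two
    sequences from the quoted example as encoded strings and then as
    canonically equivalent text:

```python
import unicodedata
import zlib

precomposed = "\u00e9"        # <U+00E9>
decomposed = "\u0065\u0301"   # <U+0065, U+0301>

# As encoded strings they differ: different code points, different lengths.
strings_differ = precomposed != decomposed
lengths = (len(precomposed), len(decomposed))

# The UTF-8 byte lengths and a checksum over those bytes differ as well.
pre_bytes = precomposed.encode("utf-8")
dec_bytes = decomposed.encode("utf-8")
utf8_lengths = (len(pre_bytes), len(dec_bytes))
checksums_differ = zlib.crc32(pre_bytes) != zlib.crc32(dec_bytes)

# After normalization (NFC is an arbitrary choice here) they are identical,
# which is what canonical equivalence means for interpretation.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(strings_differ, lengths, utf8_lengths, checksums_differ, same_after_nfc)
```

    The string-level comparisons see two different strings; only the
    normalized comparison treats them as the same e-acute.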

    That being the case, it would *not* generally be appropriate for a
    compressor to normalize its input text. To do so would be to introduce
    differences at a level where there should be none.
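
    To make the consequence concrete, here is a hedged sketch, again assuming
    Python 3 with zlib as a stand-in for "a compressor". The function names
    compress_exact and compress_normalizing are hypothetical, invented for
    this illustration. A compressor that normalizes its input no longer
    returns the string it was given, even though the result is canonically
    equivalent:

```python
import unicodedata
import zlib

def compress_exact(text: str) -> bytes:
    """Compress the encoded characters exactly as given."""
    return zlib.compress(text.encode("utf-8"))

def compress_normalizing(text: str) -> bytes:
    """Hypothetical compressor that normalizes to NFC before compressing."""
    normalized = unicodedata.normalize("NFC", text)
    return zlib.compress(normalized.encode("utf-8"))

def decompress(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: five code points.
original = "cafe\u0301"

exact_roundtrip = decompress(compress_exact(original))
normalized_roundtrip = decompress(compress_normalizing(original))

# The exact compressor returns the same Unicode string it was given.
exact_is_lossless = exact_roundtrip == original

# The normalizing compressor returns a different (shorter) string.
normalizing_is_lossless = normalized_roundtrip == original

# The two results are still canonically equivalent.
still_equivalent = (
    unicodedata.normalize("NFC", normalized_roundtrip)
    == unicodedata.normalize("NFC", original)
)

print(exact_is_lossless, normalizing_is_lossless, still_equivalent)
```

    Any length or checksum recorded for the original text would no longer
    match the output of the normalizing variant.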

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Fri Dec 12 2003 - 01:59:28 EST