From: Doug Ewell (dewell@adelphia.net)
Date: Fri Dec 12 2003 - 01:13:13 EST
Kenneth Whistler <kenw at sybase dot com> wrote:
> It is perfectly conformant with the Unicode Standard to assert
> that <U+00E9> "é" and <U+0065, U+0301> "é" are different
> Unicode strings. They *are* different Unicode strings. They
> contain different encoded characters, and they have different
> lengths.
> ...
> What canonical equivalence is about is making non-distinctions
> in the *interpretation* of equivalent sequences. No Unicode-
> conformant process should assume that another process will
> systematically distinguish a meaningful interpretation
> difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
> they both represent the *same* abstract character, namely
> an e-acute.
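To make Ken's distinction concrete, here is a minimal sketch using Python's standard unicodedata module. The variable names are mine, purely for illustration. It shows that the two sequences differ as strings of encoded characters, yet become identical once canonical normalization is applied:

```python
import unicodedata

precomposed = "\u00e9"   # <U+00E9> LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # <U+0065, U+0301> e + COMBINING ACUTE ACCENT

# As strings of encoded characters, the two are different.
strings_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Under canonical normalization, each maps onto the other.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(strings_equal)   # False
print(lengths)         # (1, 2)
print(nfc_equal)       # True
print(nfd_equal)       # True
```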
Just to wrap up the discussion we had last week on compression:
For me at least, this settles it. Compression engines generally operate
at a level where strings of encoded characters, not their
interpretation, are at issue. Differences between strings that are due
to normalization are not relevant for interpretation, but may be very
relevant for other factors, like string length and checksums.
That being the case, it would *not* generally be appropriate for a
compressor to normalize its input text. To do so would mean that the
decompressed output is no longer the same string of encoded characters
as the input, introducing differences at a level where there should be
none.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/