Re: Medievalist ligature character in the PUA

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Dec 14 2009 - 19:53:10 CST

    On 12/14/2009 1:35 PM, Michael Everson wrote:
    > On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
    >
    >> On 2009-12-14, Michael Everson <everson@evertype.com> wrote:
    >>> I agree. Canonical equivalence is identity.
    >>
    >> That's a nonsensical statement. Well, actually it's not nonsensical,
    >> it's just plain wrong.
    >> Everybody who uses the word "identity" in a technical sense knows
    >> what it means, and it doesn't mean "has different bytes".
    >
    > Evidently I was not using it in a technical sense.
    >
    >> What you presumably mean is "the space in which filenames live
    >> *ought* to be the set of utf-8 strings quotiented by canonical
    >> equivalence" (so that two canonically equivalent strings are
    >> representatives of one and the same filename).
    >
    > No, that's not what I meant.
    >
    > I meant that é 00E9 and é 0065 0301 are the same platonic entity (acute e)
    > in an intrinsic sense, whereas both are different from a Cyrillic
    > lookalike, е́ 0435 0301.
    >
    > *That* kind of identity.
    Which, formally, is an equivalence, hence Unicode's term "canonically
    equivalent", which sets it apart from the myriad other ways in which two
    code sequences could be considered equivalent by different user
    communities.
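
    To make the distinction concrete, here is a minimal sketch in Python of
    the three sequences mentioned above. It assumes the standard library's
    unicodedata module; the helper name canonically_equivalent is
    hypothetical, chosen only for this illustration.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their canonical
    decomposition (NFD), so canonically equivalent inputs compare equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
cyrillic = "\u0435\u0301"   # CYRILLIC SMALL LETTER IE + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False: different code points
print(canonically_equivalent(precomposed, decomposed))  # True
print(canonically_equivalent(precomposed, cyrillic))    # False: lookalike only
```

    The plain == comparison works on code points, which is why the first
    result differs from the second.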

    What people confuse here, and what you were trying to address with your
    "platonic entity", is the distinction between the character (abstract
    character) and its encoding in actual data (code unit sequence).

    Because of the UTFs, Unicode has at least three levels: the abstract
    character, the numeric value (an integer between 0 and 1114111, i.e.
    0x10FFFF) and the bytes, words and double-words of the encoding form.
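
    As an illustration only, the three levels can be inspected for a single
    character in Python. The variable names are arbitrary; the big-endian
    codec names are used so that no byte order mark is emitted.

```python
import unicodedata

ch = "\u00e9"

# Level 1: the abstract character, identified here by its Unicode name.
name = unicodedata.name(ch)

# Level 2: the numeric value (code point).
code_point = ord(ch)

# Level 3: the code units of each encoding form, serialized big-endian.
utf8 = ch.encode("utf-8")       # two 8-bit code units (bytes)
utf16 = ch.encode("utf-16-be")  # one 16-bit code unit (word)
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit (double-word)

print(name)             # LATIN SMALL LETTER E WITH ACUTE
print(hex(code_point))  # 0xe9
print(utf8.hex(), utf16.hex(), utf32.hex())  # c3a9 00e9 000000e9

# The mappings between code point and code units are lossless.
for data, codec in ((utf8, "utf-8"), (utf16, "utf-16-be"), (utf32, "utf-32-be")):
    assert data.decode(codec) == ch
```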

    That two different sequences of code units refer to the same coded
    character is usually taken in stride, because the mappings are lossless.
    That more than one code sequence can refer to the same abstract
    character is problematic, because there's a choice when going from
    abstract character to encoding.

    But what is the correct level for allowing users to make differences in
    naming objects on a file system? Logically, it is the abstract
    character, even if for various reasons of engineering that has not
    happened. Some systems go further, and apply other equivalences (case,
    mostly), but at that moment you leave the abstraction level of the
    encoding and enter the realm of convention.
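
    A rough sketch of what comparing names at the abstract-character level
    could look like, in Python. The function filename_key and its fold_case
    parameter are hypothetical and do not describe how any particular file
    system behaves; the optional case folding stands in for the
    convention-level equivalences mentioned above.

```python
import unicodedata

def filename_key(name: str, fold_case: bool = False) -> str:
    """Hypothetical comparison key for file names.

    Canonically equivalent names map to the same key (via NFC). With
    fold_case=True, names differing only in case also collide, which is
    a convention layered on top of the encoding, not part of it.
    """
    key = unicodedata.normalize("NFC", name)
    if fold_case:
        # Case folding can produce unnormalized output, so normalize again.
        key = unicodedata.normalize("NFC", key.casefold())
    return key

a = "caf\u00e9.txt"    # precomposed e-acute
b = "cafe\u0301.txt"   # e + combining acute accent
c = "Caf\u00e9.txt"    # differs from a only in case

print(filename_key(a) == filename_key(b))  # True
print(filename_key(a) == filename_key(c))  # False
print(filename_key(a, fold_case=True) == filename_key(c, fold_case=True))  # True
```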

    From there to "religious" wars is a short step.

    A./


