Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Wed Mar 10 2010 - 13:27:48 CST


    Kenneth Whistler wrote:
    > Karl Williamson asked:
    >
    >> The loose matching rules in TR18 say to ignore white space, underscores,
    >> and hyphens. That means that someone could insert white space into the
    >> middle of what is supposed to be a single word, like
    >> \p{s c r i p t: greek}. Same for character names.
    >
    > Actually, it doesn't mean that you can arbitrarily ignore
    > the identifier syntax of particular formalizations.
    I don't understand your sentence. I'm guessing you mean that
    's c r i p t' is not the same as 'script', even though TR18 says "case
    distinctions, whitespace, hyphens, and underbar are ignored." If so,
    shouldn't TR18 be clarified?
    >
    > What it means is that if you are matching particular
    > property values from the Unicode Character Database,
    > then such strings as "right above", "right_above" and "rightabove"
    > (as well as case permutations such as "Right Above", "RIGHT_ABOVE",
    > etc.) should all be considered as matching each other.
    >
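A minimal Python sketch of the property-value matching described above, assuming only the rule as quoted from TR18 in this thread (ignore case, whitespace, hyphens, and underbar). The function names `loose_key` and `loose_equal` are hypothetical and do not come from any Unicode library.

```python
import re


def loose_key(value: str) -> str:
    """Fold a property name or value for loose comparison.

    Assumption: only the rule quoted in this thread is applied, i.e.
    case, whitespace, hyphens, and underscores are ignored.
    """
    # Drop every whitespace character, underscore, and hyphen-minus,
    # then fold case so "Right Above" and "RIGHT_ABOVE" compare equal.
    return re.sub(r"[\s_\-]+", "", value).casefold()


def loose_equal(a: str, b: str) -> bool:
    """Return True when two strings match under the loose rule."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    for candidate in ("right_above", "rightabove", "Right Above", "RIGHT_ABOVE"):
        print(candidate, loose_equal("right above", candidate))
```

Applied blindly, this fold also maps "s c r i p t" to "script", which is the case asked about above. Whether a regular-expression parser should accept that spelling depends on where its own identifier syntax draws token boundaries before the fold is applied, and this sketch does not decide that.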
    >> Someone has pointed out to me that UAX34 says this: "Like character
    >> names, names for sequences are unique if they are different even when
    >> SPACE and medial HYPHEN-MINUS characters are ignored". The term
    >> "medial" isn't in TR18. That same someone pointed out that if you can
    >> have spaces between characters in a word, that means the concept of
    >> "medial" is meaningless.
    >
    > If you assume counterfactual premises, you can prove anything
    > to be meaningless.
    >
    >> Please explain what was meant.
    >
    > What it means is that such names as:
    >
    > CHARACTER BZZT
    > CHARACTER B-ZZ-T
    > CHARACTER BZ-ZT

    What about
    CHARACTER BZ--ZT
    ?

    >
    > would be considered matches. And because they are matches
    > by the loose matching rules for names and named sequences,
    > the UTC is careful to ensure that different characters are
    > not given such names, precisely because they are not considered
    > distinct.
    >
    > CHARACTER BZZT
    > CHARACTER BZZT-
    > CHARACTER -BZZT
    >
    > would *NOT* be considered matches. So in principle it would
    > be possible to have three different characters encoded with
    > those three names.
    >
    > In practice the UTC doesn't actually use names like those,
    > but there are a few Tibetan naming conventions that slipped
    > in early on -- which is the reason for allowing non-medial hyphens
    > in names (and keeping them distinct). To wit:
    >
    > U+0F60 TIBETAN LETTER -A
    > U+0F68 TIBETAN LETTER A
    >
    > Those do *not* match.
    >
    > On the other hand, there is an exception written into the name
    > matching rule because of some Korean Hangul characters. In
    > particular:
    >
    > U+116C HANGUL JUNGSEONG OE
    > U+1180 HANGUL JUNGSEONG O-E
    >
    > also do *not* match. But in that case, it is a matter of
    > particular exception, rather than general rule.
    >
    > --Ken
    >
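The name-matching behaviour described in the quoted message can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative implementation: a hyphen is treated as "medial" only when a letter or digit sits immediately on both sides of it, and the single exception is the name HANGUL JUNGSEONG O-E. The function name `name_key` is hypothetical.

```python
import re

# The one name whose medial hyphen stays significant, per the message
# above, stored here with whitespace already removed.
HANGUL_EXCEPTION_KEY = "HANGULJUNGSEONGO-E"


def name_key(name: str) -> str:
    """Fold a character name for loose matching.

    Assumption: a hyphen is medial only when a letter or digit is
    immediately before and immediately after it.
    """
    upper = name.upper()

    # Check the exception first, ignoring only whitespace and underscores.
    squeezed = re.sub(r"[\s_]+", "", upper)
    if squeezed == HANGUL_EXCEPTION_KEY:
        return squeezed

    # Remove medial hyphens while the spaces are still in place, so that
    # the hyphen in "LETTER -A" is not mistaken for a medial one.
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)

    # Finally drop whitespace and underscores.
    return re.sub(r"[\s_]+", "", no_medial)


if __name__ == "__main__":
    for n in ("CHARACTER BZZT", "CHARACTER B-ZZ-T", "CHARACTER BZZT-",
              "TIBETAN LETTER -A", "TIBETAN LETTER A"):
        print(repr(n), "->", name_key(n))
```

Under this sketch the hyphens in "BZ--ZT" are kept, because neither one has a letter on both sides. That is an arbitrary choice made for the illustration; the thread does not settle how doubled hyphens should be treated.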


