Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Wed Mar 10 2010 - 13:27:48 CST

Next message: Kenneth Whistler: "Re: property, character, and sequence name loose matching"

Previous message: David Starner: "ß vs. ſs"
In reply to: Kenneth Whistler: "Re: property, character, and sequence name loose matching"
Next in thread: Kenneth Whistler: "Re: property, character, and sequence name loose matching"

Kenneth Whistler wrote:
> Karl Williamson asked:
>
>> The loose matching rules in TR18 say to ignore white space, underscores,
>> and hyphens. That means that someone could insert white space into the
>> middle of what is supposed to be a single word, like
>> \p{s c r i p t: greek}. Same for character names.
>
> Actually, it doesn't mean that you can arbitrarily ignore
> the identifier syntax of particular formalizations.
I don't understand your sentence. I'm guessing you mean that
's c r i p t' is not the same as 'script', even though TR18 says "case
distinctions, whitespace, hyphens, and underbar are ignored." If so,
shouldn't TR18 be clarified?
>
> What it means is that if you are matching particular
> property values from the Unicode Character Database,
> then such strings as "right above", "right_above" and "rightabove"
> (as well as case permutations such as "Right Above", "RIGHT_ABOVE",
> etc.) should all be considered as matching each other.
>
>> Someone has pointed out to me that UAX34 says this: "Like character
>> names, names for sequences are unique if they are different even when
>> SPACE and medial HYPHEN-MINUS characters are ignored". The term
>> "medial" isn't in TR18. That same someone pointed out that if you can
>> have spaces between characters in a word, that means the concept of
>> "medial" is meaningless.
>
> If you assume counterfactual premises, you can prove anything
> to be meaningless.
>
>> Please explain what was meant.
>
> What it means is that such names as:
>
> CHARACTER BZZT
> CHARACTER B-ZZ-T
> CHARACTER BZ-ZT

What about
CHARACTER BZ--ZT
?

>
> would be considered matches. And because they are matches
> by the loose matching rules for names and named sequences,
> the UTC is careful to ensure that different characters are
> not given such names, precisely because they are not considered
> distinct.
>
> CHARACTER BZZT
> CHARACTER BZZT-
> CHARACTER -BZZT
>
> would *NOT* be considered matches. So in principle it would
> be possible to have three different characters encoded with
> those three names.
>
> In practice the UTC doesn't actually use names like those,
> but there are a few Tibetan naming conventions that slipped
> in early on -- which is the reason for allowing non-medial hyphens
> in names (and keeping them distinct). To wit:
>
> U+0F60 TIBETAN LETTER -A
> U+0F68 TIBETAN LETTER A
>
> Those do *not* match.
>
> On the other hand, there is an exception written into the name
> matching rule because of some Korean Hangul characters. In
> particular:
>
> U+116C HANGUL JUNGSEONG OE
> U+1180 HANGUL JUNGSEONG O-E
>
> also do *not* match. But in that case, it is a matter of
> particular exception, rather than general rule.
>
> --Ken
>
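
The matching behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the normative algorithm: the helper names (loose_value_key, loose_name_key, names_match) are hypothetical, and a "medial" hyphen is assumed to be a single hyphen with a letter or digit immediately on each side, judged before spaces are dropped. Under that assumption a doubled hyphen, as in the BZ--ZT question, is not treated as medial; the sketch takes no position on whether it should be.

```python
import re

# Assumption: a "medial" hyphen is a single hyphen with a letter or digit
# immediately on each side, judged before spaces are removed.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")

# The one exception mentioned in the thread: the hyphen in
# U+1180 HANGUL JUNGSEONG O-E stays significant.
_O_E_KEY = "HANGULJUNGSEONGO-E"


def loose_value_key(value: str) -> str:
    """Hypothetical helper: key for property values such as "Right Above".

    Ignores case, whitespace, underscores, and hyphens.
    """
    return re.sub(r"[\s_-]", "", value).upper()


def loose_name_key(name: str) -> str:
    """Hypothetical helper: key for character and sequence names."""
    # Ignore case; treat underscores like spaces; trim and collapse whitespace.
    text = " ".join(name.upper().replace("_", " ").split())
    # Special case: keep the hyphen of HANGUL JUNGSEONG O-E.
    if text.replace(" ", "") == _O_E_KEY:
        return _O_E_KEY
    # Drop medial hyphens first, then the spaces. Leading or trailing
    # hyphens (as in "TIBETAN LETTER -A") are kept.
    return _MEDIAL_HYPHEN.sub("", text).replace(" ", "")


def names_match(a: str, b: str) -> bool:
    """True if two names are equal under the loose key above."""
    return loose_name_key(a) == loose_name_key(b)
```

With these definitions, names_match("CHARACTER BZZT", "CHARACTER B-ZZ-T") is true, while "TIBETAN LETTER -A" and "TIBETAN LETTER A" produce different keys, as do "HANGUL JUNGSEONG OE" and "HANGUL JUNGSEONG O-E".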


This archive was generated by hypermail 2.1.5 : Wed Mar 10 2010 - 13:32:55 CST