Re: property, character, and sequence name loose matching

From: Mark Davis ☕ (mark@macchiato.com)
Date: Thu Mar 11 2010 - 12:34:16 CST

Next message: Asmus Freytag: "Re: property, character, and sequence name loose matching"

Previous message: philip chastney: "Fw: Re: ß vs. ſs"
In reply to: karl williamson: "Re: property, character, and sequence name loose matching"
Next in thread: karl williamson: "Re: property, character, and sequence name loose matching"
Reply: karl williamson: "Re: property, character, and sequence name loose matching"

I agree that the wording should be clearer. What is meant by

UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens
except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

is that, to match two strings, you transform each in the following way:

1. remove all hyphens that are medial (except in U+1180) then
2. remove whitespace and underscore, and lowercase.

If after these transforms, the two strings are the same, then they match.
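
A minimal Python sketch of the two steps above, offered as an illustration rather than as normative text. It assumes that a "medial" hyphen is a run of one or more hyphens with a letter or digit directly on each side (chosen so that "BZ--ZT" loose-matches "BZZT", as in Ken's examples quoted below), and it handles the U+1180 exception by special-casing that one name. The names loose_key and loose_match are hypothetical.

```python
import re

# Assumption: a hyphen run is "medial" when a letter or digit sits
# directly on both sides of it (evaluated on the untransformed string).
_MEDIAL_HYPHENS = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
_IGNORABLE = re.compile(r"[\s_]+")

# Assumption: the exception is applied by recognizing this one name.
_U1180_KEY = "hanguljungseongo-e"


def _strip_ignorables(s: str) -> str:
    """Step 2: remove whitespace and underscore, then lowercase."""
    return _IGNORABLE.sub("", s).lower()


def loose_key(name: str) -> str:
    """Return the transformed form of a name for loose comparison."""
    # Keep the hyphen of U+1180 HANGUL JUNGSEONG O-E.
    if _strip_ignorables(name) == _U1180_KEY:
        return _U1180_KEY
    # Step 1: remove medial hyphens; step 2: remove ignorables, lowercase.
    return _strip_ignorables(_MEDIAL_HYPHENS.sub("", name))


def loose_match(a: str, b: str) -> bool:
    """Two strings match if their transformed forms are identical."""
    return loose_key(a) == loose_key(b)
```

With these assumptions, loose_match("CHARACTER B-ZZ-T", "character_bzzt") is true, while "HANGUL JUNGSEONG O-E" and "HANGUL JUNGSEONG OE" stay distinct.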

This is a logical description of the result, not a required procedure: you
can do both transformations in a single pass if you are careful, and you can
also do the comparison incrementally while transforming.
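
One way the single-pass, incremental comparison might look in Python, again only as a sketch. It assumes that a medial hyphen is a run of hyphens with a letter or digit directly on each side, and it leaves out the U+1180 exception to stay short. The names loose_chars and loose_equal are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator


def loose_chars(name: str) -> Iterator[str]:
    """Lazily yield the characters that survive both transforms."""
    i, n = 0, len(name)
    while i < n:
        ch = name[i]
        if ch == "-":
            # Find the end of this run of hyphens.
            j = i
            while j < n and name[j] == "-":
                j += 1
            # Assumption: medial means letter/digit on both sides of the run.
            medial = (i > 0 and name[i - 1].isalnum()
                      and j < n and name[j].isalnum())
            if not medial:
                for _ in range(j - i):
                    yield "-"
            i = j
        elif ch.isspace() or ch == "_":
            i += 1  # ignored
        else:
            yield ch.lower()
            i += 1


def loose_equal(a: str, b: str) -> bool:
    """Compare while transforming; stops at the first difference."""
    # zip_longest pads the shorter stream with None, which never equals
    # a yielded character, so differing lengths count as a mismatch.
    return all(x == y for x, y in zip_longest(loose_chars(a), loose_chars(b)))
```

Because both generators are consumed in lockstep, neither transformed string is ever built in full, and the comparison ends as soon as the two inputs diverge.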

Mark

On Thu, Mar 11, 2010 at 09:34, karl williamson <public@khwilliamson.com> wrote:

> Kenneth Whistler wrote:
>
>>>>> The loose matching rules in TR18 say to ignore white space, underscores,
>>>>> and hyphens. That means that someone could insert white space into the
>>>>> middle of what is supposed to be a single word, like
>>>>> \p{s c r i p t: greek}. Same for character names.
>>>>>
>>>> Actually, it doesn't mean that you can arbitrarily ignore
>>>> the identifier syntax of particular formalizations.
>>>>
>>> I don't understand your sentence. I'm guessing you mean that
>>> 's c r i p t' is not the same as 'script', even though tr18 says "case
>>> distinctions, whitespace, hyphens, and underbar are ignored." If so,
>>> shouldn't tr18 be clarified?
>>>
>>
>> I should have said "pattern syntax" rather than "identifier syntax"
>> in this case, but the point is that while UTS #18 makes
>> a general statement about how pattern matching for property
>> names and values should be done, you still have to pay attention
>> to the details of the actual implementations.
>>
>> Without checking an actual implementation of java.util.regex Class
>> Pattern, I don't know whether:
>>
>> \p{_________ -------s c r i p________--_- t ___: greek}
>>
>> would actually match the Unicode Script property or would
>> throw a PatternSyntaxException.
>>
>> You can try it and find out, I suppose. But that isn't
>> really so much an issue for UTS #18 but rather something to take
>> up with the implementers of Java, Perl, and other regex
>> engines.
>>
>>
> The reason I'm asking this is that I am an implementer of Perl's regex
> engine. I didn't realize that that fact would be germane to my question, so
> I didn't mention it. Sorry. I'm not interested in what's advisable or not
> to use; I'm interested in what the engine should accept versus throw an
> exception on, and hence how I need to write the engine. So I am seeking
> clarification of what TUS would like from an implementation.
>
> In the past Perl has not accepted the full loose matching rules, but now I
> have implemented what I thought were them for the soon-to-be-released Perl
> 5.12. Perl 5 is an open-source project; I am a volunteer with some
> background and interest in the topic, but not an expert. I am, however, an
> expert software developer, retired now, so I have some time to devote to
> this.
>
> Based on my reading of TR18 and UAX44, I changed the Perl regex engine so
> it would parse things like what Ken mentioned above:
>
> \p{_________ -------s c r i p________--_- t ___: greek}
> as meaning \p{script:greek}, without throwing an exception. Again, it's
> not advisable for someone to write something like that, but it appears to me
> to be permissible, and so I wrote the regex engine to handle it.
>
> I am starting out to add loose matching to the regex engine for character
> names for the next release of Perl 5 (and I anticipate adding support for
> named sequences in Perl by then, so for them as well).
>
> Effectively, it was pointed out that my reading of what I thought was the
> plain wording of the standard might be wrong, since, if there can be a space
> between any two characters, the concept of word is meaningless, and
> therefore the concept of a medial hyphen is as well. Conversely, if words
> can be run-on together, all hyphens (except at the very beginning and end of
> the string) become medial, and so the distinction is also meaningless.
>
>
>>>> What it means is that such names as:
>>>>
>>>> CHARACTER BZZT
>>>> CHARACTER B-ZZ-T
>>>> CHARACTER BZ-ZT
>>>>
>>> What about
>>> CHARACER BZ--ZT
>>> ?
>>>
>>
>> What about it?
>>
>> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
>> the first one is missing the "T" in "CHARACTER". But then,
>> I don't suppose that was your question.
>>
>
> Sorry for the typo, and thanks for figuring out what I really meant.
>
>
>> The loose matching rules would not distinguish:
>>
>> CHARACTER BZZT
>>
>> from
>>
>> CHARACTER BZ--ZT
>>
>> or for that matter, from
>>
>> CHARACTER BZ---------------------------------------------------ZT
>>
>> But if your question is, rather, would "CHARACTER BZ--ZT" be
>> allowed as a Unicode character name, the answer is no.
>> But the reason for that cannot be found in UTS #18. The reason
>> is because it would be stupid and pointless to name a character that way,
>> and the folks in the relevant maintenance committees are not
>> stupid.
>>
>
> Of course
>
>
>> In general, if there is something unclear about matching rules
>> in the Unicode Standard, a more fruitful direction would be to
>> examine the relevant text in the proposed update for UAX #44
>> and suggest any required clarifications to the UTC, if there
>> really is an issue of ambiguity in that text. See:
>>
>> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
>>
>> --Ken
>>
>>
> Implementers need highly precise wording in a standard. So this sentence
> in the current UAX44 draft (thanks for the link) is problematic for me:
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> If whitespace is ignored, then all hyphens are medial, and as tr18 points
> out, there would then be two other confusable cases, involving what you
> might think of as "initial" hyphens.
>
> So, I'm in a hurry. I don't have time to wait for the next draft of UAX44.
> Perl 5.12 is in a code freeze. If I misread what you guys intended, it
> would be good if I knew immediately, so I could go and plead that the
> revisions I would have to write be allowed in so that the defective version
> would never get published.
>
> My sense, though, is that I didn't misread it, that the statements made in
> UAX34 and 44 are imprecise, and based on your responses to this email, I
> will submit an official report through your website.
>


This archive was generated by hypermail 2.1.5 : Thu Mar 11 2010 - 12:37:37 CST