Re: property, character, and sequence name loose matching

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Mar 11 2010 - 14:10:32 CST

Next message: karl williamson: "Re: property, character, and sequence name loose matching"

Previous message: karl williamson: "Re: property, character, and sequence name loose matching"
In reply to: karl williamson: "Re: property, character, and sequence name loose matching"
Next in thread: karl williamson: "Re: property, character, and sequence name loose matching"
Reply: karl williamson: "Re: property, character, and sequence name loose matching"

On 3/11/2010 11:45 AM, karl williamson wrote:
> Mark Davis ☕ wrote:
>> I agree that the wording should be clearer. What is meant by
>>
>> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
>> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>>
>>
>> is that when matching two strings, transform each in the following way.
>>
>> 1. remove all hyphens that are medial (except in U+1180) then
>> 2. remove whitespace and underscore, and lowercase.
>>
>> If after these transforms, the two strings are the same, then they
>> match.
>>
>> This is a logical statement: you can do the transformations in a
>> single pass if you are careful, and you also can do the comparison
>> while transforming incrementally.
>>
>> Mark
>>
>
> Ok. Thank you. That's totally clear and implementable. I just want
> to be sure that you realize that this means that if the user writes
> TIBETAN LETTER-A
>
> the rules above yield
> tibetanlettera
Correct, the hyphen, being medial, is removed.
>
> which maps to
> TIBETAN LETTER A
>
> and not to what they more likely meant
> TIBETAN LETTER -A
LETTER-A is indeed the same as LETTER A

If you want LETTER -A you need to retain the hyphen, and at least one space before it:

L E T T E R -A

would match
>
> So therefore in this (and in TIBETAN SUBJOINED LETTER -A) the white
> space before the '-' is significant, and that isn't mentioned in the
> documents, except tr18.
Correct, an example to that effect in UAX#44 would help clarify the
impact of the word "medial" in the rules.
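
For implementers, the two-step transform Mark describes might be sketched
in Python roughly as follows. This is only an illustration under stated
assumptions, not text from UAX #44: the names loose_key and loose_match are
made up for the example, "medial" is taken to mean a run of hyphens with a
letter or digit immediately on each side in the string as typed, and the
U+1180 exception is handled by a simple special case.

```python
import re

# Assumption: a "medial" hyphen run has a letter or digit directly on both sides.
_MEDIAL_HYPHENS = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")
# Assumption: U+1180 is recognised by its name ending in "O-E".
_ENDS_WITH_O_HYPHEN_E = re.compile(r"[oO]-[eE][\s_]*$")


def loose_key(name: str) -> str:
    """Return a comparison key for a character name (hypothetical helper)."""
    # Step 1: remove medial hyphens, before any whitespace is touched.
    without_medial = _MEDIAL_HYPHENS.sub("", name)
    # Step 2: remove whitespace and underscores, then lowercase.
    key = _SPACE_OR_UNDERSCORE.sub("", without_medial).lower()
    # Exception: keep the hyphen of U+1180 HANGUL JUNGSEONG O-E so that it
    # stays distinct from U+116C HANGUL JUNGSEONG OE.
    if key == "hanguljungseongoe" and _ENDS_WITH_O_HYPHEN_E.search(name):
        key = "hanguljungseongo-e"
    return key


def loose_match(a: str, b: str) -> bool:
    """Two names match loosely when their keys are identical."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    print(loose_key("TIBETAN LETTER-A"))
    print(loose_match("TIBETAN LETTER-A", "TIBETAN LETTER A"))
    print(loose_match("TIBETAN LETTER-A", "TIBETAN LETTER -A"))
    print(loose_match("L E T T E R -A", "LETTER -A"))
```

Under those assumptions, "TIBETAN LETTER-A" and "TIBETAN LETTER A" share a
key, while "TIBETAN LETTER -A" keeps its hyphen, because the space in front
of it means the hyphen is not medial when step 1 runs.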

A./
>
>>
>> On Thu, Mar 11, 2010 at 09:34, karl williamson
>> <public@khwilliamson.com> wrote:
>>
>> Kenneth Whistler wrote:
>>
>> The loose matching rules in TR18 say to ignore white
>> space, underscores, and hyphens. That means that
>> someone could insert white space into the middle of
>> what is supposed to be a single word, like
>> \p{s c r i p t: greek}. Same for character names.
>>
>> Actually, it doesn't mean that you can arbitrarily ignore
>> the identifier syntax of particular formalizations.
>>
>> I don't understand your sentence. I'm guessing you mean that
>> 's c r i p t' is not the same as 'script', even though tr18
>> says "case distinctions, whitespace, hyphens, and underbar
>> are ignored." If so, shouldn't tr18 be clarified?
>>
>>
>> I should have said "pattern syntax" rather than "identifier syntax"
>> in this case, but the point is that while UTS #18 makes
>> a general statement about how pattern matching for property
>> names and values should be done, you still have to pay attention
>> to the details of the actual implementations.
>>
>> Without checking an actual implementation of java.util.regex Class
>> Pattern, I don't know whether:
>>
>> \p{_________ -------s c r i p________--_- t ___: greek}
>>
>> would actually match the Unicode Script property or would
>> throw a PatternSyntaxException.
>>
>> You can try it and find out, I suppose. But that isn't
>> really so much an issue for UTS #18 but rather something to take
>> up with the implementers of Java, Perl, and other regex engines.
>>
>>
>> The reason I'm asking this is that I am an implementer of Perl's
>> regex engine. I didn't realize that that fact would be germane to
>> my question, so I didn't mention it. Sorry. I'm not interested in
>> what's advisable or not to use; I'm interested in what the engine
>> should accept versus throw an exception on, and hence how I need to
>> write the engine. So I am seeking clarification of what TUS would
>> like from an implementation.
>>
>> In the past Perl has not accepted the full loose matching rules, but
>> now I have implemented what I thought were them for the
>> soon-to-be-released Perl 5.12. Perl 5 is an open-source project; I
>> am a volunteer with some background and interest in the topic, but
>> not an expert. I am, however, an expert software developer, retired
>> now, so I have some time to devote to this.
>>
>> Based on my reading of TR18 and UAX44, I changed the Perl regex
>> engine so it would parse things like what Ken mentioned above:
>>
>> \p{_________ -------s c r i p________--_- t ___: greek}
>> as meaning \p{script:greek}, without throwing an exception. Again,
>> it's not advisable for someone to write something like that, but it
>> appears to me to be permissible, and so I wrote the regex engine to
>> handle it.
>>
>> I am starting out to add loose matching to the regex engine for
>> character names for the next release of Perl 5 (and I anticipate
>> adding support for named sequences in Perl by then, so for them as
>> well).
>>
>> Effectively, it was pointed out that my reading of what I thought
>> was the plain wording of the standard might be wrong, since, if
>> there can be a space between any two characters, the concept of word
>> is meaningless, and therefore the concept of a medial hyphen is as
>> well. Conversely, if words can be run-on together, all hyphens
>> (except at the very beginning and end of the string) become medial,
>> and so the distinction is also meaningless.
>>
>>
>> What it means is that such names as:
>>
>> CHARACTER BZZT
>> CHARACTER B-ZZ-T
>> CHARACTER BZ-ZT
>>
>> What about
>> CHARACER BZ--ZT
>> ?
>>
>>
>> What about it?
>>
>> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
>> the first one is missing the "T" in "CHARACTER". But then,
>> I don't suppose that was your question.
>>
>>
>> Sorry for the typo, and thanks for figuring out what I really meant.
>>
>>
>> The loose matching rules would not distinguish:
>>
>> CHARACTER BZZT
>>
>> from
>>
>> CHARACTER BZ--ZT
>>
>> or for that matter, from
>>
>> CHARACTER BZ---------------------------------------------------ZT
>>
>> But if your question is, rather, would "CHARACTER BZ--ZT" be
>> allowed as a Unicode character name, the answer is no.
>> But the reason for that cannot be found in UTS #18. The reason
>> is because it would be stupid and pointless to name a character
>> that way, and the folks in the relevant maintenance committees are
>> not stupid.
>>
>>
>> Of course
>>
>>
>> In general, if there is something unclear about matching rules
>> in the Unicode Standard, a more fruitful direction would be to
>> examine the relevant text in the proposed update for UAX #44
>> and suggest any required clarifications to the UTC, if there
>> really is an issue of ambiguity in that text. See:
>>
>> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
>>
>> --Ken
>>
>>
>> Implementers need highly precise wording in a standard. So this
>> sentence in the current UAX44 draft (thanks for the link) is
>> problematic for me:
>>
>> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
>> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>>
>> If whitespace is ignored, then all hyphens are medial, and as tr18
>> points out, there would then be two other confusable cases,
>> involving what you might think of as "initial" hyphens.
>>
>> So, I'm in a hurry. I don't have time to wait for the next draft of
>> UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
>> intended, it would be good if I knew immediately, so I could go and
>> plead that the revisions I would have to write be allowed in so that
>> the defective version would never get published.
>>
>> My sense, though, is that I didn't misread it, that the statements
>> made in UAX34 and 44 are imprecise, and based on your responses to
>> this email, I will submit an official report through your website.
>>
>>
>>
>>
>>
>
>


This archive was generated by hypermail 2.1.5 : Thu Mar 11 2010 - 14:12:27 CST