Re: property, character, and sequence name loose matching

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Mar 11 2010 - 14:10:32 CST

    On 3/11/2010 11:45 AM, karl williamson wrote:
    > Mark Davis ☕ wrote:
    >> I agree that the wording should be clearer. What is meant by
    >>
    >> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    >> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >>
    >>
    >> is that when matching two strings, transform each in the following way.
    >>
    >> 1. remove all hyphens that are medial (except in U+1180) then
    >> 2. remove whitespace and underscore, and lowercase.
    >>
    >> If after these transforms, the two strings are the same, then they
    >> match.
    >>
    >> This is a logical statement: you can do the transformations in a
    >> single pass if you are careful, and you also can do the comparison
    >> while transforming incrementally.
    >>
    >> Mark
    >>
    >
    > Ok. Thank you. That's totally clear and implementable. I just want
    > to be sure that you realize that this means that if the user writes
    > TIBETAN LETTER-A
    >
    > the rules above yield
    > tibetanlettera
    Correct, the hyphen, being medial, is removed.
    >
    > which maps to
    > TIBETAN LETTER A
    >
    > and not to what they more likely meant
    > TIBETAN LETTER -A
    LETTER-A is indeed the same as LETTER A

    If you want LETTER -A, you need to retain the hyphen and at least one
    space before it. For example,

    L E T T E R -A

    would match.
    >
    > So therefore in this (and in TIBETAN SUBJOINED LETTER -A) the white
    > space before the '-' is significant, and that isn't mentioned in the
    > documents, except tr18.
    Correct, an example to that effect in UAX#44 would help clarify the
    impact of the word "medial" in the rules.
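
    The two-step transform Mark describes above can be sketched roughly as
    follows. This is only an illustration, not text from UAX #44: the names
    loose_key and loose_match are made up, and it assumes that "medial"
    means a single hyphen sitting directly between two ASCII letters or
    digits. Runs of hyphens such as BZ--ZT are not covered by that
    assumption and are left in place here.

```python
import re

# Assumption: a hyphen is "medial" when it is directly between two
# ASCII letters or digits in the string as the user typed it.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
# Whitespace and underscores are ignored everywhere.
_IGNORED = re.compile(r"[\s_]+")


def loose_key(name: str) -> str:
    """Return a comparison key for a character name (hypothetical helper)."""
    # Step 1: remove medial hyphens while the spaces are still present.
    stripped = _MEDIAL_HYPHEN.sub("", name)
    # Step 2: remove whitespace and underscores, then lowercase.
    key = _IGNORED.sub("", stripped).lower()
    # Exception for U+1180 HANGUL JUNGSEONG O-E: keep its hyphen so it
    # stays distinct from HANGUL JUNGSEONG OE.
    if key == "hanguljungseongoe":
        squeezed = _IGNORED.sub("", name).lower()
        if squeezed.endswith("o-e"):
            return "hanguljungseongo-e"
    return key


def loose_match(a: str, b: str) -> bool:
    """Two names match when their keys are identical."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    print(loose_key("TIBETAN LETTER-A"))
    print(loose_key("TIBETAN LETTER -A"))
    print(loose_match("T I B E T A N  L E T T E R -A", "TIBETAN LETTER -A"))
```

    With these assumptions, LETTER-A collapses to the same key as LETTER A,
    while LETTER -A keeps its hyphen because the space in front of it means
    the hyphen is not medial when step 1 runs.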

    A./
    >
    >>
    >> On Thu, Mar 11, 2010 at 09:34, karl williamson
    >> <public@khwilliamson.com> wrote:
    >>
    >> Kenneth Whistler wrote:
    >>
    >> The loose matching rules in TR18 say to ignore white
    >> space, underscores, and hyphens. That means that
    >> someone could insert white space into the middle of
    >> what is supposed to be a single word, like
    >> \p{s c r i p t: greek}. Same for character names.
    >>
    >> Actually, it doesn't mean that you can arbitrarily ignore
    >> the identifier syntax of particular formalizations.
    >>
    >> I don't understand your sentence. I'm guessing you mean that
    >> 's c r i p t' is not the same as 'script', even though tr18
    >> says "case distinctions, whitespace, hyphens, and underbar
    >> are ignored." If so, shouldn't tr18 be clarified?
    >>
    >>
    >> I should have said "pattern syntax" rather than "identifier syntax"
    >> in this case, but the point is that while UTS #18 makes
    >> a general statement about how pattern matching for property
    >> names and values should be done, you still have to pay attention
    >> to the details of the actual implementations.
    >>
    >> Without checking an actual implementation of java.util.regex
    >> Class Pattern, I don't know whether:
    >>
    >> \p{_________ -------s c r i p________--_- t ___: greek}
    >>
    >> would actually match the Unicode Script property or would
    >> throw a PatternSyntaxException.
    >>
    >> You can try it and find out, I suppose. But that isn't
    >> really so much an issue for UTS #18 but rather something to take
    >> up with the implementers of Java, Perl, and other regex
    >> engines.
    >>
    >>
    >> The reason I'm asking this is that I am an implementer of Perl's
    >> regex engine. I didn't realize that that fact would be germane to
    >> my question, so I didn't mention it. Sorry. I'm not interested in
    >> what's advisable or not to use; I'm interested in what the engine
    >> should accept versus throw an exception on, and hence how I need to
    >> write the engine. So I am seeking clarification of what TUS would
    >> like from an implementation.
    >>
    >> In the past Perl has not accepted the full loose matching rules, but
    >> now I have implemented what I thought were them for the
    >> soon-to-be-released Perl 5.12. Perl 5 is an open-source project; I
    >> am a volunteer with some background and interest in the topic, but
    >> not an expert. I am, however, an expert software developer, retired
    >> now, so I have some time to devote to this.
    >>
    >> Based on my reading of TR18 and UAX44, I changed the Perl regex
    >> engine so it would parse things like what Ken mentioned above:
    >>
    >> \p{_________ -------s c r i p________--_- t ___: greek}
    >> as meaning \p{script:greek}, without throwing an exception. Again,
    >> it's not advisable for someone to write something like that, but it
    >> appears to me to be permissible, and so I wrote the regex engine to
    >> handle it.
    >>
    >> I am starting out to add loose matching to the regex engine for
    >> character names for the next release of Perl 5 (and I anticipate
    >> adding support for named sequences in Perl by then, so for them as
    >> well).
    >>
    >> Effectively, it was pointed out that my reading of what I thought
    >> was the plain wording of the standard might be wrong, since, if
    >> there can be a space between any two characters, the concept of word
    >> is meaningless, and therefore the concept of a medial hyphen is as
    >> well. Conversely, if words can be run-on together, all hyphens
    >> (except at the very beginning and end of the string) become medial,
    >> and so the distinction is also meaningless.
    >>
    >>
    >> What it means is that such names as:
    >>
    >> CHARACTER BZZT
    >> CHARACTER B-ZZ-T
    >> CHARACTER BZ-ZT
    >>
    >> What about
    >> CHARACER BZ--ZT
    >> ?
    >>
    >>
    >> What about it?
    >>
    >> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    >> the first one is missing the "T" in "CHARACTER". But then,
    >> I don't suppose that was your question.
    >>
    >>
    >> Sorry for the typo, and thanks for figuring out what I really meant.
    >>
    >>
    >> The loose matching rules would not distinguish:
    >>
    >> CHARACTER BZZT
    >>
    >> from
    >>
    >> CHARACTER BZ--ZT
    >>
    >> or for that matter, from
    >>
    >> CHARACTER BZ---------------------------------------------------ZT
    >>
    >> But if your question is, rather, would "CHARACTER BZ--ZT" be
    >> allowed as a Unicode character name, the answer is no.
    >> But the reason for that cannot be found in UTS #18. The reason
    >> is because it would be stupid and pointless to name a character
    >> that way, and the folks in the relevant maintenance committees
    >> are not stupid.
    >>
    >>
    >> Of course
    >>
    >>
    >> In general, if there is something unclear about matching rules
    >> in the Unicode Standard, a more fruitful direction would be to
    >> examine the relevant text in the proposed update for UAX #44
    >> and suggest any required clarifications to the UTC, if there
    >> really is an issue of ambiguity in that text. See:
    >>
    >> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
    >>
    >> --Ken
    >>
    >>
    >> Implementers need highly precise wording in a standard. So this
    >> sentence in the current UAX44 draft (thanks for the link) is
    >> problematic for me:
    >>
    >> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    >> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >>
    >> If whitespace is ignored, then all hyphens are medial, and as tr18
    >> points out, there would then be two other confusable cases,
    >> involving what you might think of as "initial" hyphens.
    >>
    >> So, I'm in a hurry. I don't have time to wait for the next draft of
    >> UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
    >> intended, it would be good if I knew immediately, so I could go and
    >> plead that the revisions I would have to write be allowed in so that
    >> the defective version would never get published.
    >>
    >> My sense, though, is that I didn't misread it, that the statements
    >> made in UAX34 and 44 are imprecise, and based on your responses to
    >> this email, I will submit an official report through your website.
    >>
    >>
    >>
    >>
    >>
    >
    >


