Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Thu Mar 11 2010 - 13:45:49 CST


    Mark Davis ☕ wrote:
    > I agree that the wording should be clearer. What is meant by
    >
    > UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    > hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >
    >
    > is that when matching two strings, transform each in the following way.
    >
    > 1. remove all hyphens that are medial (except in U+1180) then
    > 2. remove whitespace and underscore, and lowercase.
    >
    > If after these transforms, the two strings are the same, then they match.
    >
    > This is a logical statement: you can do the transformations in a single
    > pass if you are careful, and you also can do the comparison while
    > transforming incrementally.
    >
    > Mark
    >

    Ok. Thank you. That's totally clear and implementable. I just want to
    be sure that you realize that this means that if the user writes
    TIBETAN LETTER-A

    the rules above yield
    tibetanlettera

    which maps to
    TIBETAN LETTER A

    and not to what they more likely meant
    TIBETAN LETTER -A

    So in this name (and in TIBETAN SUBJOINED LETTER -A) the whitespace
    before the '-' is significant, and none of the documents mention that
    except tr18.
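
    To make the collision concrete, here is a minimal Python sketch of the
    two-step transform as I understand it from the description above. It
    is not taken from Perl or from any Unicode document. The names
    loose_key and loose_match are made up for illustration, and it assumes
    that a "medial" hyphen is a run of hyphens with a letter or digit
    immediately on each side, which the current wording does not spell
    out. The U+1180 exception is handled by a simple special case.

```python
import re

# Assumption (not stated in the draft): a hyphen is "medial" when a run of
# hyphens has a letter or digit immediately before and immediately after it.
_MEDIAL_HYPHENS = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")

# Loose form of U+1180 HANGUL JUNGSEONG O-E, whose hyphen must be kept.
_HANGUL_O_E = "hanguljungseongo-e"


def _squash(s):
    """Step 2: remove whitespace and underscores, then lowercase."""
    return _SPACE_OR_UNDERSCORE.sub("", s).lower()


def loose_key(name):
    """Hypothetical helper: reduce a name to its loose-matching key."""
    # Special case: keep the hyphen when the input spells U+1180's name.
    if _squash(name) == _HANGUL_O_E:
        return _HANGUL_O_E
    # Step 1: remove medial hyphens; step 2: squash what is left.
    return _squash(_MEDIAL_HYPHENS.sub("", name))


def loose_match(a, b):
    """Two names match loosely when their keys are equal."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    for candidate in ("TIBETAN LETTER-A", "TIBETAN LETTER -A"):
        print(candidate, "->", loose_key(candidate))
```

    Under these assumptions TIBETAN LETTER-A reduces to the same key as
    TIBETAN LETTER A, while TIBETAN LETTER -A keeps its hyphen because the
    space in front of it stops the hyphen from being medial.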

    >
    > On Thu, Mar 11, 2010 at 09:34, karl williamson <public@khwilliamson.com
    > <mailto:public@khwilliamson.com>> wrote:
    >
    > Kenneth Whistler wrote:
    >
    > The loose matching rules in TR18 say to ignore white
    > space, underscores, and hyphens. That means that
    > someone could insert white space into the middle of
    > what is supposed to be a single word, like
    > \p{s c r i p t: greek}. Same for character names.
    >
    > Actually, it doesn't mean that you can arbitrarily ignore
    > the identifier syntax of particular formalizations.
    >
    > I don't understand your sentence. I'm guessing you mean that
    > 's c r i p t' is not the same as 'script', even though tr18
    > says "case distinctions, whitespace, hyphens, and underbar
    > are ignored." If so, shouldn't tr18 be clarified?
    >
    >
    > I should have said "pattern syntax" rather than "identifier syntax"
    > in this case, but the point is that while UTS #18 makes
    > a general statement about how pattern matching for property
    > names and values should be done, you still have to pay attention
    > to the details of the actual implementations.
    >
    > Without checking an actual implementation of java.util.regex Class
    > Pattern, I don't know whether:
    >
    > \p{_________ -------s c r i p________--_- t ___:
    > greek}
    >
    > would actually match the Unicode Script property or would
    > throw a PatternSyntaxException.
    >
    > You can try it and find out, I suppose. But that isn't
    > really so much an issue for UTS #18 but rather something to take
    > up with the implementers of Java, Perl, and other regex
    > engines.
    >
    >
    > The reason I'm asking this is that I am an implementer of Perl's
    > regex engine. I didn't realize that that fact would be germane to
    > my question, so I didn't mention it. Sorry. I'm not interested in
    > what's advisable or not to use; I'm interested in what the engine
    > should accept versus throw an exception on, and hence how I need to
    > write the engine. So I am seeking clarification of what TUS would
    > like from an implementation.
    >
    > In the past Perl has not accepted the full loose matching rules, but
    > now I have implemented what I understood them to be for the
    > soon-to-be-released Perl 5.12. Perl 5 is an open-source project; I
    > am a volunteer with some background and interest in the topic, but
    > not an expert. I am, however, an expert software developer, retired
    > now, so I have some time to devote to this.
    >
    > Based on my reading of TR18 and UAX44, I changed the Perl regex
    > engine so it would parse things like what Ken mentioned above:
    >
    > \p{_________ -------s c r i p________--_- t ___: greek}
    > as meaning \p{script:greek}, without throwing an exception. Again,
    > it's not advisable for someone to write something like that, but it
    > appears to me to be permissible, and so I wrote the regex engine to
    > handle it.
    >
    > I am starting to add loose matching to the regex engine for
    > character names for the next release of Perl 5 (and I anticipate
    > adding support for named sequences in Perl by then, so for them as
    > well).
    >
    > Effectively, it was pointed out that my reading of what I thought
    > was the plain wording of the standard might be wrong, since, if
    > there can be a space between any two characters, the concept of word
    > is meaningless, and therefore the concept of a medial hyphen is as
    > well. Conversely, if words can be run-on together, all hyphens
    > (except at the very beginning and end of the string) become medial,
    > and so the distinction is also meaningless.
    >
    >
    > What it means is that such names as:
    >
    > CHARACTER BZZT
    > CHARACTER B-ZZ-T
    > CHARACTER BZ-ZT
    >
    > What about
    > CHARACER BZ--ZT
    > ?
    >
    >
    > What about it?
    >
    > "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    > the first one is missing the "T" in "CHARACTER". But then,
    > I don't suppose that was your question.
    >
    >
    > Sorry for the typo, and thanks for figuring out what I really meant.
    >
    >
    > The loose matching rules would not distinguish:
    >
    > CHARACTER BZZT
    >
    > from
    >
    > CHARACTER BZ--ZT
    >
    > or for that matter, from
    >
    > CHARACTER BZ---------------------------------------------------ZT
    >
    > But if your question is, rather, would "CHARACTER BZ--ZT" be
    > allowed as a Unicode character name, the answer is no.
    > But the reason for that cannot be found in UTS #18. The reason
    > is because it would be stupid and pointless to name a character
    > that way,
    > and the folks in the relevant maintenance committees are not
    > stupid.
    >
    >
    > Of course
    >
    >
    > In general, if there is something unclear about matching rules
    > in the Unicode Standard, a more fruitful direction would be to
    > examine the relevant text in the proposed update for UAX #44
    > and suggest any required clarifications to the UTC, if there
    > really is an issue of ambiguity in that text. See:
    >
    > http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
    >
    > --Ken
    >
    >
    > Implementers need highly precise wording in a standard. So this
    > sentence in the current UAX44 draft (thanks for the link) is
    > problematic for me:
    >
    > UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    > hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >
    > If whitespace is ignored, then all hyphens are medial, and as tr18
    > points out, there would then be two other confusable cases,
    > involving what you might think of as "initial" hyphens.
    >
    > So, I'm in a hurry. I don't have time to wait for the next draft of
    > UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
    > intended, it would be good if I knew immediately, so I could go and
    > plead that the revisions I would have to write be allowed in so that
    > the defective version would never get published.
    >
    > My sense, though, is that I didn't misread it, that the statements
    > made in UAX34 and 44 are imprecise, and based on your responses to
    > this email, I will submit an official report through your website.
    >
    >
    >
    >
    >


