    Mark Davis ☕ wrote:
    > I agree that the wording should be clearer. What is meant by
    >     UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    >     hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    > is that, when matching two strings, you transform each in the following way:
    > 1. remove all hyphens that are medial (except in U+1180) then
    > 2. remove whitespace and underscore, and lowercase.
    > If after these transforms, the two strings are the same, then they match.
    > This is a logical statement: you can do the transformations in a single
    > pass if you are careful, and you also can do the comparison while
    > transforming incrementally.
    > Mark
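
A minimal sketch of the two-step transform described above, written in Python for illustration only. The function names are hypothetical, and two points are assumptions rather than anything stated in the thread: a hyphen is treated as "medial" only when a letter or digit sits immediately on each side of it, and the U+1180 exception is detected by checking whether the input ends in "O-E". Runs of several hyphens are not handled.

```python
import re

# Assumption: "medial" means a letter or digit immediately on both sides.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")
# Assumption: this is how the U+1180 HANGUL JUNGSEONG O-E exception is spotted.
_ENDS_WITH_O_HYPHEN_E = re.compile(r"o-e[\s_]*$", re.IGNORECASE)


def loose_name_key(name):
    """Return the comparison key for a character name."""
    # Step 1: remove medial hyphens, before any whitespace is touched.
    step1 = _MEDIAL_HYPHEN.sub("", name)
    # Step 2: remove whitespace and underscores, then lowercase.
    key = _SPACE_OR_UNDERSCORE.sub("", step1).lower()
    # Exception: keep the hyphen of U+1180 so it stays distinct from U+116C.
    if key == "hanguljungseongoe" and _ENDS_WITH_O_HYPHEN_E.search(name):
        key = "hanguljungseongo-e"
    return key


def loose_match(a, b):
    """Two names match if their keys are identical."""
    return loose_name_key(a) == loose_name_key(b)
```

With this sketch, "Zero_Width No-Break Space" and "ZERO WIDTH NO-BREAK SPACE" produce the same key, while "TIBETAN LETTER -A" keeps its hyphen because a space, not a letter, precedes it.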

    Ok. Thank you. That's totally clear and implementable. I just want to
    be sure that you realize that this means that if the user writes

    the rules above yield

    which maps to

    and not to what they more likely meant

    So in this case (and in TIBETAN SUBJOINED LETTER -A) the whitespace
    before the '-' is significant, and that isn't mentioned in the
    documents, except tr18.
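
To see which names depend on such a hyphen, one can scan the character names known to Python's `unicodedata` module and group them by a key that ignores every hyphen. This is only an illustrative sketch: the helper names are hypothetical, and the set of collisions reported depends on the Unicode version bundled with the Python interpreter.

```python
import re
import sys
import unicodedata
from collections import defaultdict


def strip_everything(name):
    """Key that ignores case, whitespace, underscores and ALL hyphens."""
    return re.sub(r"[\s_-]+", "", name).lower()


def colliding_names():
    """Return groups of character names that share the same stripped key."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        name = unicodedata.name(chr(cp), "")
        if name:
            groups[strip_everything(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    for names in colliding_names():
        print(" / ".join(names))
```

Each group printed is a set of names that can only be told apart if at least one hyphen is kept, which is where the position of the whitespace starts to matter.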

    > On Thu, Mar 11, 2010 at 09:34, karl williamson wrote:
    > Kenneth Whistler wrote:
    > The loose matching rules in TR18 say to ignore white
    > space, underscores, and hyphens. That means that
    > someone could insert white space into the middle of
    > what is supposed to be a single word, like
    > \p{s c r i p t: greek}. Same for character names.
    > Actually, it doesn't mean that you can arbitrarily ignore
    > the identifier syntax of particular formalizations.
    > I don't understand your sentence. I'm guessing you mean that
    > 's c r i p t' is not the same as 'script', even though tr18
    > says "case distinctions, whitespace, hyphens, and underbar
    > are ignored." If so, shouldn't tr18 be clarified?
    > I should have said "pattern syntax" rather than "identifier syntax"
    > in this case, but the point is that while UTS #18 makes
    > a general statement about how pattern matching for property
    > names and values should be done, you still have to pay attention
    > to the details of the actual implementations.
    > Without checking an actual implementation of java.util.regex Class
    > Pattern, I don't know whether:
    > \p{_________ -------s c r i p________--_- t ___: greek}
    > would actually match the Unicode Script property or would
    > throw a PatternSyntaxException.
    > You can try it and find out, I suppose. But that isn't
    > really so much an issue for UTS #18 but rather something to take
    > up with the implementers of Java, Perl, and other regex
    > engines.
    > The reason I'm asking this is that I am an implementer of Perl's
    > regex engine. I didn't realize that that fact would be germane to
    > my question, so I didn't mention it. Sorry. I'm not interested in
    > what's advisable or not to use; I'm interested in what the engine
    > should accept versus throw an exception on, and hence how I need to
    > write the engine. So I am seeking clarification of what TUS would
    > like from an implementation.
    > In the past Perl has not accepted the full loose matching rules, but
    > now I have implemented what I understood them to be for the
    > soon-to-be-released Perl 5.12. Perl 5 is an open-source project; I
    > am a volunteer with some background and interest in the topic, but
    > not an expert. I am, however, an expert software developer, retired
    > now, so I have some time to devote to this.
    > Based on my reading of TR18 and UAX44, I changed the Perl regex
    > engine so it would parse things like what Ken mentioned above:
    > \p{_________ -------s c r i p________--_- t ___: greek}
    > as meaning \p{script:greek}, without throwing an exception. Again,
    > it's not advisable for someone to write something like that, but it
    > appears to me to be permissible, and so I wrote the regex engine to
    > handle it.
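
For illustration, the reading described above (ignore case, whitespace, hyphens and underscores in the property name and value) can be sketched in a few lines of Python. This is not Perl's implementation; `loose_property_key` is a hypothetical helper, and accepting either ':' or '=' as the separator is an assumption of the sketch.

```python
import re

# Characters ignored under the reading of the loose matching rule used here.
_IGNORED = re.compile(r"[\s_-]+")


def _normalize(text):
    """Drop whitespace, underscores and hyphens, then lowercase."""
    return _IGNORED.sub("", text).lower()


def loose_property_key(body):
    """Normalize the text found between the braces of \\p{...}.

    Returns (name, value), or (name, None) when no separator is present.
    """
    name, sep, value = body.partition(":")
    if not sep:
        name, sep, value = body.partition("=")
    if not sep:
        return (_normalize(body), None)
    return (_normalize(name), _normalize(value))
```

Under these assumptions, the body of the pattern quoted above normalizes to the same key as `script:greek`.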
    > I am starting out to add loose matching to the regex engine for
    > character names for the next release of Perl 5 (and I anticipate
    > adding support for named sequences in Perl by then, so for them as
    > well).
    > Effectively, it was pointed out that my reading of what I thought
    > was the plain wording of the standard might be wrong, since, if
    > there can be a space between any two characters, the concept of a
    > word is meaningless, and therefore the concept of a medial hyphen is
    > as well. Conversely, if words can be run together, all hyphens
    > (except at the very beginning and end of the string) become medial,
    > and so the distinction is also meaningless.
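
The ambiguity described in this paragraph can be made concrete by applying the same two removals in both orders. The sketch below is only an illustration; the function names are hypothetical, and a hyphen is assumed to be "medial" when a letter or digit sits immediately on each side of it.

```python
import re

# Assumption: "medial" means a letter or digit immediately on both sides.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")


def key_hyphens_first(name):
    """Remove medial hyphens, then whitespace and underscores."""
    name = _MEDIAL_HYPHEN.sub("", name)
    return _SPACE_OR_UNDERSCORE.sub("", name).lower()


def key_whitespace_first(name):
    """Remove whitespace and underscores first; every interior hyphen
    then has a letter on each side and is removed as well."""
    name = _SPACE_OR_UNDERSCORE.sub("", name)
    return _MEDIAL_HYPHEN.sub("", name).lower()
```

With the hyphens-first order, "TIBETAN LETTER -A" keeps its hyphen and stays distinct from "TIBETAN LETTER A". With the whitespace-first order the two names collapse to the same key, which is the situation where the notion of a medial hyphen stops carrying any information.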
    > What it means is that such names as:
    > What about
    > ?
    > What about it?
    > "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    > the first one is missing the "T" in "CHARACTER". But then,
    > I don't suppose that was your question.
    > Sorry for the typo, and thanks for figuring out what I really meant.
    > The loose matching rules would not distinguish:
    > from
    > or for that matter, from
    > CHARACTER BZ---------------------------------------------------ZT
    > But if your question is, rather, would "CHARACTER BZ--ZT" be
    > allowed as a Unicode character name, the answer is no.
    > But the reason for that cannot be found in UTS #18. The reason
    > is that it would be stupid and pointless to name a character
    > that way, and the folks in the relevant maintenance committees
    > are not stupid.
    > Of course
    > In general, if there is something unclear about matching rules
    > in the Unicode Standard, a more fruitful direction would be to
    > examine the relevant text in the proposed update for UAX #44
    > and suggest any required clarifications to the UTC, if there
    > really is an issue of ambiguity in that text. See:
    > --Ken
    > Implementers need highly precise wording in a standard. So this
    > sentence in the current UAX44 draft (thanks for the link) is
    > problematic for me:
    > UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    > hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    > If whitespace is ignored, then all hyphens are medial, and as tr18
    > points out, there would then be two other confusable cases,
    > involving what you might think of as "initial" hyphens.
    > So, I'm in a hurry. I don't have time to wait for the next draft of
    > UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
    > intended, it would be good if I knew immediately, so I could go and
    > plead that the revisions I would have to write be allowed in so that
    > the defective version would never get published.
    > My sense, though, is that I didn't misread it, that the statements
    > made in UAX34 and 44 are imprecise, and based on your responses to
    > this email, I will submit an official report through your website.

