From: karl williamson (public@khwilliamson.com)
Date: Thu Mar 11 2010 - 13:45:49 CST
Mark Davis ☕ wrote:
> I agree that the wording should be clearer. What is meant by
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
>
> is that when matching two strings, transform each in the following way.
>
> 1. remove all hyphens that are medial (except in U+1180) then
> 2. remove whitespace and underscore, and lowercase.
>
> If after these transforms, the two strings are the same, then they match.
>
> This is a logical statement: you can do the transformations in a single
> pass if you are careful, and you also can do the comparison while
> transforming incrementally.
>
> Mark
>
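The remark about doing this in a single pass, comparing while transforming, can be sketched roughly as follows. This is only an illustration, not anything from UAX #44 or from Perl: the names loose_chars and loose_equal are hypothetical, "medial" is assumed to mean a run of hyphens with a letter or digit directly on each side, and the U+1180 exception is deliberately left out to keep the sketch short.

```python
from itertools import zip_longest


def loose_chars(name):
    """Yield the loosely normalized characters of `name` one at a time.

    Assumption: a hyphen run is "medial" when a letter or digit sits
    directly on both sides of it in the string as typed.
    The U+1180 HANGUL JUNGSEONG O-E exception is NOT handled here.
    """
    n = len(name)
    i = 0
    while i < n:
        ch = name[i]
        if ch.isspace() or ch == "_":
            i += 1                      # ignored entirely
            continue
        if ch == "-":
            j = i
            while j < n and name[j] == "-":
                j += 1                  # find the end of the hyphen run
            medial = (i > 0 and j < n
                      and name[i - 1].isalnum() and name[j].isalnum())
            if not medial:
                for _ in range(j - i):  # non-medial hyphens are kept
                    yield "-"
            i = j
            continue
        yield ch.lower()
        i += 1


def loose_equal(a, b):
    """Compare two names incrementally; stops at the first mismatch."""
    return all(x == y
               for x, y in zip_longest(loose_chars(a), loose_chars(b)))


print(loose_equal("TIBETAN LETTER-A", "tibetan_letter a"))   # True
print(loose_equal("TIBETAN LETTER -A", "TIBETAN LETTER A"))  # False
```

Because both generators are consumed in lockstep, neither string is ever fully normalized when the names differ early.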
Ok. Thank you. That's totally clear and implementable. I just want to
be sure that you realize that this means that if the user writes
TIBETAN LETTER-A
the rules above yield
tibetanlettera
which maps to
TIBETAN LETTER A
and not to what they more likely meant
TIBETAN LETTER -A
So in this name (and in TIBETAN SUBJOINED LETTER -A) the white
space before the '-' is significant, and that isn't mentioned in any
of the documents except tr18.
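To make the Tibetan case concrete, here is a minimal sketch of the two-step transform described above. It is not Perl's implementation: the function name loose_name is hypothetical, "medial" is assumed to mean a run of hyphens with a letter or digit directly on each side, and the U+1180 exception is handled by a simple special case.

```python
import re

# Assumption: a hyphen run is "medial" when a letter or digit sits
# directly on each side of it in the string as the user typed it.
_MEDIAL_HYPHENS = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
_IGNORED = re.compile(r"[\s_]+")

# The one name whose medial hyphen must survive (U+1180).
_HANGUL_O_E = "hanguljungseongo-e"


def loose_name(name):
    """Return the loose-matching key for a character name."""
    # Remove only whitespace and underscore to test for the exception.
    squeezed = _IGNORED.sub("", name).lower()
    if squeezed == _HANGUL_O_E:
        return squeezed
    # Step 1: remove medial hyphens.
    without_medial = _MEDIAL_HYPHENS.sub("", name)
    # Step 2: remove whitespace and underscore, and lowercase.
    return _IGNORED.sub("", without_medial).lower()


print(loose_name("TIBETAN LETTER-A"))   # tibetanlettera
print(loose_name("TIBETAN LETTER A"))   # tibetanlettera
print(loose_name("TIBETAN LETTER -A"))  # tibetanletter-a
```

The first two keys collide, while the third keeps its hyphen only because the space in front of it stops the hyphen from counting as medial.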
>
> On Thu, Mar 11, 2010 at 09:34, karl williamson
> <public@khwilliamson.com> wrote:
>
> Kenneth Whistler wrote:
>
> The loose matching rules in TR18 say to ignore white
> space, underscores, and hyphens. That means that
> someone could insert white space into the middle of
> what is supposed to be a single word, like
> \p{s c r i p t: greek}. Same for character names.
>
> Actually, it doesn't mean that you can arbitrarily ignore
> the identifier syntax of particular formalizations.
>
> I don't understand your sentence. I'm guessing you mean that
> 's c r i p t' is not the same as 'script', even though tr18
> says "case distinctions, whitespace, hyphens, and underbar
> are ignored." If so, shouldn't tr18 be clarified?
>
>
> I should have said "pattern syntax" rather than "identifier syntax"
> in this case, but the point is that while UTS #18 makes
> a general statement about how pattern matching for property
> names and values should be done, you still have to pay attention
> to the details of the actual implementations.
>
> Without checking an actual implementation of java.util.regex Class
> Pattern, I don't know whether:
>
> \p{_________ -------s c r i p________--_- t ___: greek}
>
> would actually match the Unicode Script property or would
> throw a PatternSyntaxException.
>
> You can try it and find out, I suppose. But that isn't
> really so much an issue for UTS #18 but rather something to take
> up with the implementers of Java, Perl, and other regex
> engines.
>
>
> The reason I'm asking this is that I am an implementer of Perl's
> regex engine. I didn't realize that that fact would be germane to
> my question, so I didn't mention it. Sorry. I'm not interested in
> what's advisable or not to use; I'm interested in what the engine
> should accept versus throw an exception on, and hence how I need to
> write the engine. So I am seeking clarification of what TUS would
> like from an implementation.
>
> In the past Perl has not accepted the full loose matching rules, but
> now I have implemented them, as I understood them, for the
> soon-to-be-released Perl 5.12. Perl 5 is an open-source project; I
> am a volunteer with some background and interest in the topic, but
> not an expert. I am, however, an expert software developer, retired
> now, so I have some time to devote to this.
>
> Based on my reading of TR18 and UAX44, I changed the Perl regex
> engine so it would parse things like what Ken mentioned above:
>
> \p{_________ -------s c r i p________--_- t ___: greek}
> as meaning \p{script:greek}, without throwing an exception. Again,
> it's not advisable for someone to write something like that, but it
> appears to me to be permissible, and so I wrote the regex engine to
> handle it.
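As a rough illustration of that reading, the sketch below normalizes the inside of a \p{...} expression by ignoring case, whitespace, hyphens and underscores on both sides of the colon. It is not the Perl engine's code; parse_property and loose_key are hypothetical names, and only the "name: value" form is handled.

```python
import re

# Characters ignored under the reading described above:
# whitespace, underscore and hyphen (case is folded separately).
_IGNORABLE = re.compile(r"[\s_\-]+")


def loose_key(text):
    """Strip ignorable characters and lowercase what is left."""
    return _IGNORABLE.sub("", text).lower()


def parse_property(body):
    """Split the inside of \\p{...} into a (name, value) pair.

    Assumption: only the 'name: value' form is supported here;
    anything without a colon raises ValueError.
    """
    name, sep, value = body.partition(":")
    if not sep:
        raise ValueError("expected 'name: value', got %r" % body)
    return loose_key(name), loose_key(value)


body = "_________ -------s c r i p________--_- t ___: greek"
print(parse_property(body))  # ('script', 'greek')
```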
>
> I am starting out to add loose matching to the regex engine for
> character names for the next release of Perl 5 (and I anticipate
> adding support for named sequences in Perl by then, so for them as
> well).
>
> Effectively, it was pointed out that my reading of what I thought
> was the plain wording of the standard might be wrong, since, if
> there can be a space between any two characters, the concept of a word
> is meaningless, and therefore the concept of a medial hyphen is as
> well. Conversely, if words can be run-on together, all hyphens
> (except at the very beginning and end of the string) become medial,
> and so the distinction is also meaningless.
>
>
> What it means is that such names as:
>
> CHARACTER BZZT
> CHARACTER B-ZZ-T
> CHARACTER BZ-ZT
>
> What about
> CHARACER BZ--ZT
> ?
>
>
> What about it?
>
> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
> the first one is missing the "T" in "CHARACTER". But then,
> I don't suppose that was your question.
>
>
> Sorry for the typo, and thanks for figuring out what I really meant.
>
>
> The loose matching rules would not distinguish:
>
> CHARACTER BZZT
>
> from
>
> CHARACTER BZ--ZT
>
> or for that matter, from
>
> CHARACTER BZ---------------------------------------------------ZT
>
> But if your question is, rather, would "CHARACTER BZ--ZT" be
> allowed as a Unicode character name, the answer is no.
> But the reason for that cannot be found in UTS #18. The reason
> is because it would be stupid and pointless to name a character
> that way, and the folks in the relevant maintenance committees
> are not stupid.
>
>
> Of course
>
>
> In general, if there is something unclear about matching rules
> in the Unicode Standard, a more fruitful direction would be to
> examine the relevant text in the proposed update for UAX #44
> and suggest any required clarifications to the UTC, if there
> really is an issue of ambiguity in that text. See:
>
> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
>
> --Ken
>
>
> Implementers need highly precise wording in a standard. So this
> sentence in the current UAX44 draft (thanks for the link) is
> problematic for me:
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> If whitespace is ignored, then all hyphens are medial, and as tr18
> points out, there would then be two other confusable cases,
> involving what you might think of as "initial" hyphens.
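The ambiguity comes down to the order of operations, which a small sketch can show. The two function names are hypothetical, and "medial" is assumed to mean a run of hyphens with a letter or digit directly on each side at the moment the hyphens are examined.

```python
import re

# Assumption: "medial" = hyphen run with a letter or digit on each side
# of it at the time the rule is applied.
MEDIAL = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
IGNORED = re.compile(r"[\s_]+")


def hyphens_first(name):
    """Judge hyphens against the original spacing, then drop spacing."""
    return IGNORED.sub("", MEDIAL.sub("", name)).lower()


def whitespace_first(name):
    """Drop spacing first; every interior hyphen then looks medial."""
    return MEDIAL.sub("", IGNORED.sub("", name)).lower()


for name in ("TIBETAN LETTER A", "TIBETAN LETTER -A"):
    print(name, hyphens_first(name), whitespace_first(name))
# hyphens_first keeps the two names apart;
# whitespace_first maps both to "tibetanlettera".
```

Under the whitespace-first reading the two Tibetan names become indistinguishable, which is why the wording needs to say when a hyphen is judged to be medial.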
>
> So, I'm in a hurry. I don't have time to wait for the next draft of
> UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
> intended, it would be good if I knew immediately, so I could go and
> plead that the revisions I would have to write be allowed in so that
> the defective version would never get published.
>
> My sense, though, is that I didn't misread it, that the statements
> made in UAX34 and 44 are imprecise, and based on your responses to
> this email, I will submit an official report through your website.
>
>
>
>
>