Date/Time: Mon May 7 23:14:32 CDT 2012
Contact: tchrist@perl.com
Name: Tom Christiansen
Report Type: Public Review Issue
Opt Subject: feedback on pri182
This is a piece of mail I sent to Mark Davis this morning. It has two different concerns. The first has to do with various forms of case-insensitive matching, as discussed in pri182. The second has to do with errors in the current definitions of \R and \X in tr18. I have made very minor corrections to the mail compared to the form in which it was originally sent. Please also consult Russ Cox’s Karl Williamson’s followup message which include further and better examples than my own. --tom Tom Christiansen Boulder, Colorado ====================================== From: Tom Christiansen <tchrist@chthon> To: <mark@macchiato.com> CC: karl williamson <public@khwilliamson.com>, Russ Cox <rsc@google.com> Date: Mon, 07 May 2012 07:57:08 MDT Subject: Re: Preview of working draft for #18 In-Reply-To: <CAJ2xs_GQN_majaKM6vszFr45S=JyBZfbrxpeWj687XcuDj7e6A@mail.gmail.com> [Russ cc'd to make sure I'm not talking about him behind his back :) and so he can correct me if I've unwittingly misstated something relevant to his work.] Mark Davis <mark@macchiato.com> wrote on Sun, 06 May 2012 21:16:26 -0700: > I hadn't heard anything from either of you on the note I sent you back in > February about the proposed update to #18: > http://www.unicode.org/review/pri182/. > I want to make sure you realize that during the meeting this week, the UTC > may approve that draft. So if you do have any feedback, it should be sent > in in as soon as possible. It does make some conformance changes, so it > would be very helpful if you could review. Mark, Thank you for the note. I've looked at pri182, and agree with most of the changes without any comment. I've also read over Karl's feedback, and think he addresses the issues of case-insensitive matching pretty well. It's a real conundrum. In recent discussions I've had with Russ Cox, a couple of other things have come up that might stand consideration. The first has bearing on Karl's feedback regarding case-insensitive matching of properties, so I am sending it to you directly. The issue involves transitivity. It's most easily illustrated by considering what dot — or if you would, \p{Any} — should match under case-insensitive mode. Since dot can match any code point, it could for example match U+00DF. But U+00DF could match such things as "ss", "SS", paired long s's, etc. So now suddenly you have \p{Any} matching more than any, which is darned weird. This is the problem that Karl discovered with something like [\x80-\xFF] under case-insensitive mode. Notice that 0xDF is in that range, which means that "SS" et alios are also in that range. It is very bizarre, or at least counterintuitive for traditional regexes, to have a bracketed character class match multiple code points, since just as with dot, a bracketed character class "smells" like it "should" match a single code point. This becomes even worse with negated classes. Consider that [^\x80-\xFF] matches something that cannot be matched by something in that range. Since 0xDF is in that range, under full casefolding trying to match the string "guess" against a pattern like /(?i)^[^\x80-\xFF]+$/ would necessarily fail, since "guess" contains within it an instance of something that one of its range elements *could* match, since one of its range elements is 0xDF. In Perl case-insensitive notation, "guess" =~ /\xDF/i is true (and the length of the match is 2). Therefore its logical complement "guess" =~ /^[^\x80-\xFF]+$/i is necessarily false. It is shocking to say that a purely ASCII string can *fail* to match something that asserts it has no Block=Latin1 characters in it. This paradox held up our release last year trying to figure out what to do about this. In the end, we rescinded full casefolding on inverted bracketed character classes alone, but left it intact elsewhere. I don't know whether that was the best solution. Actually, I'm pretty sure it isn't. I do not at all like things working differently under different circumstances, because that will again lead to paradoxes, and paradoxes to surprises, and surprises to incorrect code. And above all, we don't want incorrect code. If you're looking for prior art, I know of only three pattern matching languages that support full casefolding. These are Perl, Ruby, and Matthew Barnett's "regexp" library for Python. Russ Cox reminds me that casefolding rules should apply to everything, and that this leads to paradoxes. Would you expect this to fail? "guess" =~ /^\P{Block=Latin1}+$/i Does not "guess" consist of all non-Latin1 character? Well, sure, but the case-insensitive mode throws it open to the paradox. In the same way, he holds that "guess" =~ /^.+$/i or "guess" =~ /^\p{Any}+$/i should also fail due to transitivity. I see his point, but I don't like where all this leads. EDITED TO ADD: Here is a portion of Russ's mail: > However, it does seem to me that since /ß/i matches "ss" and . or > \p{Any} are logically a large alternation of individual characters, > that implies that /./i matches "ss", in the same way that /(A-Z)+/i > matches "abc". So I think that there are certain unexpected > _positive_ matches implied by full case folding. I would hesitate to > suggest that there are unexpected negative matches, since case folding > typically only adds alternatives. (The exception would be expressions > like /^[^class]+$/ in which the negation of an unexpectedly enlarged > class turns out to be unexpectedly small.) > > As an illustration of the positive match problem, I find it very hard > to draw a dividing line anywhere in this sequence: > "guess" =~ /^guess$/i > "guess" =~ /^gueß$/i > "guess" =~ /^gue[ß]$/i > "guess" =~ /^gue[ß]$/i > "guess" =~ /^gue[\xC0-\xFF]$/i > "guess" =~ /^gue[\x00-\x{10FFFF}]$/i > "guess" =~ /^gue.$/i > "guess" =~ /^....$/i > "guess" =~ /^.{4}$/i > > And yet I think we all agree that the first two are necessarily true > and that the last is somewhat ridiculous. At what point, though, does > it go wrong? As you know, Mark, I believe the only sane way to do this is to rewrite full_casefold(STRING) =~ /PATTERN/ That is, apply the mapping on the string, then write the pattern with that in mind, and have no (?i) mode operant. I do not mean to dissuade the committee from recommending this, but be aware that there remain at least three problems with this approach: 1. performance: you need auxiliary space to hold the casefold of each string you ever wish to match case-insensitively 2. correctness: you now force people to understand the casefold of arbitrary strings, which is really asking a great deal. But they have to know that in order to write that pattern correctly. 3. atomicity: it is common to wish to apply case-insensitivity to only particular subportions of a match, but not to others. For example: STRING =~ /PAT1(?i:PAT2)PAT3/ This is normal, where only PAT2 but not PAT1 nor PAT3 are matched irrespective of case. This cannot be accomplished using the strategy of mapping the string to be matched against to its casefold; that is, there is no way to write this correctly: full_casefold(STRING) =~ /PAT1(?i:PAT2)PAT3/ Because the remapping has already been already done on the whole string. It is an indivisible mapping, and that is too much atomicity for customary use. All these same issues arise equally with any notion of "canonical matching", or "matching as though the string were in NFD," or any other normalization form. That is, this approach: nfd(STRING) =~ /PATTERN/ is one that I believe is necessary for sanity, but it has all the same three problems/issues/concerns as I have just raised regarding matching done in case-insensitive mode: it takes both time and space, it requires sophistication (=expertise) and care (=diligence) on the pattern's author, and it is an indivisible all-or-nothing step. EDITED TO ADD: Even if the committee follows up with pri182's recommendation to relax the requirement of full casefolding to allow for matching to be done using simple casemapping alone, and to require a mapping as I have shown above for full casefolding to avoid the more egregious of the paradoxes, all three of my issues (performance, correctness, atomicity) still apply to the rewrite scheme. It might be worth mentioning these neccesarily attendant issues for those who decide to pursue using the rewrite approach. Furthermore, some paradoxes remain even under simple casefolding. This is because the simple casefolds of LATIN SMALL LETTER LONG S and KELVIN SIGN both have simple casefolds in the ASCII range. Therefore this: "guess" =~ /^\P{Block=Latin Extended-A}+$/i which says much contain only code points that are *not* from that block, would *fail* even though it indeed contains only code points that are not in that block. That's because LATIN SMALL LETTER LONG S simple casefolds to plain s, which is present in the string (twice). Therefore it would fail, and really shock people (I feel). Nobody is expecting this paradox. Notice how getting rid of full casefolding does not banish all paradoxes! Karl further reports: > The proposed draft does explicitly use the Phonetic_Extensions block as > an example of one that isn't closed under /i. I submit that the above > text, which appears to have come from Russ, suggests that he and Tom, > like me, would be surprised that blocks match differently under /i vs > not. And that reinforces my point that doing so will lead to bugs and > programmer time wasted trying to figure out what's going on. Asmus last > year was of the opinion that all properties should be immune to /i > matching. It's clear from Perl user feedback that they do expect the > casing properties to match differently. But, I believe they would be > surprised, and unhappy, to find that /\p{ASCII}/i matches the KELVIN > SIGN and LATIN SMALL LETTER LONG S. END ADDITION ========================= The second issue is one that has been staring at us for a long time in tr18. It does not relate to the proposed changes, however, and so I have not submitted it against the proposed update. It is still a real issue, though, and needs to be clarified in some future draft, because otherwise implementers cannot know to do it consistently (or arguably, correctly). Karl and I, and Russ and I alike, have discussed how there is a genuine problem with tr18's definition of \R (a linebreak grapheme) and \X (any extended grapheme cluster). The definitions given in tr18 for those **ARE NOT SUFFICIENT** to guarantee correct behavior. This must be fixed. Again using Perl notation, "\r\n" =~ /^\R$/ or more explicitly "\r\n" =~ /\A\R\z/ works correctly given that \R must mean (?:\r\n|\p{VertSpace}) or, in (?x) mode for legibility / breathing room: (?: \r\n | \p{VertSpace} ) However, that simplistic definition causes these to match: "\r\n" =~ /^\R{2}$/ "\r\n" =~ /^\R\p{Any}$/ and that is wrong: you can't back up and reconsider an alternate match for the linebreak grapheme. That's why Perl implements \R as a "possessive" match. (?>\r\n|\p{VertSpace}) That means that once it's decided on \r\n, it can never backup to retry the second branch in the alternative. You can do it that way in a backtracking matcher, but not in traditional, non- backtracking finite automaton. There you must write it with a lookahead negation, such as having \R actually match (?: \r\n | \p{VertSpace} ) (?! \p{VertSpace} ) That is, it forces the longest possible sense of each \R linebreak grapheme. EDITED TO ADD: That of course is not correct. Russ Cox mentions in his own mail the following: > Tom's suggested definition for \R is not quite right, I don't think, > as it prohibits \p{VertSpace} before another \p{VertSpace}. A > possibly better definition might be > > (?:\r\n|(?!\r\n)\p{VertSpace}) > > Something similar should be possible for \X. END ADDITION Again, this is not implementable in a matcher that forbids lookaheads. However, I believe (and Russ will please correct me if I am wrong) that that restriction *could* be suspended for a one-codepoint lookahead in a way that does not compromise the computability requirements of the matcher. This I believe because such matchers can already implement boundaries like \b and \B, so I think they can also implement this sort of thing, were they so inclined. EDITED TO ADD: Russ confirms that this can be indeed still be done in a DFA/NFA environment. See his mail for details. END ADDITION The same issue arises with \X. Consider nfd("café") =~ /^caf\X$/ # this must pass nfd("café") =~ /^caf\X.$/ # this must FAIL You cannot ever interpret \X as matching the "e" alone to leave room for dot to then match COMBINING ACUTE ACCENT, because that would cause your \X to break at the wrong kind of boundary. Perl addresses this by implementing \X as a possessive match. All this is far simpler to explain if we pretend that \X matches (?:\PM\pM*). It *doesn't*, of course, because an extended grapheme cluster is quite a bit more complex than that. However, that simplistic toy definition suffices to illustrate the problem. Suppose that \X really meant (?:\PM\pM*) then the backtracking would mess up the second case. This would match, but it needs to fail: nfd("café") =~ /^caf\X.$/ # this must FAIL That's why it has to be one of (?>\PM\pM*) or (?:\PM\pM*+) depending on your notational preferences. Again, this poses a problem for NFA/DFA implementations, since backtracking is meaningless there. You have to rewrite it all with a lookahead conformant with the grapheme boundary conditions spelt out in tr29 on Unicode Text Segmentation. EDITED TO ADD: I believe that a non-backtracking engine (thus without "possessive" matches) could implement this (admittedly toy) definition of a grapheme cluster by changing it to (?:\PM\pM*)(?!\pM) That should stop it from breaking up a mark that should be connected. END ADDITION Off the top of my head, I believe the actual definition for \X must be more like this (in /x mode of course): (?> \R | \p{Grapheme_Base} \p{Grapheme_Extend}*+ | \p{Grapheme_Extend}++ | \p{Any} ) and where \R is also the possessive version described previously. Again, that only works if your matcher can do possessive matching. Otherwise you have to do something much more complicated with special-purpose lookahead negations. Which assumes then that they can at least do that. I think a single code-point lookahead should again suffice here with \X as it did with \R, but I have not thought the matter through; tr29 is rather sticky. As with \R, for \X I do not believe that any theoretical concerns regarding computability are insurmountable, but I know of no NFA/DFA implementation of \X. I again hope that Russ will correct me if I am wrong about any of that. Mark, please feel free to repost, redistribute, or use this letter in part or in whole in any way you might see fit. Also please do not hesitate to request that I split up the issues and submit them as separate feedbacks, although if you do that, I could perhaps stand some guidance. I've put it all in this long note because of the urgent timing. If you have any questions or comments about any of this, I will try to respond with all possible alacrity, in case it may have bearing on your meeting. Hope this helps, --tom Tom Christiansen Boulder, Colorado