Re: Regular Expressions and Canonical Equivalence from Philippe Verdy on 2015-05-17 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 17 May 2015 16:33:15 +0200

2015-05-16 22:33 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> On Sat, 16 May 2015 18:29:18 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > 2015-05-16 17:02 GMT+02:00 Richard Wordingham <
> > richard.wordingham_at_ntlworld.com>:
> >
> > > There is an annoying error. You appear to assume that U+0302
> > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but
> > > they don't; they have the same combining class, namely 230. I'm
> > > going to assume that 0303 is a typo for 0323.
> >
> >
> > Not a typo, and I did not made the assumption you suppose because I
> > chose then so that they were effectively using the **same** combining
> > class, so that they do not commute.
>
> In that case you have an even worse problem. Neither the trace nor the
> string \u0303\u0302\u0302 matches the pattern
> (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the
> regular expression
> (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\
> 0303|˕\0303˕\u0302)*˕\u0302˕
>
> You've transformed (\u0302\u0303) into (˕\u0302˕\0303|˕\0303˕\u0302),
> but that is unnecessary and wrong, because U+0302 and U+0303 do not
> commute.

Oh right! Thanks for pointing, it was intended you can read it as.

(˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\0303)*˕\u0302˕

But my argument remains because of the presence of \0302 in the second
subregexp (which additionally is a separate capture, but here I'm not
concentrating on the impact in numbered captures, but only on the global
capture aka $0)

> > It was the key fact of my argument that destroys your argumentation.
>
> However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
> corrected new regex
>
> (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕
>
> Do you claim that this argument is destroyed? If it is irrelevant, why
> is it irrelevant? It shows that your transform does not solve the
> original problem of missed matches.
>

Why doesn't it solve it? Note that the notation with tacks is just the
first transform. Of course you can optimize it by factorizing the common
prefixes in each alternative. In the following the 1st and 4th tacks have
some common followers in their lists of characters or character classes
they expect (for advancing to the next tack), but the 2nd and 5th tack
expect different followers.

(˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕

OK I understand the need for "counting" characters present in regexps when
they are sharing the same combining classes, but counting does not work
correctly, in fact you have to keep counters for each distinct combining
character with non-zero combining class for how they contribute to the
total length of the "star" group. They also don't contribute necessarily to
the same total when the regexp specifies them multiple times (a simple
measurment of the total length of the group is evidently not enough, all
counters must be exact multiples of the number of occurences (counter[c])
of each combining character (c) in the original untransformed content of
each alternative in the star group, and the second factor n of this
multiple must be identical for all counters

The total length is in pseudo-code: { sum=0; for(c:v in counter) sum +=
v; return sum; } but it has no use by itself.
If the number of (non-repeated) original untransformed alternative are
in mustoccur[] the check to perform is this pseudo-code:

      var n = null; foreach(c:m in mustoccur) {
        checkthat(counter[c] % m == 0);
        if (n == null) n = counter[c] / m;
        else checkthat(counter[c] / m == n);
      }

> > Reread carefully and use the example string I gave and don't assume I
> > wanted to write u0323 instead of u0303.
>
> I'm not at all sure what your example string is

My example was the original regexp without the notation tacks:
(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

It exposes some of the critical difficulties, first for returning correct
global matches (but then also for for captures, and the effect of
"ungreedy" options of Perl (and PCRE working in Perl-compatible mode or in
extended mode) and most regexp engines (whose default behavior is
"greedy"): the "ungreedy" option causes significant slowdowns with
additional rollbacks or more work to maintain an efficient backtracing
directly in the current state of the automata (if you attempt to use
deterministic rules everywhere it is possible).

But we know that it's not possible for all regexps in general, otherwise
regexp engines would just be simple LR(n) parsers with bounded n, or even
simpler LALR parsers like Bison/Yacc but without their backtracing support
for "shift"/"reduce" ambiguities, these LALR parsers are also greedy by
default and resolve ambiguities by "shifting" first, leaving the code
determine what to do when after shiting there's a parse error caused by
unexpected values, but LALR parsers do not have a clean way to handle the
correct rollback to the last ambiguous shift/reduce state with a special
match rule, and they do not support trying "reduce" first to get the
"ungreedy" behavior as they cannot return to this state to choose the
"shift" alternative).

That's why since long lexers are written with regexps, and syntaxic
scanners written preferably with LALR parsers which cannot work alone
without a separate lexer. But using LALR parsers does not work with common
languages like Fortran; it works for parsing language like C/C++ because
they are specified so that shift/reduce ambiguities are resolved using
"shift" always (i.e. the greedy behavior).

Very few parser generator support both working mode (except the excellent
PCCS that I have used since the early 1990's when it was still not
rewritten in Java and was a student project in Purdue University, and that
combines all the advantages of regexps and LR parsers, with very clean
control of backtracing, it also supports layered parsing with multiple
local parsers if needed, even without wrining any piece of output code, you
can describe the full syntaxic and lexical rules of almost all languages in
a single specification).
Received on Sun May 17 2015 - 09:34:51 CDT

This archive was generated by hypermail 2.2.0 : Sun May 17 2015 - 09:34:51 CDT