Regexes, Canonical Equivalence and Backtracking of Input

Philippe Verdy verdy_p at
Mon May 18 14:05:49 CDT 2015

2015-05-18 20:35 GMT+02:00 Richard Wordingham <
richard.wordingham at>:

> The algorithm itself should be tractable - Mark Davis has published
> an algorithm to generate all strings canonically equivalent to a
> Unicode string, and what we need might not be so complex.

Even this algorithm from Mark Davis will fail in this case:

- You can use it easily to transform a regexp containing (\u0F73) into a
regexp containing  (\u0F73|\u0F71\u0F72|\u0F71\u0F72)

- But this leaves the same problem for unbounded repetititions with the "+"
or "*" or "{m,}" operators.

- However you can use it for bounded repetitions with "{m,n}", provided
that "n" is not too large because the total number of expendaned
alternatives (without repetitions) explodes exponentially with a power
proportional to "n" (the base of the exponent depends on the basic
non-repeated string and the number of canonical equivalents it has.

Now all the problem is how to do the backtracking, and if it works, and how
to expose the matched captures (which will still be discontiguous,
including $0) and then how you can perform a safe find&replace operation:
it is hard to specify the replacement with simple "$n" placeholders, you
need more complex placeholders for handling discontiguous matches:

$n has to become not just a string, but an object whose default "tostring"
property is the exact content of the match, but other properties are needed
to expose the interleaving characters, or some context before and after the
match (notably when these contexts contain combining characters that are
NOT blocked by the match itself.

Backtracing is an internal thing before even handling matches, they occur
where there is still NO match to return, even if the regexp engine offers a
way to use a callback instead of a basic replacement string containing "$n"
placeholders, so this callback would not be called.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Unicode mailing list