From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 18:24:36 CDT
> -----Message d'origine-----
> De : Hans Aberg [mailto:haberg@math.su.se]
> Envoyé : dimanche 14 octobre 2007 00:49
> À : verdy_p@wanadoo.fr
> Cc : 'Mark Davis'; 'Unicode'
> Objet : Re: FYI: Regex paper for UTC
>
> On 14 Oct 2007, at 00:05, Philippe Verdy wrote:
>
> > All my regexps are first normalized to alternations and negations ONLY
> > (instead of putting the negation at the beginning of the regexp
> > ONLY and using multiple binary set operators "|", "&", "\" like
> > suggested in the recent paper submitted to the UTC).
>
> The definition in terms of a language L(x) has the advantage of
> avoiding that problem, and being theoretically clearer exactly as to
> what is being defined.
Possible, but I won't support it when a match() method is running against
text. The compile() method will rewrite the regexps very simply using
negations and alternations. And I won't forbid using negation directly, so
writing "a -- b", or "a && ~b", or even "~b && a" will have exactly
equivalent behaviours and exactly the same performance... (supporting "&&"
or "--" will just be a facility).
For the rewrite rules, I just need 3 of them, not 10:
A -- B ==> A || ~B
A && B ==> ~(~A || ~B)
~~A ==> A
(parentheses are already eliminated by parsing, the rewrites are actually
not performed on the regexp input string, but on the transition graph
generated directly from the initial parsing).
In fact I support other rewriting rules like:
[ghiabcdej] ==> (a || b || c || d || e || g || h || i || j)
==> ((a \- e) || (g \- j))
to reorder and group classes heuristically (I can match ranges directly
instead of matching single characters, like in the following syntax:
[\q{00}-\q{99}] ==> 00 \- 99
And I support collation order in tweakable various locales to give meaning
to the previous notation, using ranges of strings with bounded length like
above or unbounded length like below (where I use collation in the C locale
tailored by numeric order, the value being interpreted in that locale):
(?{L=C.number} [5-\q{[0-9]+}] ) ==> (?{L=C.number} 5 \- (0 \- 9)+ )
Or simply "[5-\q{[0-9]+}]" using an external locale which may or may not be
tailoring the collation order.
The syntax will be fixed later after more tests about what can be supported
predictably. For now I support custom returns to validate partial matches
using filtered output on captures:
([0-9]+){?}
(which is much simpler, and will probably be faster, even though I can't
specify the condition of transition within the regexp itself)
This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 18:26:27 CDT