From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 17:51:23 CDT
Hans Aberg [mailto:haberg@math.su.se]
> This is not the language complement. The language set operators are
> just the normal set operators inside the set C*, where C is the
> character set. See for example, Aho, Sethi & Ullman, "Compilers" (the
> "dragon book") on REs.
I studied this book 18 years ago, and I still have it. Implementing
compilers was a significant part of my student projects, and I have used it
at work many times since then to implement various compiler-like tools (not
just regexps).
I don't know why you want to restrict the negation to only a small part of
what it can do.
My negation operator (like all other regexp operations) IS an injection
from a set of positioned strings (C*, P) to (C*, P), where P is a set of
positions within strings of C*. It is more general and covers the
particular case of injections from C* to C* (where P is then only a
singleton selected by a particular rule).
Note: my engine does not look first for the longest match. Longest matches
are found by filtering, using customizable rules, after each match (whatever
its size or position) is found. Why?
Consider an input text coming from a stream A that generates an infinite
sequence of characters "a".
Feed this stream to a filter that will return the longest match for the
regexp "a*". This filter will never return unless it applies the technique
of converting a final loop into a final recursion.
Now use the same filter but with the regexp "a*b": this time it's impossible
for the filter to return anything.
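As a sketch of the problem (the stream and function names below are illustrative, not from any actual engine): a longest-match filter for "a*" must keep reading as long as it sees another "a", so on an infinite stream it never terminates, while the smallest leftmost match can be reported immediately.

```python
import itertools

def infinite_as():
    """An endless stream of 'a' characters, like the stream A above."""
    while True:
        yield 'a'

def smallest_leftmost_a_star(stream):
    """For the regexp 'a*', the smallest leftmost match is the empty
    match at position 0 -- reported at once, without reading the stream."""
    return (0, 0)  # (start, end) of the empty match

def longest_a_star(stream, give_up_after=1_000_000):
    """A longest-match filter keeps consuming while it sees 'a'; on the
    infinite stream it would never return (here an artificial cap is
    added so the sketch can run at all)."""
    n = 0
    for ch in itertools.islice(stream, give_up_after):
        if ch != 'a':
            break
        n += 1
    return (0, n)  # only returns because of the cap

print(smallest_leftmost_a_star(infinite_as()))  # (0, 0), immediately
print(longest_a_star(infinite_as(), give_up_after=10))  # (0, 10), cap hit
```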
The "smallest leftmost match first" rule guarantees a return after a finite
number of steps (which does not depend on the input size), using bounded
resources (memory, time, number of state variables), even in the last two
cases above. The resource bounds are also fully predictable from the regexp
alone and do not grow combinatorially. The engine will enumerate EXACTLY
the same matches from finite input streams (such as text files), just in a
different order.
And one can still enumerate only the longest matches by filtering the
output sequentially, using a single prefetch variable for the candidate
matches returned by the previous strategy. In addition, this process is
easily parallelizable: matches in very long texts can be searched using as
many processor cores as needed, each one working on a small amount of
purely local data, without synchronization between the running threads.
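A possible reading of that sequential filter (my interpretation, not the author's actual code): since matches arrive leftmost-first and smallest-first, all matches sharing a start position are consecutive, so one prefetched candidate suffices to keep only the longest per start.

```python
def longest_only(matches):
    """Filter a stream of (start, end) matches, ordered leftmost-first
    then smallest-first, keeping only the longest match at each start.
    The single 'candidate' variable is the prefetch described above."""
    candidate = None
    for m in matches:
        if candidate is None:
            candidate = m
        elif m[0] == candidate[0]:
            candidate = m  # same start, arriving later means longer
        else:
            yield candidate  # a new start begins: emit previous winner
            candidate = m
    if candidate is not None:
        yield candidate

# non-empty matches of 'a*' in "aab", smallest-leftmost order:
stream = [(0, 1), (0, 2), (1, 2)]
print(list(longest_only(stream)))  # [(0, 2), (1, 2)]
```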
Given these advantages (plus the bonus that bounded resources offer for
system security), I'm sure my strategy is correct: it takes nothing away
from the classic regexps, which are still supported (including POSIX
ed/sed/vi, POSIX lex, and the proposed Unicode regexps).
I am even starting to modify my code to also support named subregexps,
which allow matching paired parentheses for example (impossible with
traditional regexps, which can't "count" them):
(?=paren= (a | [(]\=paren=[)] | [[]\=paren=[]] | [{]\=paren=[}])+ )
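The "paren" syntax above is the author's own notation; to illustrate why matching it needs extra engine state, here is a stdlib-only sketch that recognizes approximately the same language of 'a's and properly nested (), [], {} pairs. The explicit stack plays the role of the stack the engine would have to add.

```python
def balanced(s):
    """Return True if s consists of 'a' characters and properly nested
    (), [], {} bracket pairs -- roughly the language the recursive
    'paren' subregexp describes. The explicit stack is what a
    finite-state regexp engine lacks and would need to add."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
        elif ch != 'a':
            return False
    return not stack  # every opener must have been closed

print(balanced('a(a[aa]{a})'))  # True
print(balanced('a(]'))          # False
```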
Or matching equal substrings (like "a+a" or "xyz+xyz" but not "a+b"):
(?=samename= [a-z]+ ) + \=samename=
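This second case (equal substrings, as opposed to recursion) is already expressible in mainstream engines via backreferences; for instance Python's re module writes the author's \=samename= as a named group and a named backreference:

```python
import re

# (?P<samename>[a-z]+) captures the substring; (?P=samename) requires
# the exact same text to repeat after the literal '+'.
pat = re.compile(r'^(?P<samename>[a-z]+)\+(?P=samename)$')

print(bool(pat.match('a+a')))      # True
print(bool(pat.match('xyz+xyz')))  # True
print(bool(pat.match('a+b')))      # False
```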
I'm not yet sure how to encode the operator syntax; I'm looking for other
references... It will require adding a stack to the engine state to handle
non-initial and non-trailing recursive regexps (like the "paren" definition
above); non-recursive named regexps, and left-recursive or right-recursive
regexps, can instead be rewritten using the normal counters "+" and "*".
This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 17:52:46 CDT