From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 09 2007 - 16:58:54 CDT
Hans Aberg [mailto:haberg@math.su.se] wrote:
> > You can't answer to these questions as it interacts with the possible
> > (optional and unspecified) rules of the longest match or leftmost
> > match if
> > they are implemented, or with the ordering of matches when multiple
> > matches
> > are returned in a sequence or with enumerating methods like
> > "regexp.nextMatch()" or "regexp.previousMatch()".
>
> What is you point? - DFAs can be constructed for those.
Exactly! What's the point of DFA here (it's implementation defined, but not
related to the language itself)?
Ordering of matches in results is a semantic issue (unlike DFA which implies
not changes to the semantic), completely independant of the way your matcher
works, whever it uses a completely flattened DFA or not (because such
flattened DFA is in fact not needed, and explodes in a combinatorial way,
meaning longcompilation time andunbound memory resources).
And yes I know we can build a flattened DFA, but in most cases this is not
needed, not even for performance (in fact I have measured that the flattened
DFA, as soon as it starts exploding in size, is degrading the performance
due to data cache misses and the need to use large tables through random
indirections, a good reason not to use it for regexps).
Just to make things clear: in which order the matches for in
"([0-9]+)([0-9+]" will be returned in this string: "12345"?
The result set constains these 10 matches:
"12"
"123"
"1234"
"12345"
"23"
"234"
"2345"
"34"
"345"
"45"
Now if we use parentheses for noting in the results the capturing groups,
the same set of 10 results is decomposed like this into 20 matches:
"(1)(2)"
"(1)(23)", "(12)(3)"
"(1)(234)", "(12)(34)", "(1)(234)"
"(1)(2345)", "(12)(345)", "(123)(45)", "(1234)(5)"
"(2)(3)"
"(2)(34)", "(23)(4)"
"(2)(345)", "(23)(45)", "(234)(5)"
"(3)(4)"
"(3)(45)", "(34)(5)"
"(4)(5)"
If ordering is not defined, then all results are equally valid
If ordering is considered by the client application, it may simply take only
the first match and ignore the rest, so that the regexp will have different
meanings. Note that here the simple rule of the longest match for a
capturing group does not work (it works only for regexps without capturing
groups, i.e. a matching only the whole pattern as the default capturing
group generally noted \0 or \& in many regexp implementations).
This archive was generated by hypermail 2.1.5 : Tue Oct 09 2007 - 17:02:10 CDT