From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 10 2007 - 20:39:07 CDT
No problem for me too: it effectively gives a strong semantic to the NOT
operator.
Except that I use the "rewrite" operations exactly in the reverse direction,
the result of these rewrites is that there may be NOT operators everywhere
in the final regexp, but only ONE binary operator, i.e. the || operator for
alternation (it eliminates all && and all -- from the regexp):
x -- y ==> ^(^x || y)
x && y ==> ^(^x || ^y)
^^x ==> x
(only three rewrite rules needed, instead of 10, only the last above is
common between the two sets of rewrites).
This gives identical results (because the expressions are equivalent), but
uses a single transition rule in the matcher, where each transition can be
either positive or negative, but part of the deterministic state.
All the parentheses (except capturing groups) are eliminated as part of the
construction of the transition graph, so this is no complication. The
remaining NOT operations all become the negation of the transition
condition. So the transition graph can be flattened in a single layer with
only positive conditions.
This is simpler (for me) to implement than supporting the three junction
types (&&, ||, --) in junction nodes (because it would require three
junction types in nodes, in addition to the complexity of
epsilon-transitions needed for capturing groups), and allows all transitions
to be performed much more simply by using only two sets: the union of all
positive input transitions and the union of all negative input transitions.
> -----Message d'origine-----
> De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
> part de Mike
> Envoyé : mercredi 10 octobre 2007 21:51
> À : Unicode
> Objet : Re: FYI: Regex paper for UTC
>
> > Andy and I put together a paper of recommendations for the UTC at
> > http://docs.google.com/Doc?id=dfqr8rd5_32kv97tx
>
> These recommendations look good to me.
This archive was generated by hypermail 2.1.5 : Wed Oct 10 2007 - 20:43:09 CDT