Regular Expressions and Canonical Equivalence
Richard Wordingham
richard.wordingham at ntlworld.com
Sun May 17 19:03:02 CDT 2015
On Sun, 17 May 2015 16:33:15 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-05-16 22:33 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
> > On Sat, 16 May 2015 18:29:18 +0200
> > Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> >
> > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham <
> > > richard.wordingham at ntlworld.com>:
> > >
> > > > There is an annoying error. You appear to assume that U+0302
> > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute,
> > > > but they don't; they have the same combining class, namely
> > > > 230. I'm going to assume that 0303 is a typo for 0323.
> > >
> > >
> > > Not a typo, and I did not make the assumption you suppose, because
> > > I chose them so that they were effectively using the **same**
> > > combining class, so that they do not commute.
> >
> > In that case you have an even worse problem. Neither the trace nor
> > the string \u0303\u0302\u0302 matches the pattern
> > (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, but the string does match
> > the regular expression
> > (˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0302˕\u0302˕\u0323)*(˕\u0302˕\u0303|˕\u0303˕\u0302)*˕\u0302˕
> >
> > You've transformed (\u0302\u0303) into
> > (˕\u0302˕\u0303|˕\u0303˕\u0302), but that is unnecessary and wrong,
> > because U+0302 and U+0303 do not commute.
>
>
> Oh right! Thanks for pointing that out; it was intended to be read as:
>
> (˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0302˕\u0302˕\u0323)*(˕\u0302˕\u0303)*˕\u0302˕
>
> But my argument remains because of the presence of \u0302 in the second
> subregexp (which additionally is a separate capture, but here I'm not
> concentrating on the impact on numbered captures, only on the
> global capture, aka $0).
>
>
> > > It was the key fact of my argument that destroys your
> > > argumentation.
> >
> > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
> > corrected new regex
> >
> > (˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕
> >
> > Do you claim that this argument is destroyed? If it is irrelevant,
> > why is it irrelevant? It shows that your transform does not solve
> > the original problem of missed matches.
> >
>
> Why doesn't it solve it?

Sorry, my example wasn't quite right. It should have two combining
dots below and five circumflexes, not four as I wrote it. I will
first explain how my NDnear-FA handles it - I have now removed the
generation of the dead-end states.

Initial states:
0) LLLL0 # Starting the \u0302\u0302\u0323 factor,
# implemented as \u0323\u0302\u0302
1) LLRM # Completed the zero trip alternative to (\u0302\u0302\u0323)+
# Not actually useful.
2) LRLL0 # Starting the \u0302\u0303 factor
3) LRRM # Completed the zero trip alternative to (\u0302\u0303)+
4) R0 # Starting the \u0302 factor
=0323=00:06:=
LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323
=0323=06:012:=
LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor.
# Both have progressed as far as \u0323.
# Finiteness would restrict me to, say, 3 repeats
# in progress.
# The states of the finite DFA are a cross product of 3 copies of
# the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant
# ccc values. By no means all of these states are used.
# In the Kleene stars of the regular expression guaranteed by
# recognisability, 3 copies caters for the worst case, xyz, where x
# has a starter and ends in a non-starter, y consists of non-starters
# with the same canonical combining class, and z starts with
# non-starter and contains a starter, e.g.
# x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42
# to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42
=0302=012:018:=
LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors
# First has progressed to \u0323\u0302 and second to
# \u0323. The other way round has been pruned by the
# automated observation that if \u0302 is blocked from
# first factor, the factor cannot be completed.
=0302=018:024:=
LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor
LLLN001220:4:L2 => LLLL2 # As first factor is complete, remove it from
# consideration and relabel second factor as
# first.
=0302=024:030:=
LLLL2 => LLLL4 # \u0323\u0302\u0302 completed as far as \u0323\u0302
=0302=030:036:=
LLLL4 => LLLLM # \u0323\u0302\u0302 is complete.
LLLL4 => LRLL0 # So start \u0302\u0303 factor.
LLLL4 => LRRM # Alternatively, completed the zero trip option of
# (\u0302\u0303)*
LLLL4 => R0 # Or, we have progressed as far as the final \u0302
LLLL4 => LLLL0 # Or, start another \u0323\u0302\u0302
=0302=036:042:=
LRLL0 => LRLL2 # Got as far as \u0302 in \u0302\u0303
R0 => RM (match) # Or completed the final \u0302.
End marker is at 042:OVF
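
The worst-case example xyz in the trace comments above can be checked
mechanically. The following is only a sketch using Python's standard
unicodedata module; the helper name ccc_list is made up for the
illustration and has nothing to do with my implementation.

```python
import unicodedata

# The three pieces from the trace comments.
x = "\u0f40\u0f74"        # starter followed by a non-starter
y = "\u0f7a\u0f7a\u0f7a"  # non-starters sharing one combining class
z = "\u0f71\u0f42"        # non-starter followed by a starter


def ccc_list(s):
    """Canonical combining class of each character (hypothetical helper)."""
    return [unicodedata.combining(c) for c in s]


nfd = unicodedata.normalize("NFD", x + y + z)

# Show the combining classes and the reordered result as code points.
print(ccc_list(x), ccc_list(y), ccc_list(z))
print(" ".join(f"{ord(c):04X}" for c in nfd))
```

The point of the example is that canonical reordering moves the
non-starter of z in front of all of y and the tail of x, so three
factors are interleaved between the two starters.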
Could you please talk me through how your system recognises the string
\u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex? I
can't work out how it is supposed to work from your description.
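
For reference, that this string ought to match can be confirmed by
brute force. The sketch below is neither my automaton nor Philippe's
transform. It assumes whole-string matching with Python's re module and
simply tries every reordering of the string that has the same NFD form.
That is only adequate for short strings of non-decomposing combining
marks like these. The function name find_equivalent_match is
hypothetical.

```python
import itertools
import re
import unicodedata

# The original pattern from the thread, matched against the whole string.
PATTERN = re.compile(r"(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302")

FIVE = "\u0323\u0323" + "\u0302" * 5  # corrected example
FOUR = "\u0323\u0323" + "\u0302" * 4  # example as first posted


def find_equivalent_match(s):
    """Return a canonically equivalent reordering of s that the pattern
    matches in full, or None if there is none.

    Assumption: only reorderings are tried, which is enough when no
    character of s has a canonical decomposition or composition.
    """
    target = unicodedata.normalize("NFD", s)
    for perm in set(itertools.permutations(s)):
        candidate = "".join(perm)
        if (unicodedata.normalize("NFD", candidate) == target
                and PATTERN.fullmatch(candidate)):
            return candidate
    return None


def show(s):
    """Format a string (or None) as space-separated code points."""
    return None if s is None else " ".join(f"{ord(c):04X}" for c in s)


print(show(find_equivalent_match(FIVE)))  # a reordering that matches
print(show(find_equivalent_match(FOUR)))  # None
print(PATTERN.fullmatch(FIVE))            # None: naive matching misses it
```

Any match of the pattern with two dots below and no tilde needs two
trips of the first factor plus the final circumflex, i.e. five
circumflexes, which is why the four-circumflex string was the wrong
example.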
Richard.