Re: Regular Expressions and Canonical Equivalence from Richard Wordingham on 2015-05-14 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 14 May 2015 20:29:06 +0100

On Thu, 14 May 2015 07:08:14 -0700
"Doug Ewell" <doug_at_ewellic.org> wrote:

> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>
> > For example, I believe that one should be able to find
> > [...]
> > the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9
> > LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As
> > far as I can tell, U+1ED9 is not a letter of the Vietnamese
> > alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and
> > tone mark.
>
> What you're looking for in this case is neither an NFC match nor an
> NFD match, but a language-dependent match, as you imply further down.
> <1ED9> decomposes to <006F 0323 0302>, and if you want a match with
> <00F4>, which decomposes to <006F 0302>, your regex engine has to
> reorder the marks. It sounds unlikely that you'll find such an
> engine, but there is a lot of Vietnamese-language–specific software
> out there, so you never know.

There's no more reordering than is involved in doing a Vietnamese
collation-based search, where one has to split <006F 0323 0302> up into
collating elements <006F 0302><0323>.

Possibly a back-tracking regular expression would reorder the string.

My experimental canonical-equivalence respecting regular
expression engine is designed in the same manner as the Thomson
construction - it is a non-deterministic finite automaton (except for
the effects of capturing parts of the input string) composed of a
hierarchy of non-deterministic finite automata.

States are identified as
strings of scalars following the hierarchy. The engine checks whether a
string matches a regular expression.

The engine decomposes the string to NFD. This keeps the automaton for
the concatenation of two regular expressions simple.

I will now show how it handles the search. The regular expression to
match against is \u00f4.* - a character U+00F4 followed by anything,
including nothing.

My program essentially produced the following output, with
comments added later indicated by #:

$ ./regex '\u00f4.*' ộ # Arguments are regular expression and string

Simple Unicode regex "\u006F\u0302" # First half of regular expression
# as the automaton actually sees it.

Initial states:

0) L0 # Initial state - expecting 'o', in first half of expression.
# 'L' = left.

=o=10:20:= # Gets 'o'

L0 => L1 # Changes state to expecting combining circumflex

=0323=20:30:= # Gets combining dot below (U+0323)

L1 => N001220:1:* # N => State for concatenation of regular
                   # expressions; both automata are run.
                   # 001 => Substring length within state identifier.
                   # 220 => Combining class of U+0323. Characters with
                   # this ccc or lower may no longer be processed by
                   # the left-hand automaton.
                   # : is punctuation for readability of state
                   # 1 => Left half still expecting combining circumflex
                   # * => only state for regex ".*".

=0302=30:06:= # Gets combining circumflex (U+0302)
# The engine runs a non-deterministic finite
# automaton. It now branches to 3 states.

N001220:1:* => N001220:M:* # Left half has now reached end of expected
# string.

N001220:1:* => R* (match) # On transferring to an accept state of
# \u00f4 automaton, only the .*
# automaton needs to # be processed.

N001220:1:* => N001230:1:* # Possibly the combining circumflex is to
                           # match the '.*'. The combining class is
                           # updated to 230. This 230 will actually
                           # block the \u00f4 automaton from reaching
                           # an accept state from this state, for a
                           # combining circumflex can henceforth only be
                           # considered by the .* automaton.

At no point has the input string been reordered.

Richard.

>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>
>
Received on Thu May 14 2015 - 14:30:40 CDT

This archive was generated by hypermail 2.2.0 : Thu May 14 2015 - 14:30:40 CDT