Unicode Regular Expressions for Syllable Structure and Normalisation
Richard Wordingham
richard.wordingham at ntlworld.com
Tue May 20 03:04:10 CDT 2014
On Mon, 19 May 2014 00:06:01 +0100
Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> I have done a little mathematical work on regular expressions and
> canonical equivalence, but these were true regular expressions, i.e.
> recognisable by finite automata.
> I came to the disappointing conclusion that
>
> (\p{name=COMBINING DOT BELOW}\p{name=COMBINING ACUTE ACCENT})*
>
> was not a true regular expression, at least, not in any sense that
> allows the expression to denote an infinite set of strings. One could
> define * to be restricted to character-wise concatenations that
> yielded NFD strings, but this is potentially very confusing. It
> might be argued to be in line with RL2.1 in UTS #18.
It has been observed off-list that a literal seach of an NFD string
will obviously not match \u0323\u0301\u0323\u0301. To eliminate any
confusion induced by using the example of searching, I will now talk
about the 'regular expression' x(\u0323\u0301)*y. (I did my with the
conventional task of recognising patterns.) My aim in the mathematical
work was to find a match if the pattern matches anything canonically
equivalent to the searched string. I accept that the matched substring
will in general be discontiguous, and that if the searched string is
not in NFD the substring will contain only parts of some characters.
So, when searching an NFD string for what I thought of as
'x(\u0323\u0301)*y', that is a search for any of 'xy',
'x\u0323\u0301y', 'x\u0323\u0323\u0301\u0301y',
'x\u0323\u0323\u0323\u0301\u0301\u0301y' and so on. Those who have
studied the mathematical theory of regular expressions should recall
that there is no *finite* automaton that will implement this search.
Therefore, by definition, the pattern is not a true regular
expression. That is how I came to my 'disappointing conclusion' above.
I think this limitation is not actually a problem, though it would be
something to bear in mind. Length marks in Tibetan vowels and nuktas
on Thai vowels looked the likeliest problem areas, but there seem not
to be problems.
Incidentally, although I have spoken of searching an NFD string, if one
has a finite automaton to do that, one can (given enough time and
memory) create another finite automaton that will do the same job on
unnormalised strings. The trick to the *proof* is to remember the
latest batch of non-starters in the unrearranged decomposition of the
search string, grouped by canonical combining class. (Equivalence
classes can be used to keep the numbers finite.) Whether this is
useful may be another matter!
Now, this process is very similar to converting the expansions of the
pattern string to NFD and then searching for them. However, that
conversion is itself a complicated business, similar to compiling a
regular expression. One difference is that I would like a search for the
Vietnamese vowel U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
(decomposes to 0061 0302) to match the same vowel with a tone mark for
the nặng tone, U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT
BELOW (decomposes to 0061 0323 0302).
As I asked before, is anyone using a regular expression system where the
potentially complicated pattern string does not imply NFD strings but
the search respects canonical equivalence? There are systems (e.g
HarfBuzz) that resort to changing the canonical combining classes to
make things work.
Richard.
More information about the Unicode
mailing list