RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Sep 24 2007 - 20:30:20 CDT

Next message: Richard Cook: "Re: Composition of not included Chinese characters"

Previous message: John H. Jenkins: "Re: Composition of not included Chinese characters"
In reply to: Doug Ewell: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

Doug Ewell wrote:
> "Mike" <mike dash list at pobox dot com> wrote:
> >> I'd just like to point out that a "[ ]" regular expression is defined
> >> to match always exactly one character (if it matches at all).
> >
> > Correct. Except that a Spanish speaker would consider "ch" to be a
> > single character even though you need two code points to represent it.
>
> I don't think it will ever really be feasible to define regular
> expressions in terms of specific languages, to the point of treating
> combinations of two or more base characters as a single matchable
> "character" on the basis that speakers of language X consider the
> combination to be a single "letter."

The problem is not so much about language-specific conditions as about the
expressiveness of regular expressions; in typical regexp matchers, the
implementation is most often based on simple lexers that try to find matches
for some limited subset of language production rules.

But if you reinterpret [set] only as an abbreviated form of (s|e|t) and use
a matching algorithm similar to what is implemented in language compilers,
without assuming the separation between lexers and parsers (that is, by
avoiding the classical separation between lex and yacc in Unix tools) but
merging the two, you can build much more powerful matchers.
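
As a rough illustration of treating a set as an alternation whose elements
may span several code units (so that "ch" can be one element), here is a
minimal sketch in C. The function name match_set and the byte-oriented
matching are my assumptions for the example, not something taken from
UTS #18 or from an existing regexp engine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: treat a "set" as an alternation of strings and
   return the length in bytes of the longest alternative that matches at
   the start of input, or 0 if no alternative matches. */
static size_t match_set(const char *input, const char *const alts[], size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(alts[i]);
        /* strncmp stops at the terminating NUL, so a short input is safe */
        if (len > best && strncmp(input, alts[i], len) == 0)
            best = len;
    }
    return best;
}
```

With the alternatives { "c", "ch", "h" }, the input "chico" is matched for
2 bytes (the element "ch"), while "casa" is matched for 1 byte.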

The complexity here comes from the interpretation of negated sets: you need
to define the universe of searches, or must be able to extend it according
to your input regexps. This is possible because the regexp itself defines
its own universe of matchable subunits that it tries to accept or reject,
and because any text containing subunits not used in the regexp is still
matchable: all subunits absent from the regexp itself are considered part
of a catch-all [^.] class, for which you don't need an exhaustive
definition of its elements (this class contains an infinite number of
elements).

So [^\q{string1}\q{string2}] makes sense. To compile this expression you
need a recognizer for \q{string}, which can be built by recognizing it
character by character: include each letter /s/, /t/, /r/, /i/, /n/, /g/,
/1/, /2/ in your finite universe, then interpret the regexp as if it was
written as (?not! ("s" "t" "r" "i" "n" "g" "1")|("s" "t" "r" "i" "n" "g"
"2") ), which your compiler will interpret, when building the finite-state
automaton, as (?not! "s" "t" "r" "i" "n" "g" ("1"|"2") ).
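
To make the negated case concrete, here is a small sketch in C. It rests on
an assumption of mine about the semantics: the negated set fails if any
excluded string starts at the current position, and otherwise consumes
exactly one unit (a byte here, for simplicity). The name match_negated_set
is hypothetical, and the sketch does not factor the common prefix as a
compiled automaton would.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of [^\q{...}\q{...}]: returns 1 (one unit consumed)
   if none of the excluded strings starts at input, and 0 otherwise,
   including at end of input. ASSUMPTION: the unit is one byte. */
static size_t match_negated_set(const char *input,
                                const char *const excluded[], size_t n)
{
    if (*input == '\0')
        return 0;                       /* nothing left to consume */
    for (size_t i = 0; i < n; i++) {
        if (strncmp(input, excluded[i], strlen(excluded[i])) == 0)
            return 0;                   /* an excluded string starts here */
    }
    return 1;                           /* unit belongs to the "rest" class */
}
```

Note that the letters of the input never have to be enumerated: anything
that does not start one of the excluded strings falls into the implicit
remainder class.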

The additional complexity in this type of automaton is that each node in the
parser tree can have positive and negative links to other nodes. Traditional
parsers and scanners only have positive links, and try to map input letters
of their alphabet to only one of the positive links (where only the absence
of a match at a parsing step means that the regexp will not match).

Another complexity that your compiler will have to manage: think about the
meaning of expressions like [\p{script=Latn}\p{gc=Ll}\u0020-\u00FF]: it is
not a simple union of separate character classes but actually means
(\p{script=Latn} | \p{gc=Ll} | [\u0020-\u00FF]) where each alternative can
match or not match independently.

If you wanted to make a complete mapping to a single state number, you would
have to build a 3-dimensional table with 2 coordinates in each dimension, one
for "match" and one for "does not match". The total number of states then
goes from 2+2+2 = 2×3 = 6 to 2×2×2 = 2³ = 8: with n alternatives it grows
from 2n to 2^n, which completely changes the order of magnitude of the
problem, so the number of states to represent would rapidly explode.
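
The difference in growth is easy to check numerically. The two helper
functions below are hypothetical names of mine, given only to show the
arithmetic for n alternatives with two outcomes each.

```c
#include <stdint.h>

/* Alternatives recognized independently: two outcomes each, summed. */
static uint64_t states_independent(unsigned n)
{
    return 2u * (uint64_t)n;
}

/* One combined state per combination of outcomes; valid for n < 64. */
static uint64_t states_combined(unsigned n)
{
    return (uint64_t)1 << n;
}
```

For the 3 alternatives above this gives 6 against 8, but for 20
alternatives it is already 40 against 1048576.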

So the most effective way to build your recognizer is to allow each
alternative to be recognized independently, and to build a parsing tree like
this :

|(node 0)
+--> \p{script=Latn} matches? -+-- (Yes) --> (node 1) accept
|                              |
|                              +-- (No) ---> (node 2) reject
|
+--> \p{gc=Ll} matches?       -+-- (Yes) --> (node 3) accept
|                              |
|                              +-- (No) ---> (node 4) reject
|
+--> [\u0020-\u00FF] matches? -+-- (Yes) --> (node 5) accept
                               |
                               +-- (No) ---> (node 6) reject

When looking up which action to take, you can scan this table from top to
bottom starting at the indicated row, but a binary search is possible if
each action specifies not only the first row but also the last one to test.
For that, the table needs to be sorted within this range by test type and
parameter value, and the end of the range must be indicated: by storing the
number of rows to scan, by encoding an additional row that marks the end of
the scan, or by specially encoding the test type of the last test to
perform.

The "Test" column is easily represented by a tiny integer (or a single byte),
given that you will not support many hundreds of test types. The
"Parameters" column is a single string, which can be compacted into an
external bag so that identical strings are shared and referenced by an index
into that shared bag. One kind of parameter is of special interest for text
recognizers: a single code point, which is necessarily within 0..0x10FFFF
and can be distinguished from string indices in the shared bag by making
string numbers in that bag negative. The content of the "Node(s)" column
need not be stored (it remains above only to show what each row represents
in the first table). The recognizer implicitly starts at row index 0.

This gives a simple table-based recognizer compiled as (pseudo C language):

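#include <stdint.h>

/* complex parameters are represented as external strings in a shared bag;
   negative parameter values index the bag, others are code points */
#define STRINGINDEX(n) (-(n) - 1)
/* predefined test types */
#define TEST_SCRIPT 1
#define TEST_GC     2
#define TEST_RANGE  3
/* a negated test type marks the last test to perform for a state */
#define LAST_TEST(t) (-(t))
/* predefined action numbers */
#define ACCEPT (-1)
#define REJECT (-2)

/* one row: test type, parameter, action if matched, action if not matched */
struct Row { int8_t test; int32_t param; int32_t if_yes; int32_t if_no; };

static const struct Row Automata[] = {
    { TEST_SCRIPT,           STRINGINDEX(0), ACCEPT, REJECT },
    { TEST_GC,               STRINGINDEX(1), ACCEPT, REJECT },
    { LAST_TEST(TEST_RANGE), STRINGINDEX(2), ACCEPT, REJECT }
};
static const char *const ParameterBag[] = {
    /* STRINGINDEX(0) */ "Latn",
    /* STRINGINDEX(1) */ "Ll",
    /* STRINGINDEX(2) */ " \u00FF"  /* range bounds U+0020 and U+00FF */
};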
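
The sign convention for parameters can be decoded as in the following
sketch. The helper names param_is_code_point and param_string, and the
two-entry bag, are hypothetical and only serve the example.

```c
#include <stddef.h>
#include <stdint.h>

#define STRINGINDEX(n) (-(n) - 1)

/* Hypothetical shared bag of parameter strings. */
static const char *const bag[] = { "Latn", "Ll" };

/* Non-negative parameters are code points in 0..0x10FFFF. */
static int param_is_code_point(int32_t param)
{
    return param >= 0 && param <= 0x10FFFF;
}

/* Negative parameters index the bag; returns NULL for anything else. */
static const char *param_string(int32_t param)
{
    if (param >= 0)
        return NULL;
    size_t index = (size_t)(-(param + 1));   /* inverse of STRINGINDEX */
    return index < sizeof bag / sizeof bag[0] ? bag[index] : NULL;
}
```

Here param_string(STRINGINDEX(1)) yields "Ll", while a parameter such as
0x41 is taken directly as the code point U+0041.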

Note also that many parameter strings in the bag above could be compressed
as well, because they are only meaningful for some test types, for which
they take a finite number of possible values (for example the gc property).
These values are enumerable, and so directly representable as non-negative
numbers, which won't collide with the non-negative numbers used for matching
a single Unicode code point because that uses a separate test type.

Each state of the automaton is a set of rows in the Automata table; if the
Automata table is small enough (most regexps will generate small tables),
it can be represented as a simple bitset, whose width in bits is at most
the number of rows in the Automata table. Otherwise you could use a set of
integers where each integer in the set is a row number in the Automata
table.
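
A minimal sketch of the bitset variant, assuming at most 64 rows so that a
state fits in one uint64_t. The type name StateSet and the three helper
functions are names I made up for the illustration.

```c
#include <stdint.h>

/* ASSUMPTION: the Automata table has at most 64 rows. */
typedef uint64_t StateSet;

/* Returns the set s with the given row number added. */
static StateSet state_add(StateSet s, unsigned row)
{
    return s | ((StateSet)1 << row);
}

/* Returns nonzero if the given row number is in the set. */
static int state_has(StateSet s, unsigned row)
{
    return (int)((s >> row) & 1u);
}

/* Merging two sets is a single bitwise OR. */
static StateSet state_union(StateSet a, StateSet b)
{
    return a | b;
}
```

The initial state {0} is then simply state_add(0, 0), and merging output
sets costs one instruction regardless of how many rows they contain.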

The initial condition is associated with the state set {0}. A state in
which the input is accepted has an integer set containing ACCEPT (it should
contain only this element, unless the input string accepted by the
recognizer may be longer and would be part of another match). The tests
are evaluated one after the other when scanning the Automata table from top
to bottom: an element (a partial input state) is taken out of the set of the
current state, and an output set is created from one of the two action
columns, according to the result of the test encoded in the row of the
Automata table associated with that partial input state.

All tests (up to the row marked as being the last one for the current input
state) are evaluated and their output sets are merged by union (this can be
done by creating an initially empty output set, and then including one
element in the set after each test evaluation, unless the element is
already present in that set). Then if the output state contains the ACCEPT
element, a match was found that can be returned by the recognizer; if the
REJECT element is present in that output, a non-match is also returned by
the recognizer.
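
The evaluation loop can be sketched as follows. To keep the example free of
Unicode property data, it assumes only range tests, with the bounds stored
directly in the row; the layout of the output set (bit 0 for ACCEPT, bit 1
for REJECT, higher bits for row numbers) and the names step and action_bit
are likewise my assumptions.

```c
#include <stdint.h>

#define ACCEPT (-1)
#define REJECT (-2)
#define LAST_TEST(t) (-(t))
enum { TEST_RANGE = 1 };

/* Simplified row: a range test with inline bounds (an assumption). */
struct Row { int test; uint32_t lo, hi; int if_yes, if_no; };

/* Output set layout: bit 0 = ACCEPT, bit 1 = REJECT, bit r+2 = row r.
   Valid for row numbers up to 61. */
static uint64_t action_bit(int action)
{
    if (action == ACCEPT) return 1u;
    if (action == REJECT) return 2u;
    return (uint64_t)1 << (action + 2);
}

/* Evaluate every test from row `first` up to the row flagged as last,
   merging the selected actions into one output set by union. */
static uint64_t step(const struct Row *rows, int first, uint32_t cp)
{
    uint64_t out = 0;
    for (int i = first; ; i++) {
        int last = rows[i].test < 0;            /* negated type = last test */
        int matched = cp >= rows[i].lo && cp <= rows[i].hi;
        out |= action_bit(matched ? rows[i].if_yes : rows[i].if_no);
        if (last)
            break;
    }
    return out;
}
```

For a table describing [a-z0-9], the input '5' produces an output set with
both REJECT (from the first range) and ACCEPT (from the second); the caller
reports a match when bit 0 is set.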

None of this is complex; the complex part is the compilation of the input
regexp syntax into a compact tree (compacting this tree may not even be
necessary if the regexps are small). Some of the elements in the regexp
that require more work are those allowing repetitions: if they must be
counted, you need an extra variable to represent the fact that the count
has not yet been reached, or has been reached but is still below its
maximum value. This requires more complex actions, so that tests on counter
values can be performed without implying that an input character is matched.

You should be able to define many test types, to test various properties
(such as testing characters ignoring their case for equality with a specific
Unicode character). This won't change the model for the representation of
the Automata table (you'll just need new test type constants, each
associated with a function that evaluates the condition). The test function
should be able to advance to the next input character itself, unless it is
a counter test rather than a test that tries to match a character property.

There's no need to encode an ADVANCE condition in the action columns of
the Automata table, because a test of a character property (or of membership
in a class of characters having this property) necessarily advances to the
next input character if it matches. Multiple conditions, such as x AND y,
are representable as NOT (NOT x OR NOT y), i.e. like [^[^x][^y]] using a
character class notation in regexps that support negated sets. This requires
multiple states that must be joined in the generated parser tree: a junction
is another test that does not advance the input, but checks that all
partial states are present in the current set (its parameter in the
Automata table is a set of integers).
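
With bitset states, a junction test reduces to a mask comparison. The
sketch below uses two hypothetical conditions x (lowercase ASCII letter) and
y (ASCII hexadecimal letter) and assigns them bits 0 and 1; these choices,
and the function names, are only for illustration.

```c
#include <stdint.h>

/* Junction: succeeds only if every required partial state is present in
   the current set. It looks at the state only and consumes no input. */
static int junction(uint64_t current, uint64_t required)
{
    return (current & required) == required;
}

/* Example of x AND y: each condition contributes one partial state. */
static int is_lower_hex_letter(unsigned char c)
{
    uint64_t partial = 0;
    if (c >= 'a' && c <= 'z')
        partial |= (uint64_t)1 << 0;    /* x: lowercase ASCII letter */
    if ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))
        partial |= (uint64_t)1 << 1;    /* y: ASCII hexadecimal letter */
    return junction(partial, 0x3);      /* both x and y are required */
}
```

The character 'c' satisfies both conditions, while 'C' fails x and 'z'
fails y, so only 'c' passes the junction.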

With such regexps, you could easily recognize any language, including
computer languages that are specified using BNF syntax, or classical
ed/sed/vi regexps as supported by lex...

Supporting counters offers several alternatives: if counters are simple
(such as 0 or 1, 0 to many, or 1 to many), they don't need to be stored
in the finite state of the automaton, because they can be realized by
transforming the tree into a simple graph. If counters are complex, like
{m,n} where the constants m and n in the regexp can be arbitrarily large,
they need separate variables indexed by the counter level in the regexp.
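
A counter test for a {m,n} repetition might look like the sketch below. The
struct layout, the flag names MAY_REPEAT and MAY_EXIT, and the function name
counter_test are assumptions of mine; the point is only that the test reads
a variable and does not consume any input.

```c
/* One variable per counted repetition {min,max} in the regexp. */
struct Counter { unsigned min, max, count; };

enum { MAY_REPEAT = 1, MAY_EXIT = 2 };

/* Counter test: inspects the counter only and consumes no input.
   Returns a combination of MAY_REPEAT and MAY_EXIT. */
static int counter_test(const struct Counter *c)
{
    int result = 0;
    if (c->count < c->max)
        result |= MAY_REPEAT;   /* another iteration is still allowed */
    if (c->count >= c->min)
        result |= MAY_EXIT;     /* the minimum count has been reached */
    return result;
}
```

For {2,3}, a count of 0 only allows repeating, a count of 2 allows both
repeating and leaving the loop, and a count of 3 only allows leaving.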

This archive was generated by hypermail 2.1.5 : Mon Sep 24 2007 - 20:32:35 CDT