RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Sep 24 2007 - 20:30:20 CDT

  • Next message: Richard Cook: "Re: Composition of not included Chinese characters"

    Doug Ewell wrote:
    > "Mike" <mike dash list at pobox dot com> wrote:
    > >> I'd just like to point out that a "[ ]" regular expression is defined
    > >> to match always exactly one character (if it matches at all).
    > >
    > > Correct. Except that a Spanish speaker would consider "ch" to be a
    > > single character even though you need two code points to represent it.
    >
    > I don't think it will ever really be feasible to define regular
    > expressions in terms of specific languages, to the point of treating
    > combinations of two or more base characters as a single matchable
    > "character" on the basis that speakers of language X consider the
    > combination to be a single "letter."

    The problem is not much about language-specific conditions, but about the
    expressiveness of regular expressions; in typical regexp matchers, the
    implementationis most often based on simple lexers that try finding matches
    for some limited subsets of language production rules.

    But if you reinterpret [set] only as an abbreviated form for writing (s|e|t)
    and use some matching algorithm similar to what is implemented in language
    compilers, without assuming the separation between lexers and scanners (by
    avoiding the classical separation between lex and yacc in Unix tools), but
    merging the two, you can builkd much more powerful matchers.

    The complexity here comes from the interpretation of negated sets: you need
    to define the universe of searches, or must be able to extend it according
    to your input regexps; this is possible, because the regexp itself defines
    its own universe about the matchable subunits it tries to accept or reject,
    and because any text containing subunits not used in the regexp would still
    be matchable by considering that all subunits absent from the regexp itself
    is part of a [^.] class, for which you don't need to have a exhaustive
    definitionofits elements (this class contains an infinite number of
    elements).

    So [^\q{string1}\q{string2}] makes sense. To compile this expression you
    need a recognizer for \q{string} which can be built by recognizing it
    character by character (so by including each letter /s/, /t/, /r/, /i/, /n/,
    /g/, /1/, /2/ in your finite universe, then interpreting the regexp as if it
    was written as (?not! ("s" "t" "r" "i" "n" "g" "1")|("s" "t" "r" "i" "n" "g"
    "2"), and that your compiler will interpret when building the finite-state
    automata as : (?not! "s" "t" "r" "i" "n" "g" ("1"|"2") ).

    The additional complexity in this type of automata is that each node in the
    parser tree can have positive and negative links to other nodes. Traditional
    parsers and scanners only have positive links, and try to map input letters
    of their alphabet only to one of the positive links (where only the absence
    of a match at each parsing step means that the regexp will not match.

    Another complexity that your compiler will have to manage: think about the
    meaning of expressions like [\p{script=Latn}\p{gc=Ll}\u0020-\u0FF]: it is
    not a simple union of separate character classes but actually means
    (\p{script=Latn} | \p{gc=Ll} | [\u0020-\u0FF]) where each alternative can
    match or not match independently.

    If you wanted to make a complete mapping to a single state number, you would
    have to build 3 dimensional table with 2 coordinates in each dimensions, one
    for "match", one for "does not match. The total number of states goes then
    from 2+2+2=2×3=6 to 2+2+2=2³=8 : this completely changes the order of
    magnitude for the problem, so rapidly the number of states to represent
    would explode.

    So the most effective way to build your recognizer is to allow each
    alternative to be recognized independently, and to build a parsing tree like
    this :

    |(node 0)
    +--> \p{script=Latn} matches? -+-- (Yes) --> (node 1) accept
    | |
    | +-- (No) ---> (node 2) reject
    |
    +--> \p{gc=Ll} matches? -+-- (Yes) --> (node 3) accept
    | |
    | +-- (No) ---> (node 4) reject
    |
    +--> [\u0020-\u0FF] matches? -+-- (Yes) --> (node 5) accept
                                  |
                                  +-- (No) ---> (node 6) reject

    This can summarized in a finite automata table:
    +----------------------------+----------------------------------+
    | Input | result |
    +----------------------------+----------------------------------+
    |Node | Test | Parameters | Action if true | Action if false |
    +-----+--------+-------------+----------------+-----------------+
    | (0) | script | Latn | goto node 1 | goto node 2 |
    +-----+--------+-------------+----------------+-----------------+
    | (0) | gc | Ll | goto node 3 | goto node 4 |
    +-----+--------+-------------+----------------+-----------------+
    | (0) | range | \u0020\u0FF | goto node 5 | goto node 6 |
    +-----+--------+-------------+----------------+-----------------+
    | (1) | accept | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    | (2) | reject | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    | (3) | accept | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    | (4) | reject | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    | (5) | accept | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    | (6) | reject | | goto node 0 | goto node 0 |
    +-----+--------+-------------+----------------+-----------------+
    You can then replace node numbers by row numbers in a compressed table, when
    several nodes share the same tests, parameters and actions:
          +-------------------------------+----------------------------------+
          | Input | result |
          +-------------------------------+----------------------------------+
          |Node(s) | Test | Parameters | Action if true | Action if false |
          +--------+--------+-------------+----------------+-----------------+
    Row 0 | (0) | script | Latn | goto row 3 | goto row 4 |
          +--------+--------+-------------+----------------+-----------------+
    Row 1 | (1) | gc | Ll | goto row 3 | goto node 4 |
          +--------+--------+-------------+----------------+-----------------+
    Row 2 | (2) | range | \u0020\u0FF | goto row 3 | goto row 4 |
          +--------+--------+-------------+----------------+-----------------+
    Row 3 | (1) | accept | | goto row 3 | goto row 4 |
          +--------+--------+-------------+----------------+-----------------+
    Row 4 | (2) | reject | | goto row 3 | goto row 4 |
          +--------+--------+-------------+----------------+-----------------+
    Finally, you could easily eliminate the rows for accept and reject, because
    they are implicit and will exist in every recognizer. This can be
    represented by assigning virtual row numbers for these two actions, so that
    they don't even need to be stored in the table (e.g. -1 for accept, -2 for
    reject).
          +-------------------------------+----------------------------------+
          | Input | result |
          +-------------------------------+----------------------------------+
          |Node(s) | Test | Parameters | Action if true | Action if false |
          +--------+--------+-------------+----------------+-----------------+
    Row 0 | (0) | script | Latn | goto row -1 | goto row -2 |
          +--------+--------+-------------+----------------+-----------------+
    Row 1 | (1) | gc | Ll | goto row -1 | goto row -2 |
          +--------+--------+-------------+----------------+-----------------+
    Row 2 | (2) | range | \u0020\u0FF | goto row -1 | goto row -2 |
          +--------+--------+-------------+----------------+-----------------+
    This representation easily allows working with negated sets (you just have
    to swap the content of the two action columns in the associated rowof the
    table).

    When looking up for which action to take, you can scan this table from top
    to bottom starting at the indicated row, but a binary lookup will be
    possible if actions specify not only the first row, but also the last one to
    test: the table needs to be sorted in this range by test type and parameters
    value, and the number of rows to scan, or by encoding an additional row for
    the end of scan, or by encoding specially the test type in the last test to
    perform.

    The "Test" column is easily represented by a tiny integer (or single byte)
    given that you will not support many hundreds of test types; the Parameters
    type is a single string (that can be compacted into an external bag so that
    the same strings can be shared, and referenced as a pointer into that shared
    bag of strings; one kind of parameter will be of special interest for text
    recognizers: a single code point, which is necessarily within 0..0x10FFF,
    and can be distinguised from string indices in the shared bag, by making
    string numbers in that bag negative. The content of the Node(s) column need
    not be stored (they remain above only to see what each row represents in the
    first table). The recognizer will implicitly starts at row index 0.
    This gives a simple table-based recognizer compiled as (pseudo C language):

    /* complex parameters represented as external strings in a shared bag */
    #define STRINGINDEX(n) (-(n)-1)
    /* predefined test types */
    #define TEST_SCRIPT 1
    #define TEST_GC 2
    #define TEST_RANGE 3
    #define LAST_TEST(t) (-(t))
    /* predefined action numbers */
    #define ACCEPT -1
    #define REJECT -2

    Automata = {
     {SCRIPT, PARAM_INDEX(0), ACCEPT, REJECT},
     {PROPGC, PARAM_INDEX(1), ACCEPT, REJECT},
     {LASTTEST(RANGE), PARAM_INDEX(2), ACCEPT, REJECT}
    };
    ParameterBag = {
      /* STRINGINDEX(0) */ "Latn",
      /* STRINGINDEX(1) */ "Ll",
      /* STRINGINDEX(2) */ "\u0020\u00FF"
    };

    Note also that many parameter strings in the bag above could be compressed
    as well, because they are not all meaningful to all test types, for which
    they take finite number of possible values (for example the gc property)
    that are enumerable, and then representable directly as positive numbers,
    that won't collide with the positive numbers used for matching one Unicode
    code point (with a separate test type).

    Each state for the automata is a set of rows in the Automata table; if the
    Automata table is small enough (most regexps will generate such tables with
    small sizes), it can be represented as a simple bitset, whose width in bits
    is at most the number of rows in the Automata table. Otherwise you could use
    set of integers where each integer of the set means a row number in the
    Automata table.

    The initial condition is associated to a state with the set {0}. A state
    condition where an input is accepted has a integer set containing ACCEPT (it
    should only include this state, unless the input string accepted by the
    recognizer may be longer and would be the part of another match. The tests
    are evaluated one after the other when scanning the Automata from top to
    bottom: an element (partial input state) is taken out from the set of the
    current state, and an output set is created from one of the two action
    columns, according to the result of the test encoded in the row of the
    Automata table associated to the partial input state.

    All tests (up to the row marked as being the last one for the current input
    state) are evaluated and their output sets are merged by union (this can be
    made by creating an initially empty output set, and then including one
    elements in the set after each test evaluation, unless the element is
    already present in that set). Then if the ouput state contains the ACCEPT
    element a match was found that can be returned by the recognizer; if the
    REJECT element is present in that output, a non-match is also returned by
    the recognizer.

    All this is not complex, the complex part is during the compilation of the
    input regexp syntax into a compact tree (compacting this tree may not even
    be necessary if the regexps are small in their syntax). Some of the elements
    in the regexp that require more work are those allowing repetitions: if they
    must be counted, you need an extra variable to represent the fact that the
    count is still not reached, or is reached but has still not its maximum
    value. This will require more complex actions so that tests made on counter
    values can be performed without implying that an input character is matched.

    You should be able to define many test types, to test various properties
    (such as testing characters ignoring their case for equality with a specific
    Unicode character) This won't change the model for the representation of the
    Automata table (you'll just need new test type constants, associated to a
    function to evaluate the condition). The test function should be able to
    advance to the next input character itself, unless this is a counter test
    instead of a test that tries to match a character property.

    There's no need to encode the ADVANCE condition in the actions columns of
    the Automata, because tests of character properties (or membership to a
    class of characters having this property) will necessarily advance to the
    next if it matches (for multiple conditions, such as x AND y, this is
    representable as NOT (NOT x OR NOT y), i.e. like [^[^x][^y]] using a
    character class notation in regexps that support negated sets, and so this
    requires multiple states that must be joined in the generated parser tree: a
    junction is another test that does not advance the input, but is a test that
    all partial states are present in the current set (its parameter in the
    Automata is a set of integers)

    With such regexp, you could easily recognize any language, including
    computer languages that are specified using BNF syntax, or classical
    ed/sed/vi regexps as supported by lex...

    Supporting counters offers several alternatives: if counters are simple
    (such as 0 or 1, or 0 to many, or 1 to many), they don't need to be stored
    in the finite state of the automata, because this can be realized by
    transforming the tree into a simple graph. If counters are complex like
    {m,n} where m,n constants in the regexps can be arbitrarily large, they need
    separate variables indexed by the counter level in the regexp.



    This archive was generated by hypermail 2.1.5 : Mon Sep 24 2007 - 20:32:35 CDT