From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 00:50:00 CST
> > So a regexp /ch/ or /\q{ch}/ matches the same
> > thing, the former being more efficient than the second one because it
> does
> > not alter the input universe.
>
> The \q{} syntax is only valid (or needed) within square brackets, so
> /\q{ch}/ is not a valid regular expression. Again, merely using \q
> should not have side effects.
If I used // delimiters it's only to show where the regexp parts are within
the message. But also to mean that it makes no difference if a "collation
element" (not a "character class" as you call it) if present within []
brackets or outside of it, provided that it matches as a single item with
"."
In that case, you can of course remove \q{} outside of [] because the
complete regexp: /\q{something}/ will match exactly the same thing as the
complete regexp: /something/.
If I used \q{} in a such a way, it was just to show where the delimitation
of units (that would be matched by a single ".") would occur, without
introducing a new notation, because this use was consistant with the use of
the same unit as part of a [set], allowing me to say that [set] becomes
completely equivalent to (s|e|t) and [a-z\q{ch}] becoming also completely
equivalent to (a|b|c|...|y|z|\q{ch})
(note that the "..." ellipsis just above is a local shortcut, not intended
to be three occurrences of the "." special character)
But note that with my notation /\q{ch}./ would NOT be equivalent to /ch./
- the latter regexp will match only 3 characters: /c/ followed by /h/
followed by what /./ matches by default (i.e. [\u0000-\u10FFF] minus the set
of line terminators, which depends on the single line or multi-line mode in
effect, and that I'll note \R).
- the former regexp extends the input universe (matched by ".") by making it
[\u0000-\10FFF\q{ch}] (so that it now contains /c/ or /h/ or the sequence
/ch/).
With this notation, then /[^\q{ch}]/ makes sense because it is effectively
matches now any element of the set /./ minus line terminators and the
sequence /ch/, but /[^\q{ch}]./ will not match the string "ch", even though
it could be interpreted as the sequence /c/ followed by /h/, where:
- the first /c/ would match [^\q{ch}] defined as /./ minus /\q{xh}/, i.e.
now the extended set U = ([\x0-\x10FFF] minus (\R|\q{ch})) where /c/ is
effectively present
- the second /h/ would match /./ i.e. the same set U as above
The reason being that the input universe U has been extended, and the inout
lexer will return a single unit \q{ch} for every sequence /ch/ occurring in
the input file but no separated /c/ and /h/ units in that order.
I spoke about the way to limit the extension of the input universe: I said
that it would depend on the input locale, that could also be delimited in an
expression:
So /./ in the default C locale would be the set [\x0-\x10FFF] of all Unicode
codepoints minus \R, by default unless it is extended by occurrences of
\q{strings} within the same locale context. One can revert back to the
default set by delimiting locales in a way similar to /(?locale=C:.)/
If you start tour regexp engine matches within a Spanish locale /./ will
match by default the same thing as /{?locale=es:.)/ but you could still use
simply the regexp /./ if this locale is simply specified externally or this
is the user's locale. In this Spanish locale (just a sample), every
occurrence of /ch/ would be recognized according to the locale definition of
its alphabet, as a single unbreakable unit, even though it would still
contain the separate /c/ and /h/ items when they are matchable isolately
only when they don't occur in sequence in the input file(s) to scan.
The proposed notation is then not based on individual characters: the /./
regexp as well as [set] and [^set] will not necessarily match a single
"character" but a single "collation element" in the current locale context.
The current locale context is modified everywhere in the regexp by the
inclusion of new collation elements with \q{}, except in parts of the regexp
that are explicitly delimited by an explicit locale delimitation using:
/(?locale=<code>:<rest of regexp>)/
What this means is that:
- An English user (as perceived by its user locale defining the default
context of search) searching for /[a-cd-z]./ will get a match for the string
"char", because its alphabet makes no difference between [a-z] and [a-cd-z]
- A Spanish user will not get a match in "char" witch the same regexp
/[a-cd-z]./, if the Spanish locale is defined with the ordered collation
elements {[a-z], \q{ch}} or as the ordered collation elements {[a-c], q{ch},
[d-z]}
- The English (or Spanish user) that wants explicitly the behaviour of the
regexp matches found in the Spanish locale, can specify it explicitly by
using the regexp with the desired locale: /(?locale=es:[a-cd-z].)/ (or by
specifying the Spanish locale explicitly in an external parameter or flag to
the regexp engine
- if these users want to get the default behaviour, they just need the same
thing by replacing "es" by "C" in the the previous regexp, or in the
explicit locale parameter when invoking the regexp compiler engine.
- Mixed locales are possible to limit the extension (by the \q{...}
notation) of the active ordered set of collation elements. In an expression
like /(?locale=C:[^\q{ch}])./, the last occurrence of "." will only match
one collation element of the default user's locale, because the extension
made by \q{ch} in the part of the regexp explicitly delimted by the locale
designation does not extend before or after this delimitation everywhere
else in the regexp: the explicit locale designation defines a scope for the
modification of the current ordered set of collation elements.
Now it could be time to define how the use of a collation element \q{ch}
modifies the current ordered set of collation elements: if \q{ch} is already
contained in the current set (independently of its order), then the ordered
set is not altered.
Otherwise, the ordered set is extended by adding the collation element at
end of the ordered set, with a primary difference.
If this position in the ordered set is not suitable, one could specify
another order by introducing a collation rule in the locale delimitation
scope, for example (?locale=C;ch<d: ...) which first redefines the current
set of collation element as the default set used in the C locale (i.e. only
the binary order of singe-codepoint, then performs a collation tailoring by
introducing the collation element \q{ch}, a contraction, before \q{d}, a
collation element with a single codepoint, using primary difference.
The "C" locale (binary order of single codepoints) would be predefined, as
well as a "U" locale for the Default Unicode order defined in the DUCET
(which already contains some special collation elements defined as
contractions).
The current set of collation is element is ordered and not a simple
unordered set, so that ranges used in regexps like /[a-z]/ make sense (this
"set" notation using a range does not mean just the single characters whose
code points are between those assigned to "a" and "z" inclusively, but the
subset of collation elements that are ordered between the collation elements
"a" and "z".
More precisely, the current set of collation elements is defined as a
tailored collation, rather than simply an ordered set of collation elements:
every local switch to another locale changes (locally within the delimited
part of the regexp) the current tailored collation to the collation
associated to that locale.
This refined definition is normally not needed for simple regexps that try
to match /[set]/ or /[^set]/, including when the sets contain ranges like in
/[a-z]/ or /[^a-z]/. This will be needed when trying to match more complex
ranges based on multiple collation elements.
One could want finding matches for strings that collate between two strings
in a given locale. This would require another notation than ranges in []
sets. For example one could use (?range:string1:string2:regexp3) which does
not modify the current ordered set of collation elements, but where the
three parameters "string1" and "string2" and "regexp3" are parsed according
to the current tailored collation.
For example to match all 3 letters words in Spanish between c and d
(inclusive, but "c" and "d" won't match because they are not 3 letters) one
would use /(?locale=es:(?range:c:d:...))/
A Spanish user would just enter the simpler regexp /(?range:c:d:...)/
without needing to qualify its search with a delimited locale scope.
Note here that there are three dots, each one matches a single Spanish
collation element, according to the Spanish collation which is in scope, but
the whole expression will match only those strings that collate between "c"
and "d" inclusively (the boundaries of a string range do not have to match
the third regular expression, and if they don't they won't be possible
matches)
A notation similar to the (?range:) notation could be done for specifying
unbounded minimum or maximum boundaries.
- given that the smallest minimum boundary is the empty string:
(?range::d:.*) will match every string lower than or equal to the upper
bound "d", in the current tailored collation.
- given that the largest maximum boundary is an infinite string there would
be no way to specify it, but one could simply remove one parameter to the
range, so that (?range:c:.*) will just specify the lower bound "c" and the
last parameter ".*" will be the regexp defining candidate matches to check
with the range condition, and will match any string higher than or equal to
"c" in the current tailored collation.
To exclude the boundaries from a range, one could use the "^" special
character for this purpose (which could be quoted if needed for interpreting
it literally). It could be used for both the upper or lower bound, including
the empty string in the lower bound; beside this special meaning of "^" and
the interpretation of escapes, the upper and lower bounds in string range
conditions are not regexps but string literals.
With such system, we have now defined regexps that make sense linguistically
for every locale, and a system that allow regexps to be also interpreted
without depending on the user locale,when this is needed, and a system that
allows matching additional collation elements when needed.
The complexity here is not in the implementation of the classical finite
state automata used in regexp compilers and matchers: it is within the
dynamic construction and use of multiple tailored collations (but only one
collation is used in each state of the finite state automata), and in its
input lexer, because the same input file may need to be "lexed" into
different collation elements, in parallel, according to the current states
of the finite state automata, each state of the automata referencing its own
input lexer depending on the locale in scope.
---- Note that in regexp matchers, the finite state automata built from the regexp has multiple active states in their state transition graph, and the number of active states varies after each input because every transition in the graph is possible and must be tested in simultaneously, due to aggregates like "+", "*", "?" or "{m,n}" counters, but also because of alternatives with "|" and "[set]" which may not be exclusive; on the opposite, the construction of the graph of transitions and their condition, is trivial from the regexp itself. In the transition graph, there's only one node representing the initial state and only one node representing a final matching state (however when a match is found, this may not be the only active node, and if multiple matches must be searched one must add a null transition from the final node to the initial node, but keep the existing list of active nodes when continuing the search). There is also often an additinial final state meaning the absence of match in the current state (this node has no successor, the only transition from this terminal node requires an unconditional read of one default collation element from the input before looping unconditionally back to the initial node). There are algorithmic methods to optimize the traversal of the transition graph (by the match method fed with the text input): * they try reducing the number of active nodes and the number of transitions (between two connected nodes) to test at each step, but many of these local optimization will explode the number of nodes needed in the transition graph; * so many implementations of regexp matchers don't perform such local optimizations (which may require too much memory and computing time for returning the optimum but very large graph): * such optimization of the transition graph is normally performed in the regexp compilation method, that takes a regexp (plus optional global flags for its specifying its semantic and additional restrictions or case-insensitive extensions) and returns the compiled transition graph; * for details, look into the wellknown Aho-Seti-Ullman book or other good reference book on the implementation of compilers, finite-state automatas and regexps...
This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 00:53:30 CST