From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 06 2007 - 22:36:49 CDT
Mike wrote:
> Mark Davis wrote:
> > If sets don't have an implied ordering, do we need to require a POSIX
> > style longest match, which could be slow?
>
> In a set, I keep track of the strings by their length, so the longest
> match is always found. I don't think you want to backtrack and try a
> shorter string since the longer match is supposed to be treated as a
> unit....
Mark, the question is not whether "sets don't have an implied ordering". By
definition a set is an unordered thing; there is no such thing as an ordered set.
Two points are then worth considering:
(1) Ordering does not affect sets themselves, but only the way they are
*constructed* using set constructors like ranges. The ranges that you can
insert within a set constructor are ordered.
So please don't confuse the regexp written syntactically as "[abc]", which is
a set constructor (more exactly a compound constructor), with the set it
builds. In other words, "[abc]" is NOT a set but an operation that builds a
set from several constants ('a', 'b', 'c') and operators ('[', ']' and the
side-by-side concatenation, which here is equivalent to a function
application on its right side, and to a composition of functions on its left
side).
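To illustrate (a minimal sketch in Haskell, where charClass is a name I am
inventing, not anything from a real regexp engine): the constructor "[abc]"
can be modelled as a function that builds an unordered set from an ordered
list of constants:

    import qualified Data.Set as Set

    -- "[abc]" as an operation: it consumes an ordered list of constants
    -- and produces a set, which itself carries no order.
    charClass :: [Char] -> Set.Set Char
    charClass = Set.fromList

    -- The constructors "[abc]" and "[cba]" differ, but the sets they
    -- build are equal:
    sameSet :: Bool
    sameSet = charClass "abc" == charClass "cba"   -- True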
(2) What are the ur-elements in this set?
(For the definition of an ur-element, and how it differs from an element,
see for example Wikipedia: an element can contain things, while an
ur-element can only be part of a set. A synonym of "ur-element" is "atom"
in some mathematical branches that don't distinguish between "complete" and
"incomplete" domains; but every incomplete domain, where atoms and
ur-elements are considered distinct, can be extended into a complete domain,
where they no longer differ.)
To determine which are the ur-elements, the first thing you must establish is
the domain, and you must also decide whether the domain is complete (in which
case you must admit the existence of the undetermined element, generally
named "unknown", "undefined" or "bottom", as a member of ALL sets in this
domain) or not (in which case you need to discriminate between strict and
non-strict pattern matching rules).
This is justified mathematically (by Gödel's theorem) for any
self-consistent and complete logic system: to make the system complete, the
existence of the undetermined element within the domain is absolutely
necessary. Its main properties are that it is undecidable (neither true nor
false), infinite (if you use finite inference rules), omnipresent (present in
every set, including all meta-sets and the set of functions acting on
elements of the domain), matchable (using pattern-matching rules), and
productive (it transforms an unproductive system that never terminates within
an incomplete domain into a productive system that is guaranteed to terminate
in a finite number of steps when working in the complete domain).
One of the consequences of Gödel's theorem is that EVERY incomplete system
can be viewed as a consistent part of a complete system: you just need to
add the "bottom" element as a finite and well-behaved ur-element of your
domain.
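In Haskell-like notation (a sketch only; Lifted, Bottom and Chr are names I
am inventing here), completing a character domain just means adjoining one
extra constructor:

    -- The incomplete domain Char, extended into a complete domain by
    -- adjoining the undetermined element as an ordinary constructor.
    data Lifted = Bottom | Chr Char
      deriving (Eq, Show)

    -- Every value of the complete domain can now be pattern-matched,
    -- including the undetermined element:
    describe :: Lifted -> String
    describe Bottom  = "undetermined"
    describe (Chr c) = "the character " ++ [c]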
Another consequence is that a complete domain is necessarily structured (not
flat), so that the domain can be fully ordered; complete domains are
necessarily fully ordered. That is, in a complete domain there exists a
binary relation R between any two elements a and b of the domain such that
(a R b) always takes a unique value within the domain. If the domain is
complete, it contains the Boolean ur-elements FALSE and TRUE, and the set of
Booleans {FALSE, TRUE} also implicitly includes the BOTTOM value, which is
even a valid finite member of the empty set! (Forget the initial assumption
that an empty set cannot contain any finite ur-element; this does not
contradict the concept of set cardinality, which is just a function from any
set to the set of integers, mapping the empty set to zero: the BOTTOM value
that is a member of the empty set is simply not counted, something that is
possible because the complete domain allows pattern matching on the BOTTOM
value.)
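As a sketch of that paragraph (again with invented names, and with B standing
for BOTTOM), the Booleans of the complete domain look like a three-valued
type, and every operation on them must say what it does on B:

    -- The Booleans of the complete domain: FALSE, TRUE, and BOTTOM.
    data Tri = F | T | B
      deriving (Eq, Show)

    -- One plausible three-valued conjunction: F absorbs, B propagates.
    triAnd :: Tri -> Tri -> Tri
    triAnd F _ = F
    triAnd _ F = F
    triAnd T T = T
    triAnd _ _ = B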
Let's return to our discussion: we need a "complete" system that describes
texts, strings, characters, operations like concatenation, and an operation
between two strings (a regexp and any other text) that maps them to a
Boolean. In other words, we want to define predicates in a way that is
self-consistent, induces no contradictions (the predicate never takes both
the value FALSE and TRUE), and returns an answer in finite time. For this to
work as intended you need a BOTTOM value. In regexps, the BOTTOM value is
NOT "." (that matches all finite ur-elements of the domain, but NOT the
BOTTOM value itself).
And we need a clear definition of the ur-elements in the domain: to be
consistent, this system must be built so that all strings, all characters,
and all operations on them are definable in terms of ur-elements:
(a) Either you decide that characters are all atomic (i.e. ur-elements), and
then strings are simply ordered vectors of atomic elements. The "." regexp
will then match a single character and nothing else, i.e. every possible
character except BOTTOM. This is the interpretation of strings in classic C
and POSIX locales.
(b) Or you decide that characters are NOT atomic, but are in fact operators
working in a larger domain (for example the domain of collation elements).
In that case strings are viewed as vectors of collation elements, the
concatenation operation used to build texts becomes inconsistent, and it
should not be part of the domain (this effectively happens in the real
world). To solve the inconsistency, you need to redefine the operation of
concatenation in other terms, and you also need to map the "string" datatype
onto something more powerful than a simple element: any string is then not
simply a vector, but a function of your domain, and the operation of
concatenating two strings can be mapped onto the operation of composing two
functions of that domain... In that case, the "string" datatype is just a
restriction of the more general "text" datatype to the subset of ur-elements
where there is no interaction between the collation elements that you have
kept. This is the interpretation of texts in human languages.
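The two interpretations can be sketched in Haskell as follows (all names here
are placeholders of mine; in particular CollationElement is just a stand-in
representation):

    -- (a) characters are atomic: a string is an ordered vector of them,
    -- and concatenation is plain list append.
    type StringA = [Char]

    concatA :: StringA -> StringA -> StringA
    concatA = (++)

    -- (b) characters act on a larger domain: a text is a function over
    -- collation elements, and concatenation is function composition.
    type CollationElement = Int    -- placeholder representation
    newtype Text = Text ([CollationElement] -> [CollationElement])

    concatB :: Text -> Text -> Text
    concatB (Text f) (Text g) = Text (f . g)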
This makes a serious difference, but one way to think about the various
complete systems you can build with the latter definition of "string" is
that they are all particular instances of the "text" datatype within a
locale context that defines all the possible interactions between characters
and strings in such basic operations as concatenation.
My personal view about the regexps needed for standardization is that it is
necessary to define them in a way that is self-consistent, and complete
enough to be as powerful as a Turing system.
This is possible, but it requires a very strict mathematical definition,
using mathematical language theory and other important results like Gödel's
theorem for completeness criteria, which is also extremely important for
being able to decide about computability in finite time (or finite
resources).
I have built my regexp matcher algorithm using such a formalism, and one of
its marvellous effects is that the addition of BOTTOM within the rules of my
regexp matcher (so that it becomes matchable) has considerably simplified
the implementation. And exactly the same algorithm implementation can work
on every model of texts, i.e. (a) or (b) above, by unifying the two and
saying that the ur-elements are defined according to a locale context.
Another benefit is that I no longer need any backtracking, and the system is
perfectly decidable in a finite number of steps (this is guaranteed because
I have built my system to be complete, simply by adding the possibility of
matching BOTTOM in a single finite decision step).
Suppose that your initial domain contains the set of characters [ab], any
character from this set, and any string resulting from the concatenation of
these characters (the operation of concatenating strings is also part of the
domain). This domain is not decidable without adding to it the two Booleans
FALSE and TRUE (which allow writing predicates like membership) and the
basic operations on sets (union, intersection and negation). The system is
also not complete without adding another element to it, which you can note
'?' (the BOTTOM value).
Then the basic pattern matching rules for characters need to include:
* BOTTOM `matches` BOTTOM = TRUE
* BOTTOM `matches` . = BOTTOM
* . `matches` BOTTOM = BOTTOM
* 'a' `matches` 'a' = TRUE
* 'b' `matches` 'b' = TRUE
* _ `matches` _ = FALSE
(Note that what is noted by _ matches every ur-element except BOTTOM; this
is consistent with the role of "." in regexps, but that is another question,
because it depends on the semantics you assign to "." in regexps, whereas
here we are creating rules operating on every pair of members of the
complete system, including texts, regexps, and operations on regexps!)
Note that in this formal domain, the top 3 rules are needed because a
self-consistent and complete formal system must be able to describe itself
using its own language, where every function of two elements is described as
an ordered suite of pattern matching clauses, with an implicit final clause
that you don't need to specify and that matches all the rest and assigns it
the BOTTOM value.
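Transcribed into Haskell (a language I'll come back to below) as a sketch
only: the types are the invented ones from the earlier sketches, redeclared
so this stands alone, and the two literal clauses for 'a' and 'b' are
generalized to character equality:

    data Lifted = Bottom | Chr Char   deriving (Eq, Show)
    data Tri    = F | T | B           deriving (Eq, Show)

    matches :: Lifted -> Lifted -> Tri
    matches Bottom  Bottom  = T       -- BOTTOM `matches` BOTTOM = TRUE
    matches Bottom  _       = B       -- BOTTOM `matches` .      = BOTTOM
    matches _       Bottom  = B       -- .      `matches` BOTTOM = BOTTOM
    matches (Chr a) (Chr b)           -- 'a' `matches` 'a' = TRUE, etc.,
      | a == b    = T                 -- with the final catch-all clause
      | otherwise = F                 -- _ `matches` _ = FALSE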
There exist some computer languages that allow defining such a complete
system: Haskell is one of them. However, it does not enforce or require the
completeness of functions, so you have to add an explicit pattern matching
clause to your function definitions, or choose between two alternate
syntaxes that make the function complete (non-strict) or incomplete
(strict).
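For instance (a standard pair of Haskell syntaxes that exhibits the
distinction; nothing here is specific to my matcher): an irrefutable pattern
written with ~ gives the non-strict (complete) behaviour, while a plain
constructor pattern gives the strict (incomplete) one:

    -- Non-strict: the ~ pattern never forces its argument, so the
    -- function is total even on BOTTOM: lazyZero undefined == 0.
    lazyZero :: (Int, Int) -> Int
    lazyZero ~(_, _) = 0

    -- Strict: the constructor pattern forces the pair, so
    -- strictZero undefined diverges.
    strictZero :: (Int, Int) -> Int
    strictZero (_, _) = 0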
The remarkable thing about a complete language (like Haskell) is that it can
be fully described in its own language (though that does not say how to
compile it in finite time into a finite object; but Haskell does not need to
compile a program completely for it to become productive, because it uses
something else, notably "lazy evaluation", without which it would not work;
lazy evaluation is guided by a limited set of strict pattern matching rules,
allowing the language to work as intended).
(Note that Haskell does not define any syntax for using BOTTOM explicitly
and directly. Its "_" notation is not BOTTOM, and its implicit "forall
quantifiers" are not BOTTOM either, but synonyms for "_", except that they
are given specific names for the purpose of matching different instances of
"_". Haskell just uses BOTTOM implicitly in the way its pattern matching
system works for type inference, something that is enough to allow programs
written in Haskell to build an expression that is equivalent to BOTTOM, so
that the absence of enforcement of strictness in Haskell allows the language
to be complete according to Gödel's theorem.)
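Concretely (standard Haskell, using undefined as the built-in expression
that denotes BOTTOM): "_" matches every value without evaluating it, which
is exactly why BOTTOM can flow through it:

    -- "_" is a pattern, not BOTTOM: it matches anything, unevaluated.
    ignore :: a -> Int
    ignore _ = 0

    main :: IO ()
    main = print (ignore undefined)   -- prints 0; "_" never forces BOTTOM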
I prefer the "complete" approach when building a regexp matcher: it is much
simpler and allows building a language that is as powerful as a Turing
machine; any incomplete approach would not solve all problems, and could
require algorithms that are undecidable in finite time because they lack
some possibility, such as not permitting matches on BOTTOM.
Within my implementation, BOTTOM is just implicitly a member of all
"character" classes like "[abc]" (including negated classes like "[^abc]",
and the "any character" class "."). I don't even need a syntax within
regexps to match it explicitly (although I could add one using some
\notation; it is not needed for matching any regexp against texts that don't
contain it, but it could be used to match a regexp against another regexp,
and to build a finite algorithm that decides whether two languages are
semantically equivalent).
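A sketch of that convention (reusing the invented Lifted and Tri types,
redeclared so this stands alone; answering TRUE for BOTTOM is the convention
described above, not a standard one):

    data Lifted = Bottom | Chr Char
    data Tri    = F | T | B   deriving Show

    -- BOTTOM is implicitly a member of every class, including negated
    -- classes and the "any character" class.
    inClass :: Lifted -> (Char -> Bool) -> Tri
    inClass Bottom  _ = T
    inClass (Chr c) p = if p c then T else F

    -- e.g. inClass Bottom (`elem` "abc")    evaluates to T
    --      inClass (Chr 'z') (`elem` "abc") evaluates to F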
But BOTTOM appears to be necessary for the implementation (and yes, this
really simplifies things a lot, because now I know some theorems about the
completeness of the language, which is now definable in a finite way, with a
finite number of inference rules and a finite number of ur-elements!)