From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 22 2007 - 16:13:19 CDT
Hans Aberg [mailto:haberg@math.su.se]
> On 22 Oct 2007, at 22:16, Philippe Verdy wrote:
>
> > Note that L may contain strings such as a base letter followed by
> > a diacritic, which is canonically equivalent to its precomposed
> > form. Would only the precomposed form be allowed in [L]? The
> > definition of "length" is not precise enough. For me, the composed
> > and precomposed letters should behave identically, and so their
> > "length" should be 1 in both cases. If so, then [L] will contain
> > BOTH the precomposed letter and the sequence of a letter and a
> > diacritic.
>
> Read all the stuff. There are different constructions.
Try to reformulate your stuff so that it avoids the confusion between
regexps and the strings they match.
For me, a regexp is not a string, but a function mapping any text to a set
of matches. But for simplicity we need another object, i.e. a function that
returns true only if there's a full match and just returns true or false
instead of a set of matches.
Let's define it so that:
Match[r] : String -> Boolean, where r is a character (used as a regexp, but
one in which r has no special meaning)
Match["a"] (x) = true, if x="a"
Match["a"] (x) = false, otherwise
(Here it is just a function that tests whether its argument is canonically
equivalent to the character r.)
This definition is consistent with the Unicode process conformance rule for
its argument. But it does not indicate anything about the syntax actually
used in the regexp written between the brackets.
Match["a\u0301"] ("a") = false
Match["a\u0301"] ("\u00E1") = true
Match["\u00E1"] ("a\u0301") = true
(note above: there's no "\u" notation in the source, this is a way to refer
to the actual character only for the definition)
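As a minimal sketch of the Match definition above (in Python, which is my
choice here, not something the thread prescribes): the helper names
canonically_equivalent and make_match are hypothetical, and canonical
equivalence is tested by comparing NFD normalizations, which is one standard
way to do it.

```python
import unicodedata


def canonically_equivalent(s, t):
    # Two strings are canonically equivalent iff their NFD forms are identical.
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


def make_match(r):
    """Return Match[r]: a predicate that is true only on a full match of r,
    where "match" means canonical equivalence (~), not strict equality (=)."""
    def match(x):
        return canonically_equivalent(r, x)
    return match


match_a_acute = make_match("a\u0301")
print(match_a_acute("a"))                # False
print(match_a_acute("\u00E1"))           # True
print(make_match("\u00E1")("a\u0301"))   # True
```

The three printed values correspond to the three Match lines given above.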
The definition of a Regexp RE is that it will return a set of matches from
its input text argument T, each returned match being defined by the
association of the source text T and an interval of positions (a_i,b_i)
within that text, such that the substring extracted from the source text
with this interval satisfies:
Match[RE]( T.substring(a_i,b_i) ) = true.
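A rough sketch of that definition, again in Python and restricted to the
simplest case where RE is a literal: it enumerates every interval of T and
keeps those whose substring passes the full-match test. The names
full_match_literal and find_matches are hypothetical, the intervals are
assumed to be half-open (a_i included, b_i excluded), and the brute-force
enumeration is only meant to mirror the definition, not to be efficient.

```python
import unicodedata


def full_match_literal(literal, candidate):
    # Match[literal](candidate): true iff the two are canonically equivalent.
    def nfd(s):
        return unicodedata.normalize("NFD", s)
    return nfd(literal) == nfd(candidate)


def find_matches(literal, text):
    """Return the set of intervals (a, b) such that
    Match[literal](text[a:b]) is true."""
    return {
        (a, b)
        for a in range(len(text) + 1)
        for b in range(a, len(text) + 1)
        if full_match_literal(literal, text[a:b])
    }


# T contains both the decomposed and the precomposed form of a-acute.
text = "xa\u0301y\u00E1"
print(sorted(find_matches("\u00E1", text)))  # [(1, 3), (4, 5)]
```

Both occurrences are returned, even though one interval spans two code
points and the other spans one.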
Then try reformulating your language. There's a clear separation between the
language of regexps and the language of strings that it matches, because
they don't use the same symbols:
- the language of strings is U* where U is the Unicode character set, which
defines two equivalence relations: = (strict equality) and ~ (canonical
equivalence).
- the language of regexps is (U union R)*, where R is the set of regexp
operators and U designates literals. This language has NO canonical
equivalence, except where one is explicitly defined by an operator of R;
- there's a third language, which results from a surjection of the previous
language into U*, and this function is the syntax of regexps; this is the
language that we use to specify regexps like "x.*" (where "." and "*" are
not interpreted as literal characters, but as operators defining classes);
there are tons of such languages, but here we don't care much about the
syntax: we just choose one by convention, and any other language would do
the same job (the difference is only in the syntax, not in its
interpretation).
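To make the separation between the second and third languages concrete,
here is a small Python sketch. Everything in it is an assumption chosen for
illustration: the Lit and Op types, the operator names ANY and STAR, and the
backslash escape are just one possible syntax among the many mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    char: str   # an element of U, taken literally


@dataclass(frozen=True)
class Op:
    name: str   # an element of R, e.g. "ANY" or "STAR"


# Hypothetical concrete syntax: which characters of U stand for which operators.
OP_SYMBOLS = {"ANY": ".", "STAR": "*"}
SYMBOL_OPS = {sym: name for name, sym in OP_SYMBOLS.items()}


def render(tokens):
    """The syntax function: map a regexp over (U union R) to a string in U*."""
    out = []
    for tok in tokens:
        if isinstance(tok, Op):
            out.append(OP_SYMBOLS[tok.name])
        elif tok.char in SYMBOL_OPS or tok.char == "\\":
            out.append("\\" + tok.char)  # escape literals that look like operators
        else:
            out.append(tok.char)
    return "".join(out)


def parse(source):
    """Read the concrete syntax back into a sequence over (U union R)."""
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source):
            tokens.append(Lit(source[i + 1]))  # escaped character is a literal
            i += 2
        elif ch in SYMBOL_OPS:
            tokens.append(Op(SYMBOL_OPS[ch]))
            i += 1
        else:
            tokens.append(Lit(ch))
            i += 1
    return tokens


print(parse("x.*"))   # [Lit(char='x'), Op(name='ANY'), Op(name='STAR')]
print(parse("x\\."))  # [Lit(char='x'), Lit(char='.')]
```

Swapping OP_SYMBOLS for another table changes the third language without
touching the token sequences, which is the sense in which the difference is
only syntactic.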