From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 11:06:31 CDT
>> As far as your other comments (copied below), the issue is as to what
>> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
>> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
>> "ñ", "ch", "ll", "rr"}.
>> • The set inversion would be the set of all other strings. So that would
>> include "0", "A", ... but also "New York", and "onomatopoeic", and so on. An
>> infinite set.
>
> Why do you assume such a huge extension of the input universe?
>
> The only needed thing is that the inversion set has to be universe minus the
> positive set, and that /./ has to include all possible positive sets, in
> such a way that {/[set]/, /[^set]/} is an exact partition of the universe of
> acceptable input units.
I think it is wrong to think of [^set] as being some 'universe' minus
[set]. The way I think of it is that [^set] matches anywhere [set]
does not match. As a simple example, consider the expression:
/^[\q{ch}].*/ # text must start with 'ch'
This will match the input strings "churro" and "chimichanga", but won't
match "caliente".
Now if we negate the set, we have the expression:
/^[^\q{ch}].*/ # text must not start with 'ch'
Then the matching behavior is just the opposite: "caliente" matches,
while "churro" and "chimichanga" do not. In my opinion, this is what
an end user would expect.
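
The behavior described above can be approximated in Python as a rough
sketch. Python's re module has no \q{...} syntax, so this is an emulation
under stated assumptions, not the implementation discussed in this thread:
the string "ch" is written as a plain group, and the negated set is written
as a negative lookahead followed by one consumed character. The names
STARTS_WITH_CH, NOT_STARTS_WITH_CH and matches are made up for this example.

```python
import re

# Assumption: [\q{ch}] is emulated by the group (?:ch), and [^\q{ch}] by
# "one character at a position where 'ch' does not match".
STARTS_WITH_CH = re.compile(r"^(?:ch).*")
NOT_STARTS_WITH_CH = re.compile(r"^(?:(?!ch).).*")


def matches(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return pattern.match(text) is not None


for word in ("churro", "chimichanga", "caliente"):
    # Each word should satisfy exactly one of the two patterns.
    print(word, matches(STARTS_WITH_CH, word), matches(NOT_STARTS_WITH_CH, word))
```

For any non-empty input, exactly one of the two patterns matches, which is
the partition-like behavior an end user would likely expect, without having
to enumerate an infinite complement set of strings.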
> You are not required to include in /./ all codepoints in the UCS, you may
> restrict /./ to include only assigned and valid characters....
One problem with restricting . to match only assigned characters is
that text containing characters assigned only in a later version of
Unicode will fail to match (false negatives). In my implementation, I
provide \a as a way to indicate you only want to match assigned
characters (and \A matches unassigned characters), and you can specify
which version of Unicode to use with \v{4.1}, for example.
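
To illustrate the false-negative problem, here is a small Python sketch.
It does not use \a, \A or \v{...}, which belong to my implementation. It
relies on Python's unicodedata module, which ships the interpreter's current
Unicode database plus a frozen Unicode 3.2.0 database (unicodedata.ucd_3_2_0).
The helper name is_assigned is made up for this example, and "assigned" is
taken here to mean "general category is not Cn".

```python
import unicodedata


def is_assigned(ch, db=unicodedata):
    """Return True if ch has a general category other than Cn (unassigned)
    in the given Unicode database (default: the interpreter's current one)."""
    return db.category(ch) != "Cn"


old_db = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 data
emoji = "\U0001F600"             # assigned in a version later than 3.2.0

print("current database version:", unicodedata.unidata_version)
print("'a' assigned (3.2.0):   ", is_assigned("a", old_db))
print("emoji assigned (3.2.0): ", is_assigned(emoji, old_db))
print("emoji assigned (current):", is_assigned(emoji))
```

A matcher that accepts only characters assigned in the older database would
reject text containing the later character, even though that text is
perfectly valid under the newer version.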
Mike