Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 11:06:31 CDT

Next message: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Previous message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"
In reply to: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"

>> As far as your other comments (copied below), the issue is as to what
>> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
>> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
>> "ñ", "ch", "ll", "rr"}.
>> • The set inversion would be the set of all other strings. So that would
>> include "0", "A", ... but also "New York", and "onomotopaeic", and so on. An
>> infinite set.
>
> Why do you assume such huge extension of the input universe ?
>
> The only needed thing is that the inversion set has to be universe minus the
> positive set, and that /./ has to include all possible positive sets, in
> such a way that {/[set]/, /[^set]/} is an exact partition of the universe of
> acceptable input units.

I think it is wrong to treat [^set] as some 'universe' minus [set].
The way I think of it, [^set] matches at any position where [set]
does not match. As a simple example, consider the expression:

/^[\q{ch}].*/ # text must start with 'ch'

This will match the input strings "churro" and "chimichanga", but won't
match "caliente".

Now if we negate the set, we have the expression:

/^[^\q{ch}].*/ # text must not start with 'ch'

Then the matching behavior is just the opposite: "caliente" matches,
while "churro" and "chimichanga" do not. In my opinion, this is what
an end user would expect.
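
To make the two expressions above concrete, here is a rough sketch in
Python. Python's re module has no \q{...} syntax, so this is an emulation
and not my implementation: the helper set_pattern is hypothetical, a
positive set of strings becomes an alternation, and the negated set is
assumed to mean "one character, at a position where no string in the set
matches", written with a negative lookahead.

```python
import re

def set_pattern(strings, negate=False):
    """Hypothetical helper emulating [\\q{...}] and [^\\q{...}]."""
    # Longest strings first, so "ch" would be tried before a lone "c".
    alt = "|".join(re.escape(s) for s in sorted(strings, key=len, reverse=True))
    if negate:
        # Assumption: the negated set consumes one character at a
        # position where none of the strings in the set matches.
        return f"(?!(?:{alt}))."
    return f"(?:{alt})"

# /^[\q{ch}].*/   text must start with 'ch'
starts_with_ch = re.compile("^" + set_pattern({"ch"}) + ".*")

# /^[^\q{ch}].*/  text must not start with 'ch'
not_starts_with_ch = re.compile("^" + set_pattern({"ch"}, negate=True) + ".*")

for word in ("churro", "chimichanga", "caliente"):
    print(word,
          bool(starts_with_ch.match(word)),
          bool(not_starts_with_ch.match(word)))
```

Under this reading the two patterns give opposite answers on each of the
three words, which is the behavior described above.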

> You are not required to include in /./ all codepoints in the UCS, you may
> restrict /./ to include only assigned and valid characters....

One problem with restricting . to match only assigned characters is
that text containing characters assigned in a later version of Unicode
than the one the implementation knows about will produce false
negatives. In my implementation, I provide \a to match only assigned
characters (and \A to match unassigned characters), and you can specify
which version of Unicode to use with, for example, \v{4.1}.
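
The version dependence can be seen with a small Python sketch. This is
not my \a or \v{...} implementation; it only assumes that "assigned"
means "general category is not Cn" (which also counts noncharacters as
unassigned), and it uses the one older snapshot Python ships,
unicodedata.ucd_3_2_0, in place of a selectable version such as 4.1.
The helper is_assigned is hypothetical.

```python
import unicodedata

def is_assigned(ch, ucd=unicodedata):
    """Hypothetical helper: True if ch has a general category other
    than Cn (unassigned) in the given Unicode database."""
    return ucd.category(ch) != "Cn"

old_ucd = unicodedata.ucd_3_2_0   # Unicode 3.2.0 snapshot shipped with Python
rupee = "\u20b9"                  # INDIAN RUPEE SIGN, added in Unicode 6.0

print(unicodedata.unidata_version, is_assigned(rupee))   # current database
print(old_ucd.unidata_version, is_assigned(rupee, old_ucd))
```

A matcher built on the 3.2.0 data would treat the rupee sign as
unassigned, so a . restricted to assigned characters would refuse to
match it even though the text is perfectly valid today.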

Mike

