Re: New Public Review Issue: Proposed Update UTS #18

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Sep 23 2007 - 13:51:13 CDT

Next message: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Previous message: Mike: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
In reply to: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

On 9/23/07, Mike <mike-list@pobox.com> wrote:
>
> >> As far as your other comments (copied below), the issue is as to what
> >> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
> >> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
> >> "ñ", "ch", "ll", "rr"}.
> >> • The set inversion would be the set of all other strings. So that would
> >> include "0", "A", ... but also "New York", and "onomotopaeic", and so on.
> >> An infinite set.
> >
> > Why do you assume such huge extension of the input universe ?
> >
> > The only needed thing is that the inversion set has to be universe minus the
> > positive set, and that /./ has to include all possible positive sets, in
> > such a way that {/[set]/, /[^set]/} is an exact partition of the universe of
> > acceptable input units.
>
> I think it is wrong to think of [^set] as being some 'universe' minus
> [set]. The way I think of it is that [^set] matches anywhere [set]
> does not match. As a simple example, consider the expression:
>
> /^[\q{ch}].*/ # text must start with 'ch'
>
> This will match the input strings "churro" or "chimichanga", but won't
> match "caliente."
>
> Now if we negate the set, we have the expression:
>
> /^[^\q{ch}].*/ # text must not start with 'ch'
>
> Then the matching behavior is just the opposite: "caliente" matches,
> while "churro" and "chimichanga" do not. In my opinion, this is what
> an end user would expect.

The difficulty is masked by your use of .* afterwards.
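
A rough sketch of that masking effect, in Python. The standard `re` module has no \q{...} syntax, so the three patterns below are assumed stand-ins written with plain alternation and lookahead, and the helper name `complement_of_only_ch` is purely illustrative. With the trailing .* the negated form behaves as described in the quoted message; without it, negation taken over whole strings admits almost anything.

```python
import re

# Assumed stand-ins: the stdlib `re` module does not support \q{...}.
STARTS_WITH_CH = re.compile(r"^(?:ch).*")      # /^[\q{ch}].*/
NOT_STARTS_WITH_CH = re.compile(r"^(?!ch).*")  # /^[^\q{ch}].*/
ONLY_CH = re.compile(r"(?:ch)")                # /[\q{ch}]/ with no trailing .*


def complement_of_only_ch(s: str) -> bool:
    """True if s is NOT exactly a member of the set {"ch"}."""
    return ONLY_CH.fullmatch(s) is None


for s in ["churro", "caliente", "ch", "New York"]:
    print(
        f"{s!r}: starts_with_ch={STARTS_WITH_CH.match(s) is not None} "
        f"not_starts_with_ch={NOT_STARTS_WITH_CH.match(s) is not None} "
        f"whole_string_complement={complement_of_only_ch(s)}"
    )
```

Note that "churro" fails the anchored negated pattern but still lands in the whole-string complement, because it is not exactly "ch".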

Take /[\q{ch}]/. It matches all strings consisting of "ch". By your logic,
/[^\q{ch}]/ matches all strings that are not "ch", including, as I said,
"New York", and "onomotopaeic", and this entire email.

I think a clearer way of thinking about it is that [a-z \q{ch} \q{rr}] is
equivalent to ( [a-z] | ch | rr ) [actually to (?:[a-z]|ch|rr), but let's
forget about capturing for the moment to make things simpler.] Then the
question is what the 'inverse' of ( [a-z] | ch | rr ) is supposed to be
equivalent to. There are a variety of possibilities:

   1. [^a-z] -- fail with strings starting with a-z and otherwise advance
   by one code point
   2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
   starting with a-z, ch, or rr, and otherwise advance by one code point
   3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
   ch, or rr, and otherwise advance by grapheme cluster
   4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
   strings starting with a-z, ch, or rr, and otherwise advance by tailored
   grapheme cluster (for traditional Spanish, would include ch, ll, rr,
   and thus allow "ll")
   5. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}]* -- fail with strings
   starting with a-z, ch, or rr, and otherwise advance by any amount
   6. (?! ([a-z] | ch | rr) $) [\x{0}-\x{10FFFF}]* -- fail with strings
   exactly matching a-z, ch, or rr, and otherwise advance by any amount
   7. illegal -- you can't use ^ with sets containing strings.
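
A minimal sketch of options 1, 2, 5, and 6 in Python, to show how much input each one consumes. It assumes the standard `re` module, where [\s\S] stands in for [\x{0}-\x{10FFFF}]; the `options` table and the `consumed` helper are illustrative names only. Options 3 and 4 are left out because the standard module has no \X. For this particular set, options 1 and 2 happen to agree, since "ch" and "rr" both begin with a letter in a-z.

```python
import re

ANY = r"[\s\S]"  # any single code point; stand-in for [\x{0}-\x{10FFFF}]

# Keys are the option numbers from the list above.
options = {
    1: re.compile(r"[^a-z]"),
    2: re.compile(rf"(?![a-z]|ch|rr){ANY}"),
    5: re.compile(rf"(?![a-z]|ch|rr){ANY}*"),
    6: re.compile(rf"(?!(?:[a-z]|ch|rr)$){ANY}*"),
}


def consumed(option: int, text: str):
    """Return what the option matches at the start of text, or None on failure."""
    m = options[option].match(text)
    return None if m is None else m.group(0)


for text in ["New York", "chile", "ch", "x"]:
    results = {n: consumed(n, text) for n in options}
    print(f"{text!r}: {results}")
```

Options 1 and 2 advance by exactly one code point, while 5 and 6 swallow the rest of the input whenever the lookahead succeeds.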

#1 is the current approach in UTS #18. #5 and #6 are the ones I was against.
They clearly wouldn't work; they would screw up any use of existing ranges
in Regex. #7 disallows the use of user-perceived characters like x+acute,
although it might be a good choice for the non-grapheme-cluster-recognizing
mode. #4 only works with language-sensitive modes, which are somewhat
tenuous. #2 and #3 are possibilities.
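
To illustrate the problem with #5, here is a small sketch that places two negated sets in a row, again assuming Python's standard `re` module with [\s\S] standing in for [\x{0}-\x{10FFFF}]. The names `NEG2` and `NEG5` are illustrative. Under #2 the pair consumes exactly two code points; under #5 the first set swallows the rest of the input, including a lowercase letter that a negated [a-z] should never match.

```python
import re

NEG2 = r"(?![a-z]|ch|rr)[\s\S]"   # option 2: advance by one code point
NEG5 = r"(?![a-z]|ch|rr)[\s\S]*"  # option 5: advance by any amount

text = "ABc"

# Two negated sets in sequence, matched from the start of the text.
two_by_option_2 = re.match(NEG2 + NEG2, text).group(0)
two_by_option_5 = re.match(NEG5 + NEG5, text).group(0)

print("option 2:", repr(two_by_option_2))
print("option 5:", repr(two_by_option_5))
```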

Note also that the UTC is proposing a somewhat more inclusive grapheme
cluster than the default, one that is still language-neutral. The proposed
update to UAX #29 will be going up soon.


This archive was generated by hypermail 2.1.5 : Sun Sep 23 2007 - 13:53:36 CDT