Re: New Public Review Issue: Proposed Update UTS #18

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Sep 23 2007 - 13:51:13 CDT

    On 9/23/07, Mike <mike-list@pobox.com> wrote:
    >
    > >> As far as your other comments (copied below), the issue is as to what
    > >> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our
    > >> reasoning.
    > >> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
    > >> "ñ", "ch", "ll", "rr"}.
    > >> • The set inversion would be the set of all other strings. So that
    > >> would include "0", "A", ... but also "New York", and "onomatopoeic",
    > >> and so on. An infinite set.
    > >
    > > Why do you assume such huge extension of the input universe?
    > >
    > > The only needed thing is that the inversion set has to be universe
    > > minus the positive set, and that /./ has to include all possible
    > > positive sets, in such a way that {/[set]/, /[^set]/} is an exact
    > > partition of the universe of acceptable input units.
    >
    > I think it is wrong to think of [^set] as being some 'universe' minus
    > [set]. The way I think of it is that [^set] matches anywhere [set]
    > does not match. As a simple example, consider the expression:
    >
    > /^[\q{ch}].*/ # text must start with 'ch'
    >
    > This will match the input strings "churro" or "chimichanga", but won't
    > match "caliente."
    >
    > Now if we negate the set, we have the expression:
    >
    > /^[^\q{ch}].*/ # text must not start with 'ch'
    >
    > Then the matching behavior is just the opposite: "caliente" matches,
    > while "churro" and "chimichanga" do not. In my opinion, this is what
    > an end user would expect.

    The difficulty is masked by your use of .* afterwards.

    Take /[\q{ch}]/. It matches exactly the string "ch". By your logic,
    /[^\q{ch}]/ matches all strings that are not "ch", including, as I said,
    "New York", and "onomatopoeic", and this entire email.
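    A rough sketch of how the trailing .* hides the question, using Python's
    re module. Python has no \q{...} syntax, so the negated set is emulated
    with a negative lookahead. The names ONE, MANY, and accepts are
    illustrative assumptions, as are the two readings themselves: one consumes
    a single code point, the other consumes any amount.

```python
import re

# Two hypothetical readings of [^\q{ch}], emulated with a negative lookahead.
ONE = r"(?!ch)[\x00-\U0010FFFF]"    # reject a leading "ch", else one code point
MANY = r"(?!ch)[\x00-\U0010FFFF]*"  # reject a leading "ch", else any amount


def accepts(negated_set: str, tail: str, text: str) -> bool:
    """True if negated_set followed by tail matches the whole of text."""
    return re.fullmatch(negated_set + tail, text) is not None


# Followed by .*, the two readings accept and reject the same inputs.
for text in ("caliente", "churro", "chimichanga"):
    print(text, accepts(ONE, ".*", text), accepts(MANY, ".*", text))

# On their own, they disagree about how much input the set consumes.
print(accepts(ONE, "", "New York"), accepts(MANY, "", "New York"))
```

    With .* after the set, both readings give the same answers for
    "caliente", "churro", and "chimichanga". Without it, only the second
    reading lets the set swallow all of "New York".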

    I think a clearer way of thinking about it is that [a-z \q{ch} \q{rr}] is
    equivalent to ( [a-z] | ch | rr ) [actually to (?:[a-z]|ch|rr), but let's
    forget about capturing for the moment to make things simpler.] Then the
    question is what the 'inverse' of ( [a-z] | ch | rr ) is supposed to be
    equivalent to. There are a variety of possibilities:

       1. [^a-z] -- fail with strings starting with a-z and otherwise advance
       by one code point
       2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
       starting with a-z, ch, or rr, and otherwise advance by one code point
       3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
       ch, or rr, and otherwise advance by grapheme cluster
       4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
       strings starting with a-z, ch, or rr, and otherwise advance by tailored
       grapheme cluster (for traditional Spanish, would include ch, ll, rr,
       and thus allow "ll")
       5. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}]* -- fail with strings
       starting with a-z, ch, or rr, and otherwise advance by any amount
       6. (?! ([a-z] | ch | rr) $) [\x{0}-\x{10FFFF}]* -- fail with strings
       exactly matching a-z, ch, or rr, and otherwise advance by any amount
       7. illegal -- you can't use ^ with sets containing strings.
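    Options 1, 2, 5, and 6 can be tried out with Python's re module as a
    rough sketch. Options 3 and 4 are left out because the standard re module
    has no \X. The helper names option_patterns and consumed, and the
    second example set [ñ \q{ch}], are assumptions made for illustration only.

```python
import re

ANY = r"[\x00-\U0010FFFF]"  # any single code point


def option_patterns(chars: str, strings: tuple) -> dict:
    """Build emulations of options 1, 2, 5, 6 for [^<chars> \\q{s1} \\q{s2} ...]."""
    positive = "|".join([f"[{chars}]"] + [re.escape(s) for s in strings])
    return {
        1: f"[^{chars}]",                 # ignore the strings entirely
        2: f"(?!{positive}){ANY}",        # lookahead, then one code point
        5: f"(?!{positive}){ANY}*",       # lookahead, then any amount
        6: f"(?!(?:{positive})$){ANY}*",  # reject exact matches only
    }


def consumed(option: int, text: str, chars: str = "a-z",
             strings: tuple = ("ch", "rr")):
    """Return what the option consumes at the start of text, or None on failure."""
    m = re.match(option_patterns(chars, strings)[option], text)
    return m.group(0) if m else None


for text in ("New York", "churro", "ch"):
    print(repr(text), {n: consumed(n, text) for n in (1, 2, 5, 6)})

# With a set such as [ñ \q{ch}], options 1 and 2 disagree on "churro".
print(consumed(1, "churro", chars="ñ", strings=("ch",)),
      consumed(2, "churro", chars="ñ", strings=("ch",)))
```

    For the set in the list, options 1 and 2 behave the same, because "ch"
    and "rr" both begin with a letter already in a-z. They diverge once a
    string in the set starts with a character outside the single-character
    part. Options 5 and 6 consume all of "New York" in one step, which is the
    "advance by any amount" behaviour described above.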

    #1 is the current approach in UTS #18. #5 and #6 are the ones I was
    against. They clearly wouldn't work; they would break any use of existing
    ranges in regular expressions. #7 disallows the use of user-perceived
    characters like x+acute, although it might be a good choice for the
    non-grapheme-cluster-recognizing mode. #4 only works with
    language-sensitive modes, which are somewhat tenuous. #2 and #3 are
    possibilities.

    Note also that the UTC is proposing a somewhat more inclusive grapheme
    cluster than the default, one that is still language-neutral. The proposed
    update to UAX #31 will be going up soon.


