From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Sep 21 2007 - 11:32:47 CDT
> allowing multiple values in a property definition such as \p{gc=L|M|N} or
> \p{nv>=10}.
Allowing multiple values is a nice way to compact the regex. Similarly, in
my implementation I actually allow a regex within the property value, so for
example have \p{name=/.*MARK.*/} to pick up all the Unicode characters with
"MARK" in their name. A bit squirrely, but very handy. We might mention some
of these techniques as possibilities.
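
As a rough illustration of the two ideas above (multiple values in one
property expression, and a regex as the property value), here is a minimal
Python sketch built on the standard unicodedata module. The helper names
chars_with_name and chars_with_gc are hypothetical and are not taken from any
implementation discussed here; the sketch simply scans code points rather than
extending a regex engine.

```python
import re
import sys
import unicodedata


def chars_with_name(pattern, limit=sys.maxunicode + 1):
    # Rough analogue of \p{name=/.../}: keep every character whose
    # Unicode name fully matches the given regular expression.
    rx = re.compile(pattern)
    found = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name and rx.fullmatch(name):
            found.append(chr(cp))
    return found


def chars_with_gc(values, limit=sys.maxunicode + 1):
    # Rough analogue of \p{gc=L|M|N}: "L" is treated as a prefix, so it
    # covers Lu, Ll, Lt, Lm and Lo.
    prefixes = tuple(values.split("|"))
    return [chr(cp) for cp in range(limit)
            if unicodedata.category(chr(cp)).startswith(prefixes)]


# Restricting the scan to Latin-1 keeps the example fast.
marks = chars_with_name(r".*MARK.*", limit=0x100)
letters_marks_numbers = chars_with_gc("L|M|N", limit=0x80)
```

The limit parameter is only there to keep the scan short; a real engine would
presumably precompute or index these sets.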
As far as your other comments (copied below), the issue is as to what [^a-z
ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
- The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
"ñ", "ch", "ll", "rr"}.
- The set inversion would be the set of all other strings. So that
would include "0", "A", ... but also "New York", and "onomatopoeic", and so
on. An infinite set.
- So a match against /[^x...x]/ would be the equivalent of
/(?![x...x]) .*/, and match, for example, this entire email.
That would change the semantics of regex very substantially. Conversely, if
we define [^x...x] as equivalent to [[\u0000-\U0010FFFF] - [x...x]] it is
well-defined, and matches current regex usage for the cases where no
grapheme clusters are involved.
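
To make the second definition concrete, here is a small Python sketch, under
the assumption that a class is modelled as a set of strings. The name
invert_class is hypothetical. Because the universe being subtracted from
contains only single code points, multi-character strings such as "ch" have no
effect on the result.

```python
import sys


def invert_class(members):
    # Model [^...] as [all code points] - [...].
    # Only single-code-point members can be subtracted from a universe of
    # single code points; multi-character strings simply drop out.
    singles = {ord(m) for m in members if len(m) == 1}

    def in_inverted(s):
        # Membership test for the inverted set: exactly one code point,
        # within the Unicode range, and not listed in the original class.
        return (len(s) == 1
                and ord(s) <= sys.maxunicode
                and ord(s) not in singles)

    return in_inverted


spanish = set("abcdefghijklmnopqrstuvwxyz") | {"ñ", "ch", "ll", "rr"}
inverted = invert_class(spanish)
```

With this model, inverted("0") and inverted("A") hold, while inverted("a"),
inverted("ñ") and inverted("New York") do not, so the result stays a finite set
of single code points.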
However, there may well be other useful alternatives that should be
considered. So perhaps you can set out your suggestions in more detail. (For
now, we can keep the discussion on this list; if it starts to get too boring
for others we can collect the interested parties and continue
off-line.)
from Mike:
A typical implementation of the inverse of a set containing
literal clusters simply removes those strings, thus
[^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ].
I think this is bad implementation advice, and leads to strange
behavior. In the example given, the behavior will be correct since
all of the clusters begin with a letter also contained in the class.
However, if you consider a character class containing only clusters,
e.g. [^\q{ch} \q{ll} \q{rr}], simply removing the clusters will
result in an empty negated character class that matches -anything-. This
is incorrect behavior as it should not match the beginning of the
word "chile" for instance.
The way I implemented this was to create a "normal" character class
containing all the listed characters and grapheme clusters, and
then invert the result of the match operation. The classes above
would match "chile" in the first position, and thus return a "no
match" result.
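
The approach Mike describes might be sketched in Python roughly as follows.
This is an interpretation, not his code: the function names are hypothetical,
the class is modelled as a plain set of strings, and the sketch only reports
whether the negated class matches at a position, since the description does
not say how many characters a successful negated match should consume.

```python
def positive_match_length(text, pos, members):
    # Length of the longest class member (character or cluster) that
    # starts at pos, or 0 if none does.
    best = 0
    for m in members:
        if text.startswith(m, pos) and len(m) > best:
            best = len(m)
    return best


def negated_class_matches(text, pos, members):
    # Match the "normal" class first, then invert the outcome.
    if pos >= len(text):
        return False  # nothing left to match against
    return positive_match_length(text, pos, members) == 0


clusters = {"ch", "ll", "rr"}
at_start = negated_class_matches("chile", 0, clusters)   # "ch" is present
at_second = negated_class_matches("chile", 1, clusters)  # "hi" is not listed
stripped = negated_class_matches("chile", 0, set())      # clusters removed
```

Here at_start is False, which is the "no match" result described above, while
the stripped variant (the class with its clusters simply removed) is True at
the same position.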
On 9/20/07, Mike <mike-list@pobox.com> wrote:
>
> > As regards Mike's new concerns with the language regarding multiline
> > mode matching, I suggest that he post that to the feedback
> > form, and it will be rolled up into the feedback document
> > that will be considered by the UTC for this PRI.
>
> I have done that, and Rick has verified that the feedback
> was received. I also included more of my implementation
> details such as \m for combining marks, \i for ideographic
> characters, and allowing multiple values in a property
> definition such as \p{gc=L|M|N} or \p{nv>=10}.
>
> Mike
>
>
-- Mark