From: Andy Heninger (andy.heninger@gmail.com)
Date: Fri Sep 21 2007 - 15:29:29 CDT
On 9/20/07, Mike <mike-list@pobox.com> wrote:
>
> > Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
> >
> > http://www.unicode.org/reports/tr18/tr18-12.html
> >
> > This proposed update clarifies conformance requirements for "." and
> > CRLF.
> > Public feedback is invited.
>
> I disagree with the MUSTs in the proposed text. In my implementation,
> whether "." matches newline sequences is independent of "multiline
> mode." Multiline mode affects the behavior of ^ and $, not .; in
> single line mode, they match only at the beginning or end of the text
> (or just before a final newline sequence); in multiline mode, ^ matches
> at the beginning of the string or after any newline sequence, and $
> matches before any newline sequence or at the end of the string.
This is my understanding also. Multiline mode affects only the behavior of
^ and $, and does not control whether "." matches a newline sequence.
A separate option, "DotAll" (Java terminology) or "Single Line Mode"
(classic regex terminology), controls whether "." matches a newline
sequence.
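
To make the distinction concrete, here is a small sketch using Python's re module, chosen only because it exposes the two options as separate flags (re.MULTILINE and re.DOTALL). The sample text and variable names are mine, and Python's "." only knows about "\n", so this illustrates the independence of the two modes and nothing about Unicode line endings.

```python
import re

text = "one\ntwo"

# MULTILINE changes only where ^ and $ may match.
anchors_default = re.findall(r"^\w+$", text)
anchors_multiline = re.findall(r"^\w+$", text, re.MULTILINE)

# DOTALL changes only whether "." may match a newline.
dot_default = re.fullmatch(r".+", text) is not None
dot_dotall = re.fullmatch(r".+", text, re.DOTALL) is not None

# Each flag leaves the other behavior untouched.
dot_with_multiline = re.fullmatch(r".+", text, re.MULTILINE) is not None
anchors_with_dotall = re.findall(r"^\w+$", text, re.DOTALL)

print(anchors_default, anchors_multiline)
print(dot_default, dot_dotall)
print(dot_with_multiline, anchors_with_dotall)
```

Turning on one flag never changes the result that belongs to the other, which is the behavior Mike describes for his implementation.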
I think it might be best if, to the extent that we can, we avoid
descriptions and listings of specific regexp modes, and instead say that any
operations that are sensitive to newlines must recognize all of the Unicode
line-ending characters and sequences. The idea is to avoid any implication
that a list of regexp operations, modes or tests that we include is
complete, and to avoid having to describe too many things that don't
pertain directly to Unicode.
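
As a rough illustration of "recognize all of the Unicode line-ending characters and sequences", the sketch below defines one newline pattern and reuses it for every newline-sensitive operation. The character list is my reading of the line-boundary list in UAX #18 (LF, VT, FF, CR, NEL, LS, PS, and CRLF as a unit); the helper names split_lines and line_starts are hypothetical, not taken from any implementation.

```python
import re

# Assumed list of line terminators, per my reading of UAX #18.
# CRLF is listed first so that it is consumed as a single sequence.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at every recognized newline sequence."""
    return NEWLINE.split(text)

def line_starts(text):
    """Offsets where a multiline-mode '^' would be allowed to match."""
    return [0] + [m.end() for m in NEWLINE.finditer(text)]

print(split_lines("a\r\nb\u2028c\x85d\re"))
print(line_starts("a\r\nb\u2028c"))
```

Defining the set once and deriving ^, $, "." and line splitting from it avoids having to enumerate modes at all, which is the point of the suggestion above.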
> You can turn on the DotMatchesNewline and MultilineMatching options
> separately. As a side note, I implemented "." to match a default
> grapheme cluster, so A + ACUTE is treated as a single entity, and
> Hangul syllables are kept together (you can also specify them using
> \L+\V+\T* if you want).
I've been contemplating doing something along these lines also, but for more
than just ".", and for somewhat different reasons. Making the fundamental
unit of matching a Grapheme Cluster, so that a plain "A" would not match
the "A" in "A + ACUTE", would be a clean way to define a canonically
equivalent match. It is not too hard to explain, the results are completely
independent of normalization form, and match boundaries would never fall
inside a composed character. Clusters would hang together in the pattern also,
so that quantifiers (*, +, ?, etc.) would apply to the entire preceding
cluster, not to the preceding code point.
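
A minimal sketch of the idea, for literal strings only and under a loud assumption: a "cluster" here is just a base code point plus any following combining marks (general category M*), which is a crude stand-in for real default grapheme clusters and does not keep Hangul syllables together. The function names clusters and find_cluster are made up for illustration.

```python
import unicodedata

def clusters(text):
    """Approximate clusters: a base code point plus trailing combining marks.

    This is a simplification, NOT the full default grapheme cluster rules.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch          # attach the mark to the preceding base
        else:
            out.append(ch)         # start a new cluster
    return out

def find_cluster(needle, haystack):
    """Index (in clusters) of the first whole-cluster match, or -1."""
    # Decompose both sides so the comparison ignores normalization form.
    n = clusters(unicodedata.normalize("NFD", needle))
    h = clusters(unicodedata.normalize("NFD", haystack))
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1

print(find_cluster("A", "A\u0301B"))       # plain A against A + ACUTE, B
print(find_cluster("\u00c1", "xA\u0301"))  # precomposed against decomposed
```

Because the comparison is cluster by cluster, a plain "A" cannot match the base of "A + ACUTE", and the precomposed and decomposed spellings find each other regardless of which one appears in the pattern.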
===
Regarding the question of how to complement a [^set] that contains strings,
or grapheme clusters, or collation elements, or whatever we want to call
them, I am still struggling with what it means, and what makes sense. I'm
not sure the concept makes complete sense unless there is some interesting,
not too big "Universe" from which the original strings could be removed.
Maybe something language or locale sensitive, although in general I dislike
the idea of making matching be sensitive to such things.
-- Andy