From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 12:03:51 CDT
Philippe Verdy wrote:
> The first intuitive approach to what [^set] means is that it should match
> everywhere [set] does not match, and [set] should match everywhere [^set]
> does not match, i.e. they should be perfect complements of each other.
Sorry, Philippe, I just responded to your later message, and didn't
realize you had said this. This is exactly how I implemented [^set]
in my code, and as you say, it is intuitive. We should strive to
have intuitive behavior; the opposite of intuitive is 'obtuse' or
'unnatural'.
> But already, they aren't perfect complements, because both will exclude
> line terminators in multiline mode.
This is not true; [^abc] should match a line terminator. Unless
you do something like [[^abc] & \p{L}].
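For example, Python's standard re module behaves exactly this way (just an
illustration of the usual behavior, not my implementation):

    import re

    # '.' excludes the line terminator unless DOTALL is set...
    print(re.match(r'.', '\n'))          # None
    # ...but the negated class [^abc] does match it, so [abc] and [^abc]
    # really do partition every code point between them.
    print(re.match(r'[^abc]', '\n'))     # matches '\n'
    for ch in ('a', 'x', '\n', '\u00e9'):
        assert bool(re.match(r'[abc]', ch)) != bool(re.match(r'[^abc]', ch))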
> Now if you accept digraphs or grapheme clusters in [set], you should also
> accept them in [^set], and "." should also include all digraphs and grapheme
> clusters. But this means that "." would need to include all possible texts,
> because digraphs are not limited in size. As this seems unreasonable (it
> would make counting the number of matches with "." impossible to perform),
> it seems reasonable to exclude the possibility of using digraphs in [set].
I played around with the ability to add digraphs to "." and came up
with two methods. The first would be to specifically list them using
syntax such as:
(?.ch.ll.rr) # . now matches "ch" "ll" and "rr" as single entities
Or you could specify a locale:
\l{es} # adds digraphs from Spanish locale to .
I don't yet support locales in my code, but I have reserved \l for
that purpose.
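To make the idea concrete, here is a rough sketch in Python; the digraph
list and the helper name are made up, standing in for whatever (?.ch.ll.rr)
or \l{es} would actually supply:

    # Hypothetical sketch: '.' consumes one unit at a time, where a unit is
    # either a listed digraph or a single code point.
    SPANISH_DIGRAPHS = ("ch", "ll", "rr")

    def next_dot_unit(text, pos, digraphs=SPANISH_DIGRAPHS):
        """Return (unit, new_pos) for the '.' unit starting at pos."""
        for d in digraphs:
            if text.startswith(d, pos):
                return d, pos + len(d)
        return text[pos], pos + 1

    # "churro" is seen as ch-u-rr-o: four '.' units, not six characters.
    units, pos = [], 0
    while pos < len("churro"):
        unit, pos = next_dot_unit("churro", pos)
        units.append(unit)
    print(units)    # ['ch', 'u', 'rr', 'o']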
> So the idea of implementing regexps by making them find matches in the NFD
> transformation of the input text is good, as it creates a conforming process.
> The bad thing is that E WITH ACUTE is no longer a single character; it is
> then absent from the "." universe and can't be part of [set] and [^set].
In my code, both the pattern and input text are converted to NFD, and
"." will match E WITH ACUTE as a single character (two code points).
This is done by keeping track of where the default grapheme cluster
boundaries are.
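For instance (using Python's unicodedata here just to show the effect; the
boundary test below is only a crude stand-in for the real default grapheme
cluster rules, it simply glues combining marks onto the preceding base):

    import unicodedata

    text = unicodedata.normalize('NFD', '\u00c9')   # E WITH ACUTE -> <E, COMBINING ACUTE>
    print(len(text))                                # 2 code points

    def clusters(s):
        # Approximate default grapheme clusters: a new cluster starts
        # wherever the canonical combining class is zero.
        out = []
        for ch in s:
            if out and unicodedata.combining(ch) != 0:
                out[-1] += ch
            else:
                out.append(ch)
        return out

    print(len(clusters(text)))   # 1 -- a single '.' unit despite two code points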
> Another possibility is to include in the "." universe the NFD transformation
> of every code point of the UCS, in such a way that the sequence <C,
> COMBINING ACUTE ACCENT> is still counted as 1 unit, but <C, COMBINING ACUTE
> ACCENT, COMBINING CEDILLA> will be counted as 2 "." units (but then
> remember that "." is sensitive to Unicode versions).
Using grapheme cluster boundaries is an easier way to do this, and
allows you to match any combining character sequence, whether there
is a code point assigned to it or not. You're correct that this
depends on the Unicode version since the grapheme cluster boundaries
depend on it.
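A rough way to see the difference (pure Python again, cluster boundaries
approximated by combining classes; count_dot_units is a made-up name):

    import unicodedata

    def count_dot_units(s):
        # One '.' unit per default grapheme cluster, approximated by starting
        # a new unit at every character whose combining class is zero.
        return sum(1 for ch in s if unicodedata.combining(ch) == 0)

    print(count_dot_units('C\u0301'))          # 1: <C, COMBINING ACUTE>
    print(count_dot_units('C\u0301\u0327'))    # also 1: no precomposed code point
                                               # exists, but it is one cluster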
> But then: a search for <C WITH COMBINING ACUTE ACCENT>, equivalent to
> a search for <C, COMBINING ACUTE ACCENT> in NFD form, will easily match the
> text <C WITH COMBINING ACUTE ACCENT, COMBINING CEDILLA>, but should it match
> the text encoded as <C WITH CEDILLA, COMBINING ACUTE ACCENT>, which is
> canonically equivalent? If the intent is to produce a Unicode-conforming
> process, it should be yes. So matches will be for two non-contiguous
> substrings in the scanned text, excluding the CEDILLA part!
I am experimenting with requiring a match to start and end on grapheme
cluster boundaries, thus a search for C WITH ACUTE will not match
C WITH ACUTE + CEDILLA or C WITH CEDILLA + ACUTE.
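Roughly, the boundary check looks like this (cluster boundaries again
approximated by combining classes, helper names invented for the example;
I use a bare C as the pattern here for simplicity, since after NFD a search
for <C, ACUTE> never even finds a contiguous substring in this text):

    import unicodedata

    def boundaries(s):
        # Offsets where a default grapheme cluster may start or end
        # (approximation: a combining mark never begins a cluster).
        return {i for i, ch in enumerate(s)
                if unicodedata.combining(ch) == 0} | {len(s)}

    hay = unicodedata.normalize('NFD', '\u00c7\u0301')   # <C, CEDILLA, ACUTE>
    pat = 'C'
    start = hay.find(pat)     # 0 -- a plain substring search succeeds here
    end = start + len(pat)    # 1 -- but this offset falls inside the cluster
    print(start in boundaries(hay), end in boundaries(hay))   # True False -> reject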
I have a problem I need to figure out, though, and that is what happens
when you add \m* to the pattern to allow a match (\m* means 'plus any
other marks'). If the NFD is <C, CEDILLA, ACUTE> and you try to match
C + ACUTE + \m*, the intervening CEDILLA causes this not to match; I need
to figure out a way to make this match....
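For reference, the canonical reordering that causes this, shown with
Python's unicodedata: both encodings normalize to the same NFD, with the
CEDILLA (combining class 202) placed before the ACUTE (230), so the ACUTE
is no longer adjacent to the base C that the pattern expects.

    import unicodedata

    a = unicodedata.normalize('NFD', '\u0106\u0327')   # C WITH ACUTE + COMBINING CEDILLA
    b = unicodedata.normalize('NFD', '\u00c7\u0301')   # C WITH CEDILLA + COMBINING ACUTE
    print([hex(ord(ch)) for ch in a])   # ['0x43', '0x327', '0x301']
    print(a == b)                       # True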
Mike