From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 14:53:51 CDT
Jonathan Coxhead wrote:
> I'd just like to point out that a "[ ]" regular expression is defined to
> match always exactly one character (if it matches at all).
Why? This is just a historical limitation of old ASCII-based
implementations. From a user's perspective, the [] notation is just a
convenient shorthand for an alternation between multiple strings, each making
up what the user MAY perceive as a single character. If you want to be fair
to every language, you have to admit that restricting [] to
single-codepoint matches is not relevant.
The fact that [] is more efficient in regexp engines than a notation using
(...|...|...) is just a matter of implementation. My opinion is that such a
performance difference is a defect of the implementation, i.e. a bug. From
the user's perspective, the meaning is not altered.
Then there's the problem of regexps like [àé]: the set contains composite
characters. For such a shortcut to be acceptable, it has to remain meaningful
even if the regexp is in NFD form, without having to write it explicitly as
[\q{à}\q{é}] (otherwise the set would also include [ae] without the accents,
and would also include the accents separately).
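To make the problem concrete, here is a small sketch using Python's re and unicodedata modules (the choice of Python is an assumption for illustration only; the message names no particular engine). It shows what a code-point-based engine sees once the pattern [àé] has been converted to NFD.

```python
import re
import unicodedata

# The set written with precomposed characters (U+00E0, U+00E9).
pattern_nfc = "[\u00e0\u00e9]"
# The same pattern text after conversion to NFD.
pattern_nfd = unicodedata.normalize("NFD", pattern_nfc)

# The NFD class now holds four separate code points:
# a, COMBINING GRAVE ACCENT, e, COMBINING ACUTE ACCENT.
members = [f"U+{ord(c):04X}" for c in pattern_nfd[1:-1]]

def matches_single(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# The NFD class accepts a bare "a" and a lone accent,
# but not the decomposed sequence "a" + U+0300 as a unit.
bare_a = matches_single(pattern_nfd, "a")
lone_accent = matches_single(pattern_nfd, "\u0300")
decomposed = matches_single(pattern_nfd, "a\u0300")
```

Here bare_a and lone_accent come out true while decomposed is false, which is exactly the unwanted reading described above.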
So, in a shortcut notation like [àé], you need an additional rule to
disambiguate the meaning: you need to parse the set using default grapheme
cluster boundaries, so that the characters considered as unbreakable units
are the combining sequences (and all their canonical equivalents). So it's up
to the implementation to make sure that [àé] is effectively a shortcut
completely equivalent to (à|é). If only the precomposed characters must be
matched (and not the canonically equivalent decomposed strings), then you
need to specify the regexp in a way that can't be interpreted as canonically
equivalent.
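A minimal sketch of that rule, again in Python and under stated assumptions: the helper names are hypothetical, combining sequences (a base character followed by characters of non-zero combining class) stand in for default grapheme clusters, and only the NFC and NFD forms of each unit are generated rather than every canonical equivalent.

```python
import re
import unicodedata

def split_combining_sequences(body):
    """Split a set body into base characters with their combining marks.

    Approximation: this is not full default grapheme cluster segmentation.
    """
    units = []
    for ch in body:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # attach the mark to the preceding base
        else:
            units.append(ch)     # start a new unit
    return units

def class_to_alternation(body):
    """Rewrite the body of a [...] set as an alternation of strings."""
    alternatives = set()
    for unit in split_combining_sequences(body):
        # Assumption: NFC and NFD only, not all canonical equivalents.
        for form in ("NFC", "NFD"):
            alternatives.add(unicodedata.normalize(form, unit))
    # Longest alternatives first, so a decomposed sequence is tried
    # before any shorter alternative.
    ordered = sorted(alternatives, key=lambda a: (-len(a), a))
    return "(?:" + "|".join(re.escape(a) for a in ordered) + ")"

# The same alternation results whether the set was typed in NFC or NFD.
from_nfc = class_to_alternation("\u00e0\u00e9")
from_nfd = class_to_alternation("a\u0300e\u0301")
```

With this rewriting, both the precomposed "à" and the sequence "a" + U+0300 match, while a bare "a" or a lone accent does not.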
If a regexp string contains "à", it designates all its canonical
equivalents; to match only the precomposed "à", you would need a notation
specifying that, like "\C{à}" for matching the character converted to NFC
form only, excluding all other canonical equivalents. But then, how to match
characters that are excluded from recomposition in NFC form?
* Maybe this notation should still allow the recompositions (so that
compatibility characters become matchable)
* Or, more safely, use another way to specify these compatibility
characters (like the \uxxxx notation, which can't be interpreted as meaning
anything other than the designated character).
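For illustration, the following Python sketch shows two characters that NFC never produces: U+0958 (DEVANAGARI LETTER QA, a composition exclusion) and U+212B (ANGSTROM SIGN, a singleton decomposition). The examples and the helper name are chosen here, not taken from the message. A pattern written with the \uxxxx escape matches such a character only in text that has not been normalized.

```python
import re
import unicodedata

def survives_nfc(ch):
    """True if NFC leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch

a_grave = "\u00e0"    # precomposed, stable under NFC
qa = "\u0958"         # composition exclusion: NFC keeps it decomposed
angstrom = "\u212b"   # singleton: NFC maps it to U+00C5

qa_nfc = unicodedata.normalize("NFC", qa)
angstrom_nfc = unicodedata.normalize("NFC", angstrom)

# The escape names exactly one code point, so it only matches raw text.
raw_hit = re.search(r"\u0958", qa) is not None
normalized_hit = re.search(r"\u0958", qa_nfc) is not None
```

So a notation like "\C{...}" defined purely in terms of NFC could never name these characters, whereas the escape form still can.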
Another problem: what is the meaning of [a-e]? From a language-dependent
perspective, it should match all characters between a and e in the
language's alphabet. This means that it should match not only single
graphemes, but also the possible digraphs (like "ch" or "c’h"), i.e. the
collation elements.
But I think that regexps should not be interpreted ambiguously, unless the
application knows which locale the user expects by default. Another
mechanism should be available in regexps to override the default locale.
Two possibilities:
* introduce locale specifiers in external regexp flags (remember the flags
in Perl or PHP or vi/ed/sed after the final slash delimiting the regexp)
* include in the regexp syntax itself a locale specifier for specific parts
of the regexps like:
(?locale=br![a-e]) which means that the set is interpreted within the
Breton locale where (ch) and (c’h) are part of the alphabet, between (b) and
(d), but NOT (c): an isolated c would NOT be matched in this locale, unless
you use an extended locale that also includes (c) within the Breton alphabet.
To specify the historic behaviour, you would simply use (?locale=C!...) or
(?locale=POSIX!...) for example to ignore the user's default locale.
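As a rough sketch of the second possibility, the Python code below preprocesses the proposed (?locale=xx![a-e]) syntax into an ordinary alternation of collation elements. Everything here is hypothetical: the syntax is only the proposal above, the ALPHABETS table is a hand-written fragment (a real implementation would take collation data from the locale), and only the simple [x-y] form is handled.

```python
import re

# Hypothetical alphabet fragments in collation order. The Breton entry
# follows the description above: (ch) and (c’h) sit between (b) and (d),
# and an isolated (c) is absent. The apostrophe is U+2019, as in the text.
ALPHABETS = {
    "br": ["a", "b", "ch", "c\u2019h", "d", "e"],
    "C": ["a", "b", "c", "d", "e"],
}

# Matches the proposed notation for a single range, e.g. (?locale=br![a-e]).
LOCALE_SET = re.compile(r"\(\?locale=(\w+)!\[(\w)-(\w)\]\)")

def expand_locale_sets(pattern):
    """Replace each locale-scoped range by an alternation of its elements."""
    def replace(match):
        alphabet = ALPHABETS[match.group(1)]
        low = alphabet.index(match.group(2))
        high = alphabet.index(match.group(3))
        elements = alphabet[low:high + 1]
        # Longest elements first, so digraphs are tried before single letters.
        elements = sorted(elements, key=len, reverse=True)
        return "(?:" + "|".join(re.escape(e) for e in elements) + ")"
    return LOCALE_SET.sub(replace, pattern)

breton = re.compile(expand_locale_sets("(?locale=br![a-e])"))
posix = re.compile(expand_locale_sets("(?locale=C![a-e])"))
```

With these tables, breton matches "ch" and "c’h" as single units but rejects an isolated "c", while posix keeps the historic behaviour and matches "c".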