From: Mike (mike-list@pobox.com)
Date: Sun Sep 30 2007 - 16:34:48 CST
> The fact that [] is more efficient in regexp engines than a notation using
> (...|...|...) is just a matter of implementation. My opinion is that such
> performance difference is a defect of the implementation, i.e. a bug.
The term "bug" should refer to a situation where the wrong result
is obtained. Software should be -correct- first and -fast- second.
If it's fast enough, there is always something else more important
to spend time on.
> So, in a shortcut notation like [àé], you need an additional rule to
> disambiguate the meaning: you need to parse the set using default grapheme
> cluster boundaries, so that the characters considered as unbreakable units
> are the combining sequences (and all their canonical equivalents). So it's up
> to the implementation to make sure that [àé] is effectively a shortcut
> completely equivalent to (à|é).
I'm not sure I agree that you want to look for default grapheme
cluster boundaries inside a character class. If you list a few
Hangul L jamos, they will all be jumbled together into a single
cluster, for example. Also, how would you interpret [a\u0300]?
As (a|\u0300) or (a\u0300)?
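To make the two readings concrete, here is a minimal sketch using Python's re module, chosen only as one existing engine and not as a reference for what the rule ought to be. The variable names are mine. It shows that this engine reads the class as two independent code points, and that neither reading matches the precomposed, canonically equivalent form.

```python
import re
import unicodedata

# Decomposed form: "a" followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "a\u0300"
# Precomposed, canonically equivalent form: U+00E0.
precomposed = unicodedata.normalize("NFC", decomposed)

# Python's re treats the class as two separate code points: (a|\u0300).
as_class = re.findall("[a\u0300]", decomposed)

# The other reading, (a\u0300), has to be written as a sequence.
as_sequence = re.findall("(?:a\u0300)", decomposed)

# The class does not match the precomposed equivalent at all.
class_on_precomposed = re.findall("[a\u0300]", precomposed)

print(as_class)              # two separate matches
print(as_sequence)           # one match covering both code points
print(class_on_precomposed)  # no matches
```

An engine that parsed the class by grapheme clusters would have to pick the second reading and also handle canonical equivalents, which is exactly the extra rule under discussion.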
> Another problem: what is the meaning of [a-e]? In a language-dependent
> perspective, it should match all characters between a and e in the
> language's alphabet. This means that it should match not only single
> graphemes, but also the possible digraphs (like "ch" or "c’h"), i.e. the
> collation elements.
I think that [a-e] should -always- mean the five code points,
U+0061 through U+0065, regardless of locale. Even if you specify
a locale: \l{es}[a-e], I think it should still mean the same five
code points, and not add other elements such as "ch" (even though it
is treated as a single letter in the Spanish locale). In Hawaiian, [a-z] would
mean [aeiouhklmnpw], which would certainly cause trouble.
I can see that it might be useful to be able to do this, but I
would suggest that new syntax should be used to avoid confusion.
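A rough sketch of the difference between the two readings, again in Python. The first half uses the real re module; the second half is purely hypothetical: locale_range and HAWAIIAN_ALPHABET are names I made up, the letter list is simply the one given above (not taken from any locale database), and I use "w" as the end point because "z" is not in that list.

```python
import re

# Reading 1: [a-e] is exactly the code points U+0061 through U+0065.
code_point_range = [chr(cp) for cp in range(0x61, 0x65 + 1)]

# Python's re follows this reading: "ch" is never matched as one unit,
# only the individual letter "c" is.
matches = re.findall("[a-e]", "chico")

# Reading 2 (hypothetical): the range is taken from a locale's alphabet.
# Illustrative letter list only, copied from the message text.
HAWAIIAN_ALPHABET = "aeiouhklmnpw"


def locale_range(alphabet, first, last):
    """Return the letters from first to last in the alphabet's own order."""
    start = alphabet.index(first)
    end = alphabet.index(last)
    return alphabet[start:end + 1]


hawaiian_a_to_w = locale_range(HAWAIIAN_ALPHABET, "a", "w")

# The same-looking range now disagrees about an ordinary letter like "b".
b_in_code_point_range = re.fullmatch("[a-w]", "b") is not None
b_in_locale_range = "b" in hawaiian_a_to_w

print(code_point_range)
print(matches)
print(hawaiian_a_to_w)
print(b_in_code_point_range, b_in_locale_range)
```

The same pattern text gives different answers depending on which reading is in force, which is why I would rather see a distinct syntax for the locale-based one.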
Mike