From: Mike (mike-list@pobox.com)
Date: Mon Sep 24 2007 - 07:59:32 CDT
>> I played around with the ability to add digraphs to "." and came up
>> with two methods. The first would be to specifically list them using
>> syntax such as:
>
> I'd just like to point out that a "[ ]" regular expression is defined to
> match always exactly one character (if it matches at all).
Correct, except that a Spanish speaker would consider "ch" to be a
single character, even though it takes two code points to represent it.
> You can write "[abcdef]" as "(a|b|c|d|e|f)" if you like. You can also write
> "(a|bb|ccc|dddd|eeeee|ffffff)", but there is no form using "[ ]" to match the
> same thing.
There is a mechanism (not sure of the origin) for specifying that
a sequence of code points is to be treated as a single character in a
character class: [\q{ch}]. So [abcdef\q{ch}] matches the same thing
as (a|b|c|d|e|f|ch).
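
To make that equivalence concrete, here is a rough Python sketch. Python's
re module has no \q construct as far as I know, so the hypothetical helper
expand_class() below just builds the alternation by hand. It puts the longer
strings first, which is my own assumption: Python's matcher takes the first
alternative that succeeds, so listing "ch" last would let "c" win.

```python
import re

# Hypothetical helper (not part of any library): build an alternation that
# behaves like a character class containing the single characters in
# `singles` plus the multi-code-point strings in `strings`.
def expand_class(singles, strings):
    # Longest alternatives first, so "ch" is tried before plain "c".
    alternatives = sorted(list(strings) + list(singles), key=len, reverse=True)
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"

# Rough stand-in for [abcdef\q{ch}]
pattern = re.compile(expand_class("abcdef", ["ch"]))

# "ch" counts as one unit even though it is two code points.
matches = pattern.findall("chef")
print(len("ch"), matches)
```

An engine with leftmost-longest semantics would not care about the order of
the alternatives; the sorting is only there because of how Python's matcher
picks among them.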
> "[ ]" exists primarily as an optimisation, because matching 1 character
> against a set is a fast operation, whereas checking against an unknown number of
> alternatives of potentially varying lengths ("( | )") is expensive.
Yes, and it is also more readable (though the \q construct lowers
readability): [0-9] is much better than (0|1|2|3|4|5|6|7|8|9).
> So a sequence specified like [^ ] could never match a whole message, or the
> string "New York": it could only match a single character.
I have always agreed with this.
> What exactly this means in the context of Unicode is a different matter, but
> I imagine some sort of historical consistency is desirable.
The historical character class is based on ASCII. A Unicode version
needs to be able to represent any character in any language, so in
some cases, you need to specify multiple code points. This is not
just for things like "ch" but also for a base character + combining
mark that does not have a precomposed form in the standard.
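
A quick way to see the combining-mark case, sketched in Python with the
standard unicodedata module. The choice of "q" + U+0303 COMBINING TILDE as
the example without a precomposed form is mine, not something from the
earlier discussion.

```python
import unicodedata

# "n" + COMBINING TILDE has a precomposed form (U+00F1), so NFC
# normalization collapses it to a single code point.
n_tilde = unicodedata.normalize("NFC", "n\u0303")

# "q" + COMBINING TILDE has no precomposed form, so even after NFC it
# still takes two code points to represent one user-perceived character.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(n_tilde), len(q_tilde))
```

A one-code-point character class can hold the first but not the second,
which is where something like \q{...} comes in.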
Mike