Hi,
I have a few questions regarding Unicode regular expressions.
1) I'm working on a regexp matcher and I'd like to know which properties
are never needed in a \p{...} item. Currently I have included the properties
listed below, but for efficiency reasons I'd like to throw out what isn't
really necessary:
general category
bidi class ?
canonical combining class ?
decomposition type
line break
east asian width
arabic joining type ?
arabic joining group ?
script name
block name
age
numeric type
all binary properties
So can anyone tell me whether the marked properties (those followed by a ?)
are really useful in a \p{...} item?
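
To make the cost side of question 1 concrete, here is a minimal sketch of how a \p{name=value} test could be dispatched per property. It is only an illustration, not the matcher described above: it assumes Python's standard unicodedata module as the data source, and the short names ("gc", "bc", "ccc", "dt", "ea") and the helpers decomposition_type and has_property are hypothetical names chosen for this example. Each entry in the table is one more lookup that has to be carried around, which is the overhead the question is about.

```python
import unicodedata


def decomposition_type(ch):
    """Return the decomposition type of ch: 'none', 'canonical', or a tag like 'compat'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        # Compatibility mappings carry a tag such as "<compat>" or "<font>".
        return mapping[1:mapping.index(">")]
    return "canonical"


# Hypothetical short names mapped to lookups; only properties that
# unicodedata exposes are covered in this sketch.
PROPERTY_LOOKUPS = {
    "gc": unicodedata.category,                         # general category
    "bc": unicodedata.bidirectional,                    # bidi class
    "ccc": lambda ch: str(unicodedata.combining(ch)),   # canonical combining class
    "dt": decomposition_type,                           # decomposition type
    "ea": unicodedata.east_asian_width,                 # east asian width
}


def has_property(ch, prop, value):
    """Test a single code point against prop=value, as a \\p{prop=value} item would."""
    return PROPERTY_LOOKUPS[prop](ch) == value
```

For example, has_property("\u0301", "ccc", "230") tests the combining acute accent against canonical combining class 230, and has_property("\u05d0", "bc", "R") tests Hebrew alef against bidi class R.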
2) About grapheme clusters in a bracketed expression: it is clear what is
meant by an expression like [a-z\g{aa}]. But how do I interpret something
like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in the range
a-z, or the grapheme cluster aa, provided it has the foo property. The problem
is that \p{...} only applies to single code points, not to grapheme clusters.
I can do three things:
1. check whether the NFC form of the characters in \g{...} is a single
character and work with that, otherwise fail
2. only test the first (base) character of the cluster
3. don't allow use of operators & and - (i.e. &^) in a bracketed
expression in which one or more \g{...} are used
What would be the most appropriate thing to do?
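
The first two options above can be sketched as follows. This is a rough illustration only, assuming Python's unicodedata module for normalization; the function names and the predicate argument (a stand-in for a \p{foo} test on one code point) are hypothetical and do not come from any existing matcher.

```python
import unicodedata


def cluster_as_single_code_point(cluster):
    """Return the NFC form of cluster if it is one code point, else None."""
    composed = unicodedata.normalize("NFC", cluster)
    return composed if len(composed) == 1 else None


def cluster_matches_option1(cluster, predicate):
    """Option 1: apply the property test to the NFC form, or fail."""
    ch = cluster_as_single_code_point(cluster)
    if ch is None:
        raise ValueError("cluster does not compose to a single code point")
    return predicate(ch)


def cluster_matches_option2(cluster, predicate):
    """Option 2: apply the property test to the first (base) character only."""
    return predicate(cluster[0])


def is_lowercase_letter(ch):
    """Example predicate standing in for \\p{Ll}."""
    return unicodedata.category(ch) == "Ll"
```

With these definitions, "e" followed by U+0301 composes to the single code point U+00E9, so option 1 can test it, whereas the cluster aa from the example stays two code points under NFC and option 1 has to reject it. Option 2 accepts both, but it silently ignores everything after the base character.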
Regards,
Theo
This archive was generated by hypermail 2.1.2 : Tue Jul 23 2002 - 01:32:53 EDT