L2/07-349
UAX #18 Suggested Changes
M. Davis, A. Heninger
October 10, 2007
(draft 2)
Based on the feedback and discussion on the Unicode list, we suggest the
following changes be incorporated into a new proposed update draft of UTS #18
for public review and comment.
Make the application of the levels clearer. In particular, that:
- All regex implementations dealing with Unicode should be at least at Level 1.
- Level 2 is good for implementations that need to handle more Unicode
  features, and is achievable without too much effort. However, some of the
  subitems in Level 2 are more important than others.
- Level 3 contains information about extensions that may only be useful for
  specific applications and may require further investigation for effective
  implementation.
At the end of the section, describe some useful optional syntax.
1. Not-equals with property values, to allow for more natural expressions
   with property values. As well as
   - \P{propname=value} and [:^propname=value:]
   also allow:
   - \p{propname!=value} or \p{propname≠value}
   - [:propname!=value:] or [:propname≠value:]
2. Multiple property values, allowing for much more compact expressions:
   propname=value1|value2|value3...
   e.g.
   - \p{gc=L|M|Nd} is equivalent to [\p{gc=L}\p{gc=M}\p{gc=Nd}]
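The two optional forms above can be treated as shorthand for syntax that already exists. The following Python sketch rewrites them into the basic forms. The function name `desugar` and both helper patterns are hypothetical, introduced only for illustration, and the combination of `!=` with several values is left untouched because the text does not describe it.

```python
import re

# Hypothetical patterns for the optional syntax (illustration only).
_P_FORM = re.compile(r'\\p\{(\w+)(!=|≠|=)([^}]*)\}')
_POSIX_FORM = re.compile(r'\[:(\w+)(?:!=|≠)([^:|\]]*):\]')


def desugar(pattern: str) -> str:
    """Rewrite \\p{name!=v} and \\p{name=v1|v2} into the basic forms."""

    def p_repl(m):
        name, op, values = m.group(1), m.group(2), m.group(3).split('|')
        if op == '=':
            items = ''.join('\\p{%s=%s}' % (name, v) for v in values)
            # A single value needs no enclosing set.
            return items if len(values) == 1 else '[' + items + ']'
        if len(values) == 1:
            # name!=value is shorthand for the complemented property.
            return '\\P{%s=%s}' % (name, values[0])
        # != combined with several values is not described in the text:
        # leave it unchanged rather than guess.
        return m.group(0)

    def posix_repl(m):
        return '[:^%s=%s:]' % (m.group(1), m.group(2))

    return _POSIX_FORM.sub(posix_repl, _P_FORM.sub(p_repl, pattern))
```

For example, `desugar(r'\p{gc=L|M|Nd}')` produces the bracketed union shown in item 2, and `desugar(r'\p{sc!=Greek}')` produces `\P{sc=Greek}`.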
Section 1.3 (and elsewhere)
- Replace & by && when used as illustrative syntax for set intersection
  (e.g. in constructing character ranges such as [\p{Greek}&&\p{Letter}]).
- Similarly, replace | by || when used as illustrative syntax for set union.
- And replace - by -- when used as illustrative syntax for set difference.
- Document expected precedence among operations.
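As a rough illustration of what the three operators denote, the sketch below models character classes as Python sets of code points. It makes several assumptions: the universe is cut down to U+0000..U+04FF, \p{Greek} is approximated by testing whether the character name contains "GREEK" (the standard library has no script property), and all variable names are hypothetical. It says nothing about operator precedence, which the text leaves to be documented.

```python
import unicodedata

# Small slice of the code space, for illustration only.
UNIVERSE = range(0x0000, 0x0500)


def prop(pred):
    """Build the set of code points in UNIVERSE satisfying pred."""
    return frozenset(cp for cp in UNIVERSE if pred(chr(cp)))


letter = prop(lambda c: unicodedata.category(c).startswith('L'))
upper = prop(lambda c: unicodedata.category(c) == 'Lu')
# Assumption: character names stand in for the Script property here.
greek = prop(lambda c: 'GREEK' in unicodedata.name(c, ''))

greek_letters = greek & letter       # [\p{Greek}&&\p{Letter}]
greek_or_upper = greek | upper       # [\p{Greek}||\p{Lu}]
greek_non_letters = greek - letter   # [\p{Greek}--\p{Letter}]
```

With these definitions, U+03B1 (α) is in `greek_letters`, while U+0384 (GREEK TONOS, a modifier symbol) ends up in `greek_non_letters`.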
Revise the text to remove "multiline mode". That is, change the following two
bullets
- If not in "multiline mode", must not match any of the newline sequences.
- If in "multiline mode", must match all of the newline sequences, and
  \u000D\u000A (CRLF) should match as if it were a single character. (The
  recommendation that CRLF match as a single character is, however, not
  required for conformance to RL1.6.)
to the single bullet:
- Where the 'arbitrary character pattern' matches a newline sequence, it must
  match all of the newline sequences, and \u000D\u000A (CRLF) should match as
  if it were a single character. (The recommendation that CRLF match as a
  single character is, however, not required for conformance to RL1.6.)
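The revised bullet can be approximated in Python's `re` module as follows. This is a minimal sketch, not a conformance claim: the name `ANY` is hypothetical, and it models only the case where the arbitrary character pattern is allowed to match newline sequences (Python's DOTALL flag).

```python
import re

# Hypothetical "arbitrary character" pattern. CRLF is listed first so that
# it is consumed as one unit; with DOTALL, "." then matches every other
# code point, including all remaining newline characters.
ANY = re.compile(r'(?:\r\n|.)', re.DOTALL)

# Split a sample string into the units that ANY consumes.
units = ANY.findall('a\r\nb\u2028c')
```

Here `units` has five elements, with `'\r\n'` as a single element. A lone `'\r'` or `'\n'` is still matched on its own.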
Make it clear that the items in level 2 are not in order of importance. In
particular, the highest priority ones in practice are:
Describe that one of the most effective ways to implement canonical equivalents
is a special mode in which all matches are made on grapheme cluster boundaries,
since this avoids the reordering problems that can arise in normalization.
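A minimal sketch of the idea, under loud assumptions: Python's standard library has no grapheme cluster segmentation, so a cluster is simplified here to "one code point plus any following combining marks", which is much cruder than real grapheme cluster rules. The function names are hypothetical. Each cluster is normalized on its own, so mark reordering never crosses a match boundary.

```python
import unicodedata


def clusters(text):
    # Simplified stand-in for grapheme clusters (assumption): a code point
    # followed by any run of combining marks (general category M*).
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith('M'):
            out[-1] += ch
        else:
            out.append(ch)
    # Normalize cluster by cluster; reordering stays inside one cluster.
    return [unicodedata.normalize('NFD', c) for c in out]


def find_equivalent(pattern, text):
    # Literal search on cluster boundaries. Returns the index of the first
    # matching cluster, or -1 if there is no match.
    p, t = clusters(pattern), clusters(text)
    for i in range(len(t) - len(p) + 1):
        if t[i:i + len(p)] == p:
            return i
    return -1
```

With this, a precomposed U+00E9 in the pattern finds "e" followed by U+0301 in the text, and marks given in a different order still match, while a bare "e" does not match inside the cluster "e" + U+0301.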
Move this section to level 3, and add as recommendations that
- [a-m \q{ch} \q{rr}] should behave like (?> ch | rr | [a-m]) as interpreted
  in Perl-like regex engines: matching ch or rr and advancing by two code
  points, or matching a-m and advancing by one code point, or failing to
  match.
- Note that "(?> ch | rr | [a-m])heese" will match "chheese" but not
  "cheese"; that is, the c in [a-m] will not match if the "ch" has already
  matched.
- Matching a complemented set containing strings like \q{ch} may behave
  differently in different modes: the normal mode where code points are the
  unit of matching, or a mode where grapheme clusters are the unit of
  matching. That is, [^ a-m \q{ch} \q{rr}] should behave like:
  - in "normal" mode: (?! ch | rr | [a-m] ) [\x{0}-\x{10FFFF}], failing
    with strings starting with a-m, ch, or rr, and otherwise advancing by
    one code point
  - in "grapheme cluster" mode: (?! ch | rr | [a-m] ) \X, failing with
    strings starting with a-m, ch, or rr, and otherwise advancing by a
    grapheme cluster
- When interpreting a complex character set containing strings like \q{ch}
  plus embedded complement operations, it works best to interpret it as if
  the complement were "pushed up" to the top of the expression, using the
  following rewrites recursively:
Original    | Rewrite
^^x         | x
^x || ^y    | ^(x && y)
^x || y     | ^(x -- y)
x || ^y     | ^(y -- x)
^x && ^y    | ^(x || y)
^x -- y     | ^(x || y)
^x && y     | y -- x
^x -- ^y    | y -- x
x && ^y     | x -- y
x -- ^y     | x && y
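Each rewrite in the table above is an ordinary set identity, so it can be checked by brute force. The sketch below does this over every pair of subsets of a four-element universe. It treats x and y as plain sets and the complement as relative to that universe; the names `RULES` and `failing_rules` are hypothetical.

```python
from itertools import product

U = frozenset(range(4))  # tiny universe, enough to cover every Venn region


def c(s):
    """Complement (^s) relative to U."""
    return U - s


SUBSETS = [frozenset(i for i in range(4) if (mask >> i) & 1)
           for mask in range(16)]

# (original, function computing the original, function computing the rewrite)
RULES = [
    ('^^x',      lambda x, y: c(c(x)),      lambda x, y: x),
    ('^x || ^y', lambda x, y: c(x) | c(y),  lambda x, y: c(x & y)),
    ('^x || y',  lambda x, y: c(x) | y,     lambda x, y: c(x - y)),
    ('x || ^y',  lambda x, y: x | c(y),     lambda x, y: c(y - x)),
    ('^x && ^y', lambda x, y: c(x) & c(y),  lambda x, y: c(x | y)),
    ('^x -- y',  lambda x, y: c(x) - y,     lambda x, y: c(x | y)),
    ('^x && y',  lambda x, y: c(x) & y,     lambda x, y: y - x),
    ('^x -- ^y', lambda x, y: c(x) - c(y),  lambda x, y: y - x),
    ('x && ^y',  lambda x, y: x & c(y),     lambda x, y: x - y),
    ('x -- ^y',  lambda x, y: x - c(y),     lambda x, y: x & y),
]


def failing_rules():
    """Return the names of rules whose two sides disagree for some x, y."""
    return sorted({name
                   for name, lhs, rhs in RULES
                   for x, y in product(SUBSETS, repeat=2)
                   if lhs(x, y) != rhs(x, y)})
```

An empty result from `failing_rules()` means every row preserves the set it denotes.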
Applying these rewrites will end up with either the complement operations
being completely eliminated, or a single remaining complement operation at the
top level. Logically, the rest of the expression is then a flat list of
characters and/or multicharacter strings, and matching strings can then be
handled as in #1 or #2 above.
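Once the set is a flat list of characters and strings, the matching behavior recommended earlier can be sketched without a regex engine. The functions below are hypothetical and cover only the positive set and the complemented set in "normal" (code point) mode; grapheme cluster mode is omitted because it needs a cluster segmenter. Python string indices count code points, which is what "advancing by one code point" relies on here.

```python
A_TO_M = set('abcdefghijklm')   # the range a-m
STRINGS = ['ch', 'rr']          # the \q{...} members


def match_set(text, pos, chars=A_TO_M, strings=STRINGS):
    # Positive set [a-m \q{ch} \q{rr}]: strings are tried before single
    # code points, mirroring (?> ch | rr | [a-m]).
    # Returns the new position, or None on failure.
    for s in strings:
        if text.startswith(s, pos):
            return pos + len(s)
    if pos < len(text) and text[pos] in chars:
        return pos + 1
    return None


def match_complement(text, pos, chars=A_TO_M, strings=STRINGS):
    # Complemented set in "normal" mode: fail if any member matches here,
    # otherwise advance by one code point.
    if pos >= len(text) or match_set(text, pos, chars, strings) is not None:
        return None
    return pos + 1


def set_then_literal(text, literal):
    # Anchored at position 0: the set followed by a literal. The set is not
    # retried with a shorter alternative, as with an atomic group.
    end = match_set(text, 0)
    return end is not None and text.startswith(literal, end)
```

Here `set_then_literal('chheese', 'heese')` succeeds, while `set_then_literal('cheese', 'heese')` fails because "ch" is consumed first and the c in a-m is never tried.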