L2/07-349

UTS #18 Suggested Changes

M. Davis, A. Heninger
October 10, 2007 (draft 2)

Based on the feedback and discussion on the Unicode list, we suggest the following changes be incorporated into a new proposed update draft of UTS #18 for public review and comment.

Introduction, and in the Level introductions

Make the application of the levels clearer. In particular that:

All regex implementations dealing with Unicode should be at least at Level 1.
Level 2 is good for implementations that need to handle more Unicode features, and is achievable without too much effort. However, some of the subitems in Level 2 are more important than others.
Level 3 contains information about extensions that may only be useful for specific applications and may require further investigation for effective implementation.

1.2 Properties

At the end of the section, describe some useful optional syntax.

1. not equals with property values, to allow for more natural expressions with property values

In addition to

\P{propname=value} and [:^propname=value:]

allow:

\p{propname!=value} or \p{propname≠value}
[:propname!=value:] or [:propname≠value:]

2. multiple property values, allowing for much more compact expressions for multiple property values

propname=value1|value2|value3...

\p{gc=L|M|Nd} is equivalent to [\p{gc=L}\p{gc=M}\p{gc=Nd}]

(Maybe put #2 in RL2.6 Wildcard Properties)
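
The two optional syntaxes above can be sketched outside any regex engine. The following is a minimal illustration, assuming Python's standard unicodedata module and a hypothetical helper has_property that understands only the gc (General_Category) property; treating a one-letter value such as L as the whole group (Lu, Ll, ...) and reading gc!=A|B as the complement of the union are assumptions of this sketch, not something the text above specifies.

```python
import unicodedata

def gc_matches(category: str, value: str) -> bool:
    """True if a two-letter General_Category code falls under `value`.

    Assumption: a one-letter value such as "L" stands for the whole group.
    """
    return category == value or (len(value) == 1 and category.startswith(value))

def has_property(ch: str, expr: str) -> bool:
    """Evaluate a hypothetical 'gc=V1|V2|...' or 'gc!=V' / 'gc≠V' expression."""
    negated = "!=" in expr or "≠" in expr
    # Normalize both "not equals" spellings to "=" before splitting.
    name, _, values = expr.replace("!=", "=").replace("≠", "=").partition("=")
    if name != "gc":
        raise ValueError("this sketch only knows the gc property")
    category = unicodedata.category(ch)
    hit = any(gc_matches(category, v) for v in values.split("|"))
    # Assumption: negation applies to the union of all listed values.
    return hit != negated

# gc=L|M|Nd behaves like the union of the three single-value tests.
sample = ["a", "5", "\u0301", " "]
union_of_singles = [
    has_property(c, "gc=L") or has_property(c, "gc=M") or has_property(c, "gc=Nd")
    for c in sample
]
compact = [has_property(c, "gc=L|M|Nd") for c in sample]
```

Here compact and union_of_singles agree for every sample character, which is the equivalence stated for \p{gc=L|M|Nd}.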

Section 1.3 (and elsewhere)

Replace & by && when used as illustrative syntax for set intersection (e.g., in constructing character ranges such as [\p{Greek}&&\p{Letter}]).
Similarly, replace | by || when used as illustrative syntax for set union.
Likewise, replace - by -- when used as illustrative syntax for set difference.
Document expected precedence among operations.
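
The need to document precedence can be seen with ordinary sets. The sketch below uses Python's built-in set type with three made-up sets A, B, and C (purely illustrative) and evaluates A || B && C under two possible conventions. The text above does not say which convention should be chosen; the point is only that the results differ.

```python
# Hypothetical sample sets; any overlapping sets would do.
A = set("abc")
B = set("cde")
C = set("ef")

# Convention 1: operators have equal precedence and group left to right.
left_to_right = (A | B) & C

# Convention 2: && binds more tightly than ||.
and_first = A | (B & C)

results_differ = left_to_right != and_first
```

With these sets, left_to_right contains only e, while and_first contains a, b, c, and e.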

1.6 Line Boundaries

Revise the text to remove "multiline mode". That is, change the following two bullets

If not in "multiline mode", must not match any of the newline sequences.
If in "multiline mode", must match all of the newline sequences, and \u000D\u000A (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)

to the single bullet:

Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u000D\u000A (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)
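
As a rough illustration of the revised bullet, the sketch below uses Python's re module. The set of newline sequences is the one enumerated in UTS #18 (LF, VT, FF, CR, NEL, LS, PS, and CRLF), which is not restated in the text above; the names ANY_WITH_NEWLINES, NEWLINE, units, and lines are hypothetical.

```python
import re

# An "arbitrary character" that also matches newlines. CRLF is listed first
# so that it is consumed as if it were a single character.
ANY_WITH_NEWLINES = re.compile(r"\r\n|.", re.DOTALL)

# All newline sequences, again trying CRLF before a lone CR.
NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def units(text: str) -> list[str]:
    """Split text into the units an arbitrary-character pattern would consume."""
    return ANY_WITH_NEWLINES.findall(text)

def lines(text: str) -> list[str]:
    """Split text at every newline sequence."""
    return NEWLINE.split(text)

crlf_units = units("a\r\nb")
split_lines = lines("one\r\ntwo\u2028three\x85four")
```

Note that in a backtracking engine the alternation can still fall back to matching a lone CR if the rest of a larger pattern fails, which is one reason treating CRLF as a unit is a recommendation rather than a conformance requirement.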

2 Extended Unicode Support: Level 2

Make it clear that the items in level 2 are not in order of importance. In particular, the highest priority ones in practice are:

RL2.3 Default Word Boundaries
RL2.5 Name Properties
RL2.6 Wildcard Properties

2.1 Canonical Equivalents

Explain that one of the most effective ways to implement canonical equivalence is to provide a special mode in which all matches are made on grapheme cluster boundaries, since this avoids the reordering problems that can arise with normalization.

2.2 Default Grapheme Clusters

Move this section to level 3, and add as recommendations that

[a-m \q{ch} \q{rr}] should behave like (?> ch | rr | [a-m]) as interpreted in Perl-like regex engines -- matching ch or rr and advancing by two code points, or matching a-m and advancing one code point, or failing to match.

Note that "(?> ch | rr | [a-m])heese" will match "chheese" but not "cheese"; that is, the c in [a-m] will not be tried once "ch" has already matched.
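
The atomic-group behavior described above can be sketched without relying on any particular engine's (?> ...) support. The following is a minimal hand-written matcher in Python; match_set and match_set_then are hypothetical helper names, and the set contents are hard-wired to the example [a-m \q{ch} \q{rr}].

```python
from typing import Optional

def match_set(text: str, pos: int) -> Optional[int]:
    """Match [a-m \\q{ch} \\q{rr}] at pos and return the new position, or None.

    Strings are tried before single characters, and the first success is
    final: there is no backtracking into the set afterwards.
    """
    for s in ("ch", "rr"):
        if text.startswith(s, pos):
            return pos + len(s)          # advance by two code points
    if pos < len(text) and "a" <= text[pos] <= "m":
        return pos + 1                   # advance by one code point
    return None                          # fail to match

def match_set_then(text: str, literal: str) -> bool:
    """Match the set at position 0, then require `literal` to follow."""
    end = match_set(text, 0)
    return end is not None and text.startswith(literal, end)

matches_chheese = match_set_then("chheese", "heese")
matches_cheese = match_set_then("cheese", "heese")
```

For "cheese", the set consumes "ch", the remainder "eese" does not start with "heese", and the matcher does not go back to try c alone, so the overall match fails.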

Matching a complemented set containing strings like \q{ch} may behave differently in different modes: the normal mode where code points are the unit of matching, or a mode where grapheme clusters are the unit of matching. That is, [^ a-m \q{ch} \q{rr}] should behave like:

in "normal" mode: (?! ch | rr | [a-m] ) [\x{0}-\x{10FFFF}] -- failing with strings starting with a-m, ch, or rr, and otherwise advancing by one code point
in "grapheme cluster" mode: (?! ch | rr | [a-m] ) \X -- failing with strings starting with a-m, ch, or rr, and otherwise advancing by a grapheme cluster
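
The difference between the two modes can be sketched in Python as follows. The helper names are hypothetical, and cluster_end is only a rough stand-in for \X (one code point plus any following combining marks); it is an assumption of this sketch and does not implement the full default grapheme cluster rules.

```python
import unicodedata
from typing import Optional

def cluster_end(text: str, pos: int) -> int:
    # Rough approximation of a grapheme cluster: a code point followed by
    # any run of characters whose General_Category starts with M.
    end = pos + 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return end

def match_complement(text: str, pos: int, grapheme_mode: bool = False) -> Optional[int]:
    """Match the complement of the set {a-m, ch, rr} at pos.

    Fails (returns None) if the text at pos starts with ch, rr, or a-m;
    otherwise advances by one code point or by one (approximate) cluster.
    """
    if pos >= len(text):
        return None
    if text.startswith(("ch", "rr"), pos) or "a" <= text[pos] <= "m":
        return None
    return cluster_end(text, pos) if grapheme_mode else pos + 1

text = "x\u0301y"                       # x + COMBINING ACUTE ACCENT + y
normal_end = match_complement(text, 0)
cluster_mode_end = match_complement(text, 0, grapheme_mode=True)
```

In normal mode the match stops after x, leaving the combining accent unconsumed; in grapheme cluster mode the accent is consumed together with its base.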

When interpreting a complex character set containing strings like \q{ch} plus embedded complement operations, it works best to interpret as if the complement were "pushed up" to the top of the expression, using the following rewrites recursively:

Original	Rewrite
^^x	x
^x || ^y	^(x && y)
^x || y	^(x -- y)
x || ^y	^(y -- x)
^x && ^y	^(x || y)
^x -- y	^(x || y)
^x && y	y -- x
^x -- ^y	y -- x
x && ^y	x -- y
x -- ^y	x && y

Applying these rewrites will end up with either the complement operations being completely eliminated, or a single remaining complement operation at the top level. Logically, the rest of the expression is then a flat list of characters and/or multicharacter strings, and matching can then be handled as in #1 or #2 above.
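
The recursive rewriting can be sketched as a small Python function over a toy expression tree. The tuple encoding ("set", "not", "or", "and", "diff"), the function names, and the finite universe used for checking are all assumptions of this illustration; a real implementation would work on the engine's own set representation.

```python
def push_up(e):
    """Return (negated, expr) where expr contains no complement operations."""
    op = e[0]
    if op == "set":
        return False, e
    if op == "not":                       # ^^x -> x falls out of flipping twice
        neg, inner = push_up(e[1])
        return (not neg), inner
    (nx, x), (ny, y) = push_up(e[1]), push_up(e[2])
    if op == "diff":                      # x -- y is the same as x && ^y
        op, ny = "and", not ny
    if op == "or":
        if nx and ny:
            return True, ("and", x, y)    # ^x || ^y -> ^(x && y)
        if nx:
            return True, ("diff", x, y)   # ^x || y  -> ^(x -- y)
        if ny:
            return True, ("diff", y, x)   # x || ^y  -> ^(y -- x)
        return False, ("or", x, y)
    # op == "and" (including rewritten differences)
    if nx and ny:
        return True, ("or", x, y)         # ^x && ^y -> ^(x || y)
    if nx:
        return False, ("diff", y, x)      # ^x && y  -> y -- x
    if ny:
        return False, ("diff", x, y)      # x && ^y  -> x -- y
    return False, ("and", x, y)

def evaluate(e, universe):
    """Evaluate an expression directly over a finite universe (for checking)."""
    op = e[0]
    if op == "set":
        return set(e[1])
    if op == "not":
        return universe - evaluate(e[1], universe)
    a, b = evaluate(e[1], universe), evaluate(e[2], universe)
    return {"or": a | b, "and": a & b, "diff": a - b}[op]

def S(*items):
    return ("set", frozenset(items))

# Hypothetical universe mixing single characters and strings such as "ch".
UNIVERSE = {"a", "b", "c", "ch", "rr", "z"}
expr = ("or", ("not", S("a", "ch")), S("b", "ch"))   # ^x || y
negated, positive = push_up(expr)
```

For this input, negated is true and positive is the difference x -- y, matching the table row for ^x || y.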