Pattern Properties

L2/03-172

Re:	Pattern Properties
From:	Mark Davis
Date:	2003-5-31

There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel number formats, and many others. These patterns have been very limited in the past, forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.

For forwards and backwards compatibility, it is very advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that we see in XML 1.1. In those recommendations, we committed to not allocating characters suitable for identifiers in the range 2190..2BFF, which is being used by XML 1.1.

A pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. This policy preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.

For example, in version 1.0 of program X, '≈' is a reserved syntax character, i.e. it doesn't perform an operation, but you have to quote it. In version 1.1, '≈' gets a real meaning, e.g. swiggle the following characters.

The pattern abc...\≈...xyz works on both version 1.0 and 1.1, and refers to the literal character since it is quoted in both cases.

The pattern abc...≈...xyz works on 1.1 and swiggles the following characters. On version 1.0, the engine (rightfully) has no idea what to do with ≈. Rather than silently failing (by ignoring ≈ or turning it into a literal), it can signal an error.
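The quoting policy in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `parse`, `RESERVED`, `V1_0_OPERATORS`, `V1_1_OPERATORS`, and `PatternSyntaxError` are hypothetical, the backslash is assumed to be the quote character, and the reserved set is abbreviated to a handful of characters rather than the full proposed list.

```python
class PatternSyntaxError(ValueError):
    """Raised when an unquoted syntax character has no defined meaning."""


# Hypothetical, abbreviated reserved set. A real engine would reserve every
# code point in the recommended syntax list, used or not.
RESERVED = frozenset("\\≈*?[]")

# Operators implemented by each (imaginary) version of program X.
V1_0_OPERATORS = frozenset()      # '≈' is reserved but has no meaning yet
V1_1_OPERATORS = frozenset("≈")   # '≈' now "swiggles" what follows


def parse(pattern, operators):
    """Split a pattern into ('literal', ch) and ('op', ch) tokens.

    `operators` is the subset of RESERVED that this engine version
    implements. Any other unquoted reserved character is an error.
    """
    tokens = []
    chars = iter(pattern)
    for ch in chars:
        if ch == "\\":
            # A backslash quotes the next character, whatever it is.
            quoted = next(chars, None)
            if quoted is None:
                raise PatternSyntaxError("dangling backslash")
            tokens.append(("literal", quoted))
        elif ch in operators:
            tokens.append(("op", ch))
        elif ch in RESERVED:
            # Signal an error rather than ignoring the character or
            # silently treating it as a literal.
            raise PatternSyntaxError(f"unquoted reserved character {ch!r}")
        else:
            tokens.append(("literal", ch))
    return tokens
```

With these definitions, `parse("abc\\≈xyz", ...)` yields the same literal tokens under both operator sets, while `parse("abc≈xyz", V1_0_OPERATORS)` raises `PatternSyntaxError` and the same call with `V1_1_OPERATORS` yields an `op` token.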

Just as in the case of XML 1.1, we should provide guidance by recommending a set of code points that can be used for such pattern whitespace and syntax characters. Particular pattern languages may, of course, override these recommendations (for example, adding or removing other characters for compatibility in ASCII). But by providing a list of these in UCD properties, we provide a stable, common basis for future expansion.

Note that to be useful, the property values would be absolutely invariant, not changing with successive versions of Unicode. Of course, this doesn't limit the ability of the Unicode Standard to add more symbol or whitespace characters, but the syntax and whitespace characters recommended for use in patterns would not change.

This is a proposal for adding the two pattern properties to the next appropriate version of the UCD.

The proposed Pattern_White_Space characters were derived from White_Space by removing some characters that appeared inappropriate for patterns, and adding LRM and RLM.
- Open issue: removing all compatibility characters would leave only U+0009..U+000D, U+0020, U+0085, U+2028..U+2029, but I think we probably want to retain at least IDEOGRAPHIC SPACE.
- The LRM and RLM are added so as to allow easier use of Arabic and Hebrew in patterns. This allows neutrals, in particular, to be given consistent direction for readability.

The proposed Pattern_Syntax code points were derived from the following set, then some script-specific characters were removed, along with some other characters that appeared inappropriate for patterns.
- [[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]]

Note: should anyone want, I can provide a list of the characters that were removed.
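
The starting set for Pattern_Syntax given above, [[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]], can be approximated with Python's standard `unicodedata` module. This is a rough sketch only: the function names `in_starting_set` and `to_ranges` are hypothetical, the module reflects whatever Unicode version ships with the Python interpreter (so the result will not match the 2003 data exactly), and the manual removals described above are not reproduced because the removed characters are not listed here.

```python
import sys
import unicodedata


def in_starting_set(cp):
    # Membership test for [[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]].
    if 0x2190 <= cp <= 0x2BFF:
        # The whole range is included, assigned or not.
        return True
    # General categories starting with S are symbols, with P punctuation.
    return unicodedata.category(chr(cp))[0] in "SP"


def to_ranges(code_points):
    """Collapse a sorted iterable of code points into (start, end) pairs."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


starting_set = [cp for cp in range(sys.maxunicode + 1) if in_starting_set(cp)]

if __name__ == "__main__":
    # Print the candidate ranges in the same XXXX..YYYY style as the listing.
    for start, end in to_ranges(starting_set):
        if start == end:
            print(f"{start:04X}")
        else:
            print(f"{start:04X}..{end:04X}")
```

Comparing this output against the listing below is one way to see which script-specific and other characters were dropped, subject to the version caveat above.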

0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020       ; Pattern_White_Space # SPACE
0085       ; Pattern_White_Space # <NEXT LINE (NEL)>
00A0       ; Pattern_White_Space # NO-BREAK SPACE
2000..200A ; Pattern_White_Space # EN QUAD..HAIR SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028       ; Pattern_White_Space # LINE SEPARATOR
2029       ; Pattern_White_Space # PARAGRAPH SEPARATOR
202F       ; Pattern_White_Space # NARROW NO-BREAK SPACE
205F       ; Pattern_White_Space # MEDIUM MATHEMATICAL SPACE
3000       ; Pattern_White_Space # IDEOGRAPHIC SPACE

# Latin-1

0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
003A..0040 ; Pattern_Syntax # COLON..COMMERCIAL AT
005B..0060 ; Pattern_Syntax # LEFT SQUARE BRACKET..GRAVE ACCENT
007B..007E ; Pattern_Syntax # LEFT CURLY BRACKET..TILDE
00A1..00A7 ; Pattern_Syntax # INVERTED EXCLAMATION MARK..SECTION SIGN
00A9       ; Pattern_Syntax # COPYRIGHT SIGN
00AB..00AC ; Pattern_Syntax # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK..NOT SIGN
00AE       ; Pattern_Syntax # REGISTERED SIGN
00B0..00B1 ; Pattern_Syntax # DEGREE SIGN..PLUS-MINUS SIGN
00B6..00B7 ; Pattern_Syntax # PILCROW SIGN..MIDDLE DOT
00BB       ; Pattern_Syntax # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BF       ; Pattern_Syntax # INVERTED QUESTION MARK
00D7       ; Pattern_Syntax # MULTIPLICATION SIGN
00F7       ; Pattern_Syntax # DIVISION SIGN

# General punctuation, may include currently unassigned code points

2010..2027 ; Pattern_Syntax # HYPHEN..HYPHENATION POINT
2030..205E ; Pattern_Syntax # PER MILLE SIGN..<unassigned>

# Whole blocks, may include currently unassigned code points
#   Arrows, Mathematical Operators, Miscellaneous Technical,
#   Control Pictures, Optical Character Recognition,
#   Enclosed Alphanumerics, Box Drawing, Block Elements,
#   Geometric Shapes, Miscellaneous Symbols, Dingbats,
#   Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A,
#   Braille Patterns, Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B,
#   Supplemental Mathematical Operators, Miscellaneous Symbols and Arrows

2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>

# CJK Symbols and Punctuation

3001..3003 ; Pattern_Syntax # IDEOGRAPHIC COMMA..DITTO MARK
3008..3020 ; Pattern_Syntax # LEFT ANGLE BRACKET..POSTAL MARK FACE
3030       ; Pattern_Syntax # WAVY DASH

# Arabic Presentation Forms-A (should have been encoded elsewhere)

FD3E..FD3F ; Pattern_Syntax # ORNATE LEFT PARENTHESIS..ORNATE RIGHT PARENTHESIS

# CJK Compatibility Forms (Question: should these be added?)

FE45..FE46 ; Pattern_Syntax # SESAME DOT..WHITE SESAME DOT
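
A listing in this form is straightforward to load into a lookup table. The following Python sketch assumes the line layout shown above (a code point or `start..end` range, a semicolon, a property name, and an optional `#` comment); the names `parse_properties`, `has_property`, and `SAMPLE` are hypothetical, and the sample includes only a few of the lines above.

```python
import re

# One data line: "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_properties(text):
    """Map each property name to a list of inclusive (start, end) ranges."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop the trailing comment, if any
        match = LINE.match(line)
        if not match:
            continue  # blank or comment-only line
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        table.setdefault(match.group(3), []).append((start, end))
    return table


def has_property(table, name, ch):
    """True if character `ch` falls in any range recorded for `name`."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in table.get(name, []))


SAMPLE = """\
0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020       ; Pattern_White_Space # SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK

# Whole blocks, may include currently unassigned code points
2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>
"""

table = parse_properties(SAMPLE)
```

A linear scan over the ranges is enough for a list this short; an engine that tests every pattern character might prefer a sorted list with binary search or a precomputed set.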