L2/03-172
Re: Pattern Properties
From: Mark Davis
Date: 2003-5-31
There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel number formats, and many others. These patterns have been very limited in the past, forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.
For forwards and backwards compatibility, it is very advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that we see in XML 1.1. In our recommendations, we committed to not allocating characters suitable for identifiers in the range 2190..2BFF, which is being used by XML 1.1.
A pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. This policy preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.
For example, in version 1.0 of program X, '≈' is a reserved syntax character; that is, it doesn't perform an operation, but you have to quote it to use it as a literal. In version 1.1, '≈' gets a real meaning, say, "swiggle the following characters".
- The pattern abc...\≈...xyz works on both version 1.0 and 1.1, and refers to the literal character since it is quoted in both cases.
- The pattern abc...≈...xyz works on 1.1 and swiggles the following characters. On version 1.0, the engine (rightfully) has no idea what to do with ≈. Rather than failing silently (by ignoring ≈ or turning it into a literal), it can signal an error.
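The two cases above can be sketched in code. This is only an illustration under stated assumptions: the reserved set `RESERVED`, the version sets `V1_0` and `V1_1`, the backslash quoting rule, and the names `tokenize` and `PatternError` are all hypothetical, standing in for whatever a real pattern language would define.

```python
# Hypothetical reserved set: a tiny stand-in for a full syntax-character property.
RESERVED = set("≈*?")


class PatternError(ValueError):
    """Raised when a pattern uses syntax this engine version does not understand."""


def tokenize(pattern, active):
    """Split a pattern into ('literal', ch) and ('syntax', ch) tokens.

    `active` is the subset of RESERVED that this engine version implements.
    A backslash quotes the next character, making it a literal.
    """
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise PatternError("dangling backslash at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if ch in RESERVED:
            if ch not in active:
                # Reserved but not yet meaningful: signal an error, never guess.
                raise PatternError(f"unquoted reserved character {ch!r} at index {i}")
            tokens.append(("syntax", ch))
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens


V1_0 = set("*?")    # '≈' is reserved but has no meaning yet
V1_1 = set("*?≈")   # '≈' now has a meaning
```

With these definitions, `tokenize("abc\\≈xyz", V1_0)` and `tokenize("abc\\≈xyz", V1_1)` both treat ≈ as a literal, while `tokenize("abc≈xyz", V1_0)` raises `PatternError` and the same call with `V1_1` yields a syntax token.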
Just as in the case of XML 1.1, we should offer guidance in the form of a recommended set of code points that can be used for such pattern whitespace and syntax characters. Particular pattern languages may, of course, override these recommendations (for example, adding or removing other characters for compatibility in ASCII). But by providing a list of these in UCD properties, we provide a stable, common basis for future expansion.
Note that to be useful, the property values would have to be absolutely invariant, not changing with successive versions of Unicode. Of course, this doesn't limit the ability of the Unicode Standard to add more symbol or whitespace characters, but the syntax and whitespace characters recommended for use in patterns would not change.
This is a proposal to add two pattern properties, Pattern_White_Space and Pattern_Syntax, to the next appropriate version of the UCD.
The proposed Pattern_White_Space characters were derived from White_Space by removing some characters that appeared inappropriate for patterns, and adding LRM and RLM.
Open issue: if we removed all compatibility characters, it would leave only U+0009..U+000D, U+0020, U+0085, U+2028..U+2029, but I think we probably want to retain at least IDEOGRAPHIC SPACE.
The LRM and RLM are added so as to allow easier use of Arabic and Hebrew in patterns. This allows neutrals, in particular, to be given consistent direction for readability.
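As a rough illustration of why LRM and RLM belong in the whitespace set, the sketch below strips pattern whitespace before interpretation, so directional marks inserted for readability do not change a pattern's meaning. The ranges are copied from the proposed Pattern_White_Space list in this document; the function names and the "whitespace is insignificant" policy are assumptions for the example, not part of the proposal.

```python
# Ranges copied from the proposed Pattern_White_Space list in this document.
PATTERN_WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x2000, 0x200A), (0x200E, 0x200F), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def is_pattern_white_space(ch):
    """True if the single character ch falls in one of the proposed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PATTERN_WHITE_SPACE)


def strip_pattern_white_space(pattern):
    """Drop every pattern whitespace character.

    Hypothetical policy: unquoted whitespace carries no meaning in the pattern.
    """
    return "".join(ch for ch in pattern if not is_pattern_white_space(ch))
```

Under this policy, a pattern author can place U+200E or U+200F next to neutral characters to fix their display direction, and the engine sees the same pattern either way.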
The proposed Pattern_Syntax code points were derived by starting from the following set and then removing some script-specific characters, along with some other characters that appeared inappropriate for patterns.
[[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]]
Note: should anyone want, I can provide a list of the characters that were removed.
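For readers who want to explore the starting set themselves, here is a minimal sketch of the membership test for [[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]]. It relies on Python's `unicodedata` module, so the general categories come from whatever Unicode version the Python runtime ships with, not from the version current when this proposal was written. The function name `in_starting_set` is made up for the example, and the manual removals described above are not modeled.

```python
import unicodedata


def in_starting_set(cp):
    """True if code point cp is a Symbol, a Punctuation, or in U+2190..U+2BFF."""
    if 0x2190 <= cp <= 0x2BFF:
        # The whole range counts, including code points that are unassigned.
        return True
    # General category strings start with 'S' for symbols, 'P' for punctuation.
    return unicodedata.category(chr(cp))[0] in ("S", "P")
```

Because the result is only the starting set, some code points that pass this test (script-specific symbols, for instance) do not appear in the proposed list.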
0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020       ; Pattern_White_Space # SPACE
0085       ; Pattern_White_Space # <NEXT LINE (NEL)>
00A0       ; Pattern_White_Space # NO-BREAK SPACE
2000..200A ; Pattern_White_Space # EN QUAD..HAIR SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028       ; Pattern_White_Space # LINE SEPARATOR
2029       ; Pattern_White_Space # PARAGRAPH SEPARATOR
202F       ; Pattern_White_Space # NARROW NO-BREAK SPACE
205F       ; Pattern_White_Space # MEDIUM MATHEMATICAL SPACE
3000       ; Pattern_White_Space # IDEOGRAPHIC SPACE

# Latin-1
0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
003A..0040 ; Pattern_Syntax # COLON..COMMERCIAL AT
005B..0060 ; Pattern_Syntax # LEFT SQUARE BRACKET..GRAVE ACCENT
007B..007E ; Pattern_Syntax # LEFT CURLY BRACKET..TILDE
00A1..00A7 ; Pattern_Syntax # INVERTED EXCLAMATION MARK..SECTION SIGN
00A9       ; Pattern_Syntax # COPYRIGHT SIGN
00AB..00AC ; Pattern_Syntax # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK..NOT SIGN
00AE       ; Pattern_Syntax # REGISTERED SIGN
00B0..00B1 ; Pattern_Syntax # DEGREE SIGN..PLUS-MINUS SIGN
00B6..00B7 ; Pattern_Syntax # PILCROW SIGN..MIDDLE DOT
00BB       ; Pattern_Syntax # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BF       ; Pattern_Syntax # INVERTED QUESTION MARK
00D7       ; Pattern_Syntax # MULTIPLICATION SIGN
00F7       ; Pattern_Syntax # DIVISION SIGN

# General punctuation, may include currently unassigned code points
2010..2027 ; Pattern_Syntax # HYPHEN..HYPHENATION POINT
2030..205E ; Pattern_Syntax # PER MILLE SIGN..<unassigned>

# Whole blocks, may include currently unassigned code points
# Arrows, Mathematical Operators, Miscellaneous Technical,
# Control Pictures, Optical Character Recognition
# Enclosed Alphanumerics, Box Drawing, Block Elements,
# Geometric Shapes, Miscellaneous Symbols, Dingbats
# Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A,
# Braille Patterns, Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B,
# Supplemental Mathematical Operators, Miscellaneous Symbols and Arrows
2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>

# CJK Symbols and Punctuation
3001..3003 ; Pattern_Syntax # IDEOGRAPHIC COMMA..DITTO MARK
3008..3020 ; Pattern_Syntax # LEFT ANGLE BRACKET..POSTAL MARK FACE
3030       ; Pattern_Syntax # WAVY DASH

# Arabic Presentation Forms-A (should have been encoded elsewhere)
FD3E..FD3F ; Pattern_Syntax # ORNATE LEFT PARENTHESIS..ORNATE RIGHT PARENTHESIS

# CJK Compatibility Forms (Question: should these be added?)
FE45..FE46 ; Pattern_Syntax # SESAME DOT..WHITE SESAME DOT
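A listing in this form is straightforward to load. The sketch below parses entries of the shape "code point or range ; property # comment" into per-property range lists, assuming one entry per line. The names `parse_property_lines`, `has_property`, and `SAMPLE` are illustrative only, and `SAMPLE` holds just a few of the entries above rather than the full list.

```python
def parse_property_lines(text):
    """Parse 'XXXX[..YYYY] ; Property # comment' lines.

    Returns a dict mapping each property name to a list of (lo, hi) ranges.
    Blank lines and comment-only lines are skipped.
    """
    ranges = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        code_points, prop = (field.strip() for field in data.split(";"))
        lo, _, hi = code_points.partition("..")
        ranges.setdefault(prop, []).append((int(lo, 16), int(hi or lo, 16)))
    return ranges


def has_property(ranges, prop, ch):
    """True if the single character ch is covered by a range listed for prop."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges.get(prop, []))


# A few entries copied from the listing above, one per line.
SAMPLE = """\
0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020 ; Pattern_White_Space # SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
# Latin-1
0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>
"""
```

Since the property values are meant to be invariant, an implementation could just as well freeze the parsed ranges into a constant table rather than reading a data file at run time.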