Re: Regex Clarifications
From: Mark Davis
Date: 2011-02-10
Here are some requested clarifications, on the basis of discussions on the i18n-dev email list.
1. RL1.1 Hex Notation
To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF).
It should be made clear that the syntax must allow the hex notation to specify the Unicode code point itself, rather than the corresponding UTF-16 or UTF-8 code units, so syntax like \x{D800}\x{DC00} or \x{F0}\x{90}\x{80}\x{80} does not meet this requirement.
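To make the distinction concrete, here is a small sketch in Python (chosen only for illustration; it is not part of the discussion). It shows that the two rejected examples are the UTF-16 and UTF-8 code units of U+10000, and that Python's re module accepts an escape naming the code point directly. Note that Python spells this escape \UXXXXXXXX rather than \x{...}.

```python
import re

CODE_POINT = 0x10000  # first supplementary code point
text = chr(CODE_POINT)

# Code units that encode U+10000. Spelling these out is NOT what RL1.1 asks for.
utf16 = text.encode("utf-16-be")
utf16_units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
utf8_units = list(text.encode("utf-8"))

# An escape that names the code point itself (Python's syntax for it).
by_code_point = re.compile(r"\U00010000")
matched = by_code_point.fullmatch(text) is not None
```

Here utf16_units holds D800 DC00 and utf8_units holds F0 90 80 80, matching the two examples above, while the pattern is written with the single value 10000.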
2. RL1.7 Supplementary Code Points
To meet this requirement, an implementation shall handle the full
range of Unicode code points, including values from U+FFFF to
U+10FFFF. In particular, where UTF-16 is used, a sequence
consisting of a leading surrogate followed by a trailing surrogate
shall be handled as a single code point in matching.
Add a note that it is permissible, but not required, to match an isolated surrogate code point (such as \x{D800}) in text that supports it (Unicode 16-bit strings and Unicode 32-bit characters).
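As a rough illustration of both points, the following Python sketch checks that a supplementary character is consumed as one unit and that an isolated surrogate can be matched where the string type allows it. It assumes Python 3.3 or later, where strings are sequences of code points and may contain lone surrogates; other engines may behave differently.

```python
import re

supplementary = "\U00010000"
sample = "a" + supplementary + "b"

# '.' should consume the supplementary character as a single code point.
dot_matches = re.findall(r".", sample)

# An isolated surrogate in text that supports it (permitted, not required).
lone_surrogate = "\ud800"
lone_match = re.fullmatch(r"\ud800", lone_surrogate) is not None

# The surrogate pattern should not match "inside" a supplementary character.
inside_match = re.search(r"\ud800", supplementary) is not None
```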
3. Conformance clause 0
C0. An implementation claiming conformance to this
specification at any Level shall identify the version of
this specification and the version of the Unicode Standard.
It is unclear whether we want to require an implementation to identify the specific version of Unicode.
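For a sense of what identifying the versions could look like in practice, here is a minimal Python sketch. The helper name conformance_statement and the placeholder spec version "X" are hypothetical; only unicodedata.unidata_version is an existing standard-library attribute, and it reports the Unicode version of Python's own character database.

```python
import unicodedata


def conformance_statement(spec_version: str) -> str:
    """Hypothetical helper: report the spec version and the Unicode data version."""
    return (
        f"UTS #18 version {spec_version}; "
        f"Unicode {unicodedata.unidata_version}"
    )


# "X" is a placeholder, not a real version number.
statement = conformance_statement("X")
```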
4. RL1.2 Properties
To meet this requirement, an implementation shall provide at
least a minimal list of properties, consisting of the following:
Make it even clearer that, in order to meet this requirement, the implementation has to follow the Unicode definitions of these properties, not other definitions. However, the names used for the properties might need to be different for compatibility. For example, if a regex engine already has “Alphabetic”, for compatibility it may need a different name, such as “Unicode_Alphabetic”.
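The naming problem is easy to demonstrate with an existing API. The sketch below uses Python's str.isalpha as a stand-in for a pre-existing "Alphabetic" test: it is defined in terms of the Letter general categories only, whereas the Unicode Alphabetic property also covers other characters, including those of category Nl such as U+2160 ROMAN NUMERAL ONE. The choice of Python and of this character is purely illustrative.

```python
import unicodedata

ROMAN_NUMERAL_ONE = "\u2160"  # general category Nl; Alphabetic=Yes in Unicode

# General category as recorded in Python's Unicode database.
category = unicodedata.category(ROMAN_NUMERAL_ONE)

# The pre-existing notion of "alphabetic" only accepts Letter categories,
# so it disagrees with the Unicode Alphabetic property for this character.
legacy_alphabetic = ROMAN_NUMERAL_ONE.isalpha()
```

An engine in this position could keep its existing name with the old meaning and expose the Unicode-defined property under a separate name.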
5. In addition, I think we should add a new Level 2 condition.
Add this to the proposed update UTS#18:
RL2.7 Full Property Support
To meet this requirement, an implementation shall provide all Unicode properties listed below.
This list will be populated by including the properties in Table 7. Property Index by Scope of Use (http://www.unicode.org/reports/tr44/#Property_Index), with the following exceptions:
Ed Note: Feedback is requested on whether exceptions should be added.