|
179 | Changes to Unicode Regular Expression Guidelines | 2011.07.25 |
Status: | Closed | |
Originator: | UTC | |
Resolution: | The issues discussed in PRI #179 have been folded into the text of UTS #18, and/or PRI #182. | |
Description of Issue:
The Unicode Consortium is considering changes to UTS #18 Unicode Regular Expressions (http://unicode.org/reports/tr18/). These proposed changes have arisen in connection with questions about case-insensitive and canonical-equivalent matching.
The proposed changes eliminate some requirements on implementations of Unicode Regular Expressions which have proven to be problematic in implementations, and add clarifications. The consortium is soliciting feedback on these changes.
Background
Part of the issue to be addressed is to define more precisely the connection between matching of regular expressions and equivalence relations among strings. Matching under equivalence relations can be stated more formally as:
Matching under Equivalence Relations. A regular expression R matches according to an equivalence relation E whenever for all strings S and T, if S is equivalent to T under E, then R matches S if and only if R matches T.
In the Unicode Standard, the relevant equivalence relation for case-insensitivity is established according to whether two strings case fold to the same value. The case folding can either be simple (a 1:1 mapping of code points) or full (with some 1:n mappings).
The equivalence relation for canonical equivalence is established by whether two strings are identical when normalized to NFD. That normalization involves n:m mappings and rearrangements of code points.
1. Full vs. simple case-insensitive matching
It is proposed to withdraw the recommendation for doing full case-insensitive matching in RL2.4 Default Loose Matches. The text would instead be modified to focus that section only on issues of case conversion.
The text currently reads:
RL2.4 Default Loose Matches
To meet this requirement:
The new text would read:
RL2.4 Default Case Conversion
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the full, default Unicode case conversion.
There are two reasons for removing full case-insensitive matching:
It is feasible to describe how to transform text into the fully-case-folded form, and construct regular expressions targeted at such text. So the discussion in UTS #18 would be changed to focus on such guidelines and not state them as requirements.
Note: The obsolete link in the text of UTS #18 “To correctly implement a caseless match and case conversions, see UAX #21: Case Mappings [Case].” would also be corrected to reference current information about casing in the standard.
2. Canonical-equivalent matching
It is proposed to withdraw the recommendation for doing full canonical-equivalence matching in RL2.1 Canonical Equivalents. The current text reads:
RL2.1 Canonical Equivalents
To meet this requirement, an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match.
The way most regular expression engines work, this requirement cannot be satisfied. The reason that it cannot be satisfied results from the fact that canonical equivalence may involve reordering, splitting, or merging of characters. For example, all of the following sequences are canonically equivalent:
The regular expression pattern /o\x{31B}/ matches the first two characters of #1, the first and third characters of #2, the first character of #3, part of the first character together with the third character of #4, and part of the character in #5. Some of these issues are brought out in the text of UTS #18, but implementing RL2.1 is infeasible, because in practice regex APIs are not set up to match parts of characters or handle discontiguous selections.
There are many other edge cases: A combining mark may come from some part of the pattern far removed from where the base character was, or may not explicitly be in the pattern at all. It is also unclear what /./ should match. It is also unclear how regular expression back references should work.
It is feasible to describe how to construct patterns that will match against NFD (or NFKD) text, and the description in UTS #18 will be changed to reflect that. That is, it will describe a process whereby:
Note that the author of the pattern must know the normalization form of the text, and write the pattern accordingly.
3. Case-insensitive matching with properties
It is proposed to add text to UTS #18 to describe more precisely how to match text case-insensitively. The discussion will outline how a regular expression pattern P can be made to match insensitively, by making the following changes in the interpretation of P:
For both property character classes and explicit character classes, closing under simple case-insensitivity means including characters not in the set. For example:
/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/ = /[]/
(note that Latin_1_Supplement block is identical to U+0080..U+00FF)
Under that alternative approach, in which the block property was not case folded, the following would be true:
/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/ = /[Å Ÿ]/
Also under that alternative approach, an implementation cannot fully resolve a character class containing properties, and then apply case-closure; instead, it must apply case-closure selectively as the character class is interpreted.
To supply feedback on this issue, see Feedback.