Re: Draft PRI Text on Unicode Regular
Expression Guidelines
Source: Mark Davis
Date: 2011/02/10
The Unicode consortium is considering changes to the Unicode Regular Expression Guidelines (http://unicode.org/reports/tr18/) that have arisen in connection with case-insensitive and canonical-equivalent matching, and is soliciting feedback on these changes.
Background
Part of the issue is defining more precisely the connection between matching of regular expressions and equivalence relations among strings. More formally:
Matching under Equivalence Relations. A regular expression R matches according to an equivalence relation E whenever for all strings S and T, if S is equivalent to T under E, then R matches S if and only if R matches T.
For case-insensitivity, the equivalence relation is established in Unicode according to whether two strings case-fold to the same value. The case folding can either be simple (a 1:1 mapping of code points) or full (which has 1:N mappings). For canonical equivalence, the equivalence relation is established by whether two strings have the same NFD format, which has N:M mappings.
Examples:
1. Full vs Simple case-insensitive matching
The consortium is planning to withdraw the recommendation for doing full case-insensitive matching in RL2.4 Default Loose Matches. The text would be modified to focus this section on conversion.
The text currently reads:
RL2.4 Default Loose Matches
To meet this requirement:
The new text would read:
RL2.4 Default Case Conversion
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the full, default Unicode case conversion.
There are two reasons for removing full case-insensitive matching:
What is feasible is to describe how to transform text into the fully-case-folded form, and construct regular expressions targeted at such text. So the text of UTS#10 focus on such guidelines and not requirements.
Note: the text “To correctly implement a caseless match and case conversions, see UAX #21: Case Mappings [Case].” would also be updated to become a current reference.
2. Canonical-equivalent matching
The consortium is planning to withdraw the recommendation for doing full canonical-equivalence matching in RL2.1 Canonical Equivalents. The current text has:
RL2.1 Canonical Equivalents
To meet this requirement, an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match.
However, the way most regular expression engines work, this requirement cannot be satisfied. The problem is that canonical equivalence may involve reordering of characters. For example, each of the following are equivalent:
The regular expression pattern /o\x{31B}/ matches the first two characters of #1, the first and third characters of #2, the first character of #3, part of the first character together with the third character of #4, and part of the character of #5. Some of these issues are brought out in the text of UTS#18, but implementing a strict reading of RL2.1 is infeasible, because in practice regex APIs are not set up to match parts of characters or handle discontiguous selections.
There are many other edge cases: A combining mark may come from some part of the pattern far removed from where the base character was, or may not explicitly be in the pattern at all. It is unclear what a dot should match, or how back references should work.
What is feasible is to construct patterns that will match against NFD (or NFKD) text, and the text of UTS#18 should reflect that. That is, it should describe a process whereby
Note that the author of the pattern must know the normalization form of the text, and write the pattern accordingly.
3. Case-insensitive matching with properties
The consortium is planning to describe more precisely how to match case-insensitively.
In particular, a regular expression pattern P can be made to match insensitively by making the following changes in the interpretation of P:
For both property character classes and explicit character classes, closing under simple case-insensitivity means including characters not in the set.
There have been suggestions to restrict case insensitive regex matching so that it would not apply to some or all property-based character classes. One suggestion, for example, is to close all of the POSIX properties under case, but not others. That would require some narrower notion of matching under an equivalence than that presented in Matching under Equivalence Relations above in the Background section. For example, under Matching under Equivalence Relations, the following is true:
/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/ = /[]/
(note that Latin_1_Supplement block is identical to U+0080..U+00FF)
Under a system whereby the block property was not case folded, the following would be true:
/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/ = /[Å Ÿ]/
If the block property were not case folded, it would also mean that an implementation cannot fully resolve a character class containing properties, and then apply case-closure; instead, it must apply case-closure selectively as the character class is interpreted.