Mark Davis, 2005-01-24
The following were missed when adjustments were made for Thai/Lao. So delete the following sections:
3.1.3 Rearrangement
Certain characters are not coded in logical order, such as the Thai vowels เ through ไ and the Lao vowels ເ through ໄ (this list is indicated by the Logical_Order_Exception property). For collation, they are rearranged by swapping with the following character before further processing, since logically they belong afterwards. For example, here is a string processed by rearrangement:
input string: 0E01 0E40 0E02 0E03
normalized string: 0E01 0E02 0E40 0E03
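As an illustration only, the rearrangement step described in this section can be sketched as follows. The set of exception characters is hard-coded to the Thai and Lao ranges named above (an assumption; a real implementation would read the Logical_Order_Exception property from the Unicode Character Database), and the function name `rearrange` is hypothetical.

```python
# Assumption: only the Thai (U+0E40..U+0E44) and Lao (U+0EC0..U+0EC4)
# prefixed vowels named in the text are treated as exceptions.
LOGICAL_ORDER_EXCEPTIONS = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))


def rearrange(code_points):
    """Swap each logical-order-exception character with the character after it."""
    out = list(code_points)
    i = 0
    while i < len(out) - 1:
        if out[i] in LOGICAL_ORDER_EXCEPTIONS:
            # Swap with the following character, then skip past both.
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    # An exception character in final position has nothing to swap with
    # and is left where it is (a choice made for this sketch).
    return out


example = rearrange([0x0E01, 0x0E40, 0x0E02, 0x0E03])
print(" ".join(f"{cp:04X}" for cp in example))
```

With the input from the example above, the sketch produces the same sequence as the normalized string shown there.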
In Section 8, Searching and Matching (Informative):
The interactions of other conditions with the matching types (minimal, maximal, medial) need to be clarified. Consider the following example.
| | Value | Notes |
|---|---|---|
| Pattern: | abc | |
| Strength: | primary | thus ignoring combining marks, punctuation |
| Text: | abc¸-°d | two combining marks, cedilla and ring |
| Matches: | \|abc\|¸\|-\|°\|d | four possible endpoints, indicated by \| |
When an additional condition is set on the match, the types (minimal, maximal, medial) are based on the matches that meet that condition. For example, if the condition is Whole Grapheme, then the matches are restricted to "abc¸|-°|d", discarding the match positions that would not fall on a grapheme cluster boundary. The minimal match would then be "abc¸|-°d".
The changes to the text would include explaining the above situation in the introductory text in that section, and changing DS5 and moving it. I suggest the following:
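The filtering described here can be sketched in a few lines. This is a rough illustration, not a conformant implementation: the candidate end offsets are copied by hand from the table, the text is assumed to use the combining characters U+0327 and U+030A, and `is_cluster_boundary` is a hypothetical stand-in that only checks whether the next character is a combining mark, rather than applying the full UAX #29 rules.

```python
import unicodedata

# Assumption: the text uses COMBINING CEDILLA and COMBINING RING ABOVE.
text = "abc\u0327-\u030Ad"

# End offsets of the four possible matches from the table (start offset 0).
candidate_ends = [3, 4, 5, 6]


def is_cluster_boundary(s, offset):
    """Simplified boundary test: no combining mark starts at this offset."""
    if offset <= 0 or offset >= len(s):
        return True
    return unicodedata.combining(s[offset]) == 0


# Keep only the endpoints that satisfy the additional condition.
allowed_ends = [e for e in candidate_ends if is_cluster_boundary(text, e)]

# The match types are then chosen among the remaining endpoints only.
minimal_match = text[0:min(allowed_ends)]
maximal_match = text[0:max(allowed_ends)]

print(allowed_ends)
```

Under these assumptions the endpoints after "abc" and after the hyphen are discarded, so the minimal match includes the cedilla and the maximal match includes the hyphen and ring.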
Delete current DS5.
Add
DS1a. A boundary condition is a test imposed on an offset within a string. Examples include Whole Grapheme Cluster Search and Whole Word Search, as defined in UAX #29 (see [Breaks]).
By using grapheme-complete conditions, contractions and combining sequences are not interrupted. This also avoids the need to present visually discontiguous selections to the user (except for BIDI text).
Revise the following:
Suppose there is a collation C, a pattern string P and a target string Q. C has some particular set of attributes, such as a strength setting, and choice of variable weighting.
DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e].
to
Suppose there is a collation C, a pattern string P and a target string Q, and a boundary condition B. C has some particular set of attributes, such as a strength setting, and choice of variable weighting.
DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the condition B.
DS2b. A match is grapheme-complete if B requires that the offset be at a grapheme cluster boundary. Note that Whole Word Search as defined in UAX #29 is grapheme-complete (see [Breaks]).
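To make the revised DS2 concrete, here is a minimal sketch of the match test with the boundary condition B passed in as a predicate. Everything in it is an assumption made for illustration: `toy_primary_key` is a crude stand-in for a primary-strength sort key (it drops combining marks and punctuation and folds case, and is not the UCA), Q[s,e] is read as the half-open slice from s to e, and `no_mark_at` only approximates a grapheme cluster boundary.

```python
import unicodedata


def toy_primary_key(s):
    """Hypothetical stand-in for a primary-strength sort key (not the UCA)."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = [
        ch for ch in decomposed
        if unicodedata.combining(ch) == 0
        and not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept).casefold()


def no_mark_at(q, offset):
    """Approximate grapheme-complete condition B on a single offset."""
    if offset <= 0 or offset >= len(q):
        return True
    return unicodedata.combining(q[offset]) == 0


def is_match(p, q, s, e, key=toy_primary_key, b=no_mark_at):
    """DS2 as revised: equal sort keys, and both offsets meet condition B."""
    return key(p) == key(q[s:e]) and b(q, s) and b(q, e)


q = "abc\u0327-\u030Ad"  # assumed text: combining cedilla and ring
ends_with_b = [e for e in range(len(q) + 1) if is_match("abc", q, 0, e)]
ends_without_b = [
    e for e in range(len(q) + 1)
    if is_match("abc", q, 0, e, b=lambda text, off: True)
]
print(ends_without_b, ends_with_b)
```

Passing a condition that always holds recovers the original DS2, which is why B is modelled here as a separate argument rather than folded into the key comparison.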
I think we should also add some more explanatory text about combining marks. Those can be a bit tricky!
We don't give a way for people to specifically claim conformance to matching and searching according to Section 8. Suggest (a) removing "(informative)" from the title, and (b) adding:
C5 An implementation claiming conformance to Matching and Searching according to UTS #10 shall meet the requirements described in Section 8.