Wild Card Collation Matches
richard.wordingham at ntlworld.com
Sun Jun 1 20:36:14 CDT 2014
In a fairly wild environment
I encountered the following question:
"If you search for ก* do you expect to return words such as เก่ง and ไก่?"
Now, as a regular expression, in UTS#18 'Unicode Regular Expressions'
Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close
to this with ranges tailored for collation. The pattern
[\u0E01-\u0E02]* would match both those words. To be precise, one
would use a search for [ก-ไก]*. RL3.5 has been withdrawn because
of difficulties, though I can't say that I see it as a major difficulty
that at least one of [A-z] and [a-Z] is empty. Even POSIX is aware of
that little issue.
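
As a rough illustration of why the ranges have to be tailored for collation, the sketch below tries the same range with Python's re module. The assumption here is that re compares range endpoints by code point, with no collation tailoring; the helper names are made up for the example.

```python
import re

# Code point range from KO KAI to KHO KHAI, with no collation tailoring.
CODE_POINT_RANGE = re.compile(r"[\u0E01-\u0E02]*")

def fully_matches(word: str) -> bool:
    """True if every character of word lies in the code point range."""
    return CODE_POINT_RANGE.fullmatch(word) is not None

def range_is_accepted(pattern: str) -> bool:
    """True if Python's re module compiles the pattern without error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Both words begin with a preposed vowel (U+0E40 or U+0E44), which lies
# outside U+0E01..U+0E02, so the untailored range matches neither.
print(fully_matches("\u0E40\u0E01\u0E48\u0E07"))  # เก่ง -> False
print(fully_matches("\u0E44\u0E01\u0E48"))        # ไก่ -> False

# In code point order 'a' (U+0061) sorts after 'Z' (U+005A), so [a-Z]
# is a reversed range; re rejects it, while [A-z] compiles.
print(range_is_accepted("[A-z]"))  # True
print(range_is_accepted("[a-Z]"))  # False
```

A collation-tailored range would instead compare collation keys, which is what lets a word written with a preposed vowel fall between ก and ข.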
Turning to a fully collation-based definition of searching, UTS#10
Unicode Collation Algorithm's definition DS2 comes closest to answering
the question for the UTC. DS2 reads:
DS2. The pattern string P has a match at Q[s,e] according to collation
C if C generates the same sort key for P as for Q[s,e], and the offsets
s and e meet the boundary condition B. One can also say P has a match
in Q according to C.
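
DS2 can be read almost directly as a brute-force search. The following is only a sketch: the sort key function and the boundary condition are passed in by the caller, and the stand-ins used in the demonstration (str.casefold as the "collation", and a boundary condition that accepts everything) are assumptions for illustration, not anything defined in UTS#10. Offsets are Python string indices, i.e. code point offsets.

```python
from typing import Callable, Hashable, Iterator, Tuple

SortKey = Callable[[str], Hashable]
Boundary = Callable[[str, int, int], bool]

def ds2_matches(p: str, q: str, sort_key: SortKey,
                boundary: Boundary) -> Iterator[Tuple[int, int]]:
    """Yield every (s, e) such that q[s:e] has the same sort key as p
    and the offsets s and e satisfy the boundary condition."""
    target = sort_key(p)
    for s in range(len(q) + 1):
        for e in range(s, len(q) + 1):
            if sort_key(q[s:e]) == target and boundary(q, s, e):
                yield (s, e)

def casefold_key(text: str) -> str:
    """Stand-in 'collation': strings compare equal ignoring case."""
    return text.casefold()

def any_boundary(q: str, s: int, e: int) -> bool:
    """Stand-in boundary condition B that accepts every pair of offsets."""
    return True

print(list(ds2_matches("ko", "KoKo", casefold_key, any_boundary)))
# [(0, 2), (2, 4)]
```

Note that nothing in this reading of DS2 says what a wildcard such as * means; the pattern P is an ordinary string.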
It's a soft job to create sequences of codepoints P starting with
U+0E01 THAI CHARACTER KO KAI that are tertiary matches for เก่ง and
ไก่ under both DUCET and the CLDR collations for Thai. Can I therefore
say that the two strings match the pattern ก* according to these
collations? (A pattern P for ไก่ <U+0E44 THAI CHARACTER SARA AI
MAIMALAI, U+0E01 THAI CHARACTER KO KAI, U+0E48 THAI CHARACTER MAI EK> is
P = <U+0E01, U+034F COMBINING GRAPHEME JOINER, U+0E44, U+0E48>.)
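
The construction of P can be imitated with a toy sort key. This is emphatically not DUCET or the CLDR Thai tailoring: it is a two-level sketch that assumes only three behaviours, namely that a preposed vowel followed by a consonant is weighted as consonant then vowel, that CGJ is completely ignorable, and that tone marks contribute only at the secondary level. Code points stand in for real collation weights, and all names are hypothetical.

```python
CGJ = "\u034F"
PREPOSED_VOWELS = set("\u0E40\u0E41\u0E42\u0E43\u0E44")
TONE_MARKS = set("\u0E48\u0E49\u0E4A\u0E4B")

def is_thai_consonant(ch: str) -> bool:
    """True for U+0E01 KO KAI through U+0E2E HO NOKHUK."""
    return "\u0E01" <= ch <= "\u0E2E"

def toy_sort_key(text: str):
    """Return (primary, secondary) tuples for a toy Thai-like collation."""
    primary, secondary = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PREPOSED_VOWELS and nxt and is_thai_consonant(nxt):
            # Vowel + consonant is weighted as consonant, then vowel.
            primary.extend([nxt, ch])
            i += 2
        elif ch == CGJ:
            i += 1  # completely ignorable in this toy
        elif ch in TONE_MARKS:
            secondary.append(ch)
            i += 1
        else:
            primary.append(ch)
            i += 1
    return tuple(primary), tuple(secondary)

word = "\u0E44\u0E01\u0E48"           # ไก่
pattern = "\u0E01\u034F\u0E44\u0E48"  # P from the text, starting with ก
print(toy_sort_key(word) == toy_sort_key(pattern))  # True
```

Under these assumptions the key for ไก่ begins with the weight for ก, which is why a P beginning with U+0E01 can be found at all.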
Disturbingly, another possible answer is that there is no match for
<U+0E01 THAI CHARACTER KO KAI> in either string because it only occurs
in the legacy/extended grapheme cluster <U+0E01, U+0E48>.
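
The grapheme cluster objection can be checked with a crude segmentation. The sketch below only approximates the UAX #29 rules: it assumes that a character of general category Mn or Me attaches to the preceding character and that every other character starts a new cluster, which is enough for these two words but is not a full implementation. The function names are hypothetical.

```python
import unicodedata

def approx_clusters(text: str) -> list:
    """Split text into approximate grapheme clusters: each nonspacing
    or enclosing mark (Mn, Me) joins the cluster before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def occurs_as_whole_cluster(p: str, text: str) -> bool:
    """True if p is exactly one of the approximate clusters of text."""
    return p in approx_clusters(text)

KO_KAI = "\u0E01"
for word in ("\u0E40\u0E01\u0E48\u0E07", "\u0E44\u0E01\u0E48"):  # เก่ง, ไก่
    print(approx_clusters(word), occurs_as_whole_cluster(KO_KAI, word))
```

In both words U+0E01 is followed by U+0E48 MAI EK, so it never forms a cluster on its own, and a boundary condition B that requires whole clusters rejects the match.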