PRI#203: UTS#10 (UCA) update : generalisation of asymmetric searches

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 31 Aug 2011 04:47:41 +0200

The addition of the "asymmetric" search (in the new section 8.2)
in fact reveals another need: users will want a variable strength
to match the various parts of the search string.
For now, the addition seems to assume that *only* the collation
elements that have the lowest collation weight at a given level (lower
than or equal to the global search strength) behave loosely at that
level, meaning that they match all the other collation elements that
have the same collation weights at the preceding levels.
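
As a rough illustration of the asymmetric rule described above, here is
a minimal Python sketch. It does not use the DUCET or any real collation
library: the toy_weights() helper is a hypothetical stand-in that derives
three toy levels from the Unicode decomposition (base letter, combining
marks, case), and matching is done character by character on NFC strings,
which is an assumption that only holds for simple Latin text.

```python
import unicodedata


def toy_weights(ch):
    """Return toy (primary, secondary, tertiary) weights for one character.

    Hypothetical stand-in for real collation elements:
    primary = case-folded base letter, secondary = combining marks
    ("" means unmarked), tertiary = 1 for uppercase, 0 otherwise.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    tertiary = 1 if base != base.lower() else 0
    return base.casefold(), marks, tertiary


def asymmetric_match(pattern, candidate):
    """Asymmetric comparison: unmarked pattern characters match loosely,
    marked ones (accented or uppercase) must be matched exactly."""
    p = unicodedata.normalize("NFC", pattern)
    c = unicodedata.normalize("NFC", candidate)
    if len(p) != len(c):
        return False
    for pc, cc in zip(p, c):
        pp, ps, pt = toy_weights(pc)
        cp, cs, ct = toy_weights(cc)
        if pp != cp:
            return False          # base letters always have to agree
        if ps and ps != cs:
            return False          # accented pattern char: accent must agree
        if pt and pt != ct:
            return False          # uppercase pattern char: case must agree
    return True


print(asymmetric_match("resume", "Résumé"))   # True
print(asymmetric_match("résumé", "resume"))   # False
print(asymmetric_match("Resume", "resume"))   # False
```

In this toy model the looseness is decided entirely by which characters
the user happens to type, not by anything the user states explicitly.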

What is really needed is to create regular expressions where, for
example, some parts would need to match at tertiary strength (level 3)
while other parts would match more loosely at a lower strength. Given
the way the DUCET is built, as well as all standard tailorings for
languages in the CLDR, the asymmetric rule means that (for example with
a search strength level 3) all lowercase letters in the search string
would necessarily match loosely, independently of case, and all
uppercase letters would need to match more exactly.

Doesn't it mean that regular expressions could specify this strength explicitly?

For example, searching for "résumé" matches all case combinations (and
all combinations of accents or diacritics, except on "é"), but isn't
there also a need to match the other letters more exactly, while
still preserving the loose case matching? We could ask, for example, to
search the initial "r" with loose case matching and loose diacritics
matching, then "é" with loose case matching but exact matching of the
accent, and so on. Parts of the regexp would then specify the strength
at which they match. It would also become possible to have stricter
matches for some or all lowercase letters, and loose case matching
only for uppercase letters.
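
To make the idea of per-part strengths concrete, here is a minimal
Python sketch. Everything in it is an assumption made for illustration:
the pattern is a plain list of (character, strength) pairs rather than a
real regexp syntax, toy_weights() is a hypothetical stand-in for real
collation elements (base letter, combining marks, case), and the
comparison is done character by character on NFC strings.

```python
import unicodedata


def toy_weights(ch):
    """Toy (primary, secondary, tertiary) weights: case-folded base
    letter, combining marks, and 1 for uppercase / 0 otherwise."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    tertiary = 1 if base != base.lower() else 0
    return base.casefold(), marks, tertiary


def match_with_strengths(segments, candidate):
    """Compare each pattern character at its own explicit strength.

    strength 1 = base letter only, 2 = base letter + accents,
    3 = base letter + accents + case.
    """
    c = unicodedata.normalize("NFC", candidate)
    if len(segments) != len(c):
        return False
    for (pc, strength), cc in zip(segments, c):
        # Keep only the first `strength` levels of each weight tuple.
        if toy_weights(pc)[:strength] != toy_weights(cc)[:strength]:
            return False
    return True


# "r", "s", "u", "m": loose case and loose diacritics (strength 1);
# "é": loose case but exact accent (strength 2).
PATTERN = [("r", 1), ("é", 2), ("s", 1), ("u", 1), ("m", 1), ("é", 2)]

print(match_with_strengths(PATTERN, "Résumé"))   # True
print(match_with_strengths(PATTERN, "RESUME"))   # False
```

Raising the strength of the first pair to ("r", 3) would make the
lowercase "r" match strictly, which the asymmetric rule alone cannot
express.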

The "asymmetric search" described in this update is only an
oversimplification of a more general problem, and supposes that users
of regular expressions know exactly which characters are "marked" or
"unmarked" at each level, something that is absolutely not obvious
when you have very large collation tables (it is not obvious, for
example, that lowercase letters in search regexps mean loose matching
and uppercase letters mean strict matching of case *and* diacritics).

-- Philippe.
Received on Tue Aug 30 2011 - 21:50:42 CDT
