Re: Case blind comparison

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 29 1997 - 20:56:56 EDT


>
> Thanks for the information, but I don't understand why this is
> important for a `case-folded' `loose comparison'.

Ah, 'loose comparison' -- there's the rub. Once you move beyond
simple case normalization, you're dealing with an instance of the
more generic collation and equivalence problem. Essentially a
loose comparison sets up equivalence classes whose members are
treated as equal, even though they are distinct characters (or
even sequences of characters).

Given the fact that there are a few instances of alternate forms
for lowercase letters, which share a single uppercase form, you
could use uppercasing as a quick-and-dirty way of making an
equivalence class for those lowercase letters and their shared
uppercase form. This would also work, for instance, for the
Greek sigma and final sigma. But it doesn't work if the alternative
form happens to be the uppercase. Note the Greek alternative forms
for the uppercase upsilon, which match only a single lowercase
form for the upsilon.
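
A rough sketch of that asymmetry, in Python. It assumes the interpreter's
built-in str.upper() follows the current Unicode case mappings, which
postdate this message; the helper name upper_key is made up for
illustration.

```python
# Uppercasing used as a quick-and-dirty equivalence key.
SIGMA = "\u03c3"         # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
UPSILON = "\u03c5"       # GREEK SMALL LETTER UPSILON
UPSILON_HOOK = "\u03d2"  # GREEK UPSILON WITH HOOK SYMBOL (uppercase variant)


def upper_key(s: str) -> str:
    """Hypothetical helper: the comparison key is just the uppercased string."""
    return s.upper()


# Two lowercase variants share one uppercase form, so they collapse together.
sigma_same = upper_key(SIGMA) == upper_key(FINAL_SIGMA)

# The variant here is itself uppercase, so uppercasing leaves it untouched
# and it never meets the uppercased plain upsilon.
upsilon_same = upper_key(UPSILON) == upper_key(UPSILON_HOOK)

print(sigma_same, upsilon_same)
```

The sigma pair lands in one class; the upsilon pair does not.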

But this only opens the door to plenty of other oddball cases.
Scan unidata2.txt for other instances of compatibility equivalences
to Latin letters. Most of the letterlike symbols (U+2100 ff) cause
a problem. Should the Angstrom symbol be equated to the letter a-ring
or not? Uppercasing alone won't cover you there.
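
To illustrate, here is a small Python check using the standard
unicodedata module. Using NFKC normalization as the equating step is an
assumption made for the sketch; whether these characters *should* be
equated remains a policy decision for the application.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"    # ANGSTROM SIGN
A_RING = "\u00c5"           # LATIN CAPITAL LETTER A WITH RING ABOVE
DOUBLE_STRUCK_C = "\u2102"  # DOUBLE-STRUCK CAPITAL C (a letterlike symbol)

# Uppercasing leaves both characters as they are, so they stay distinct.
upper_equal = ANGSTROM_SIGN.upper() == A_RING.upper()

# Normalization (an assumed choice here) maps the symbol onto the letter.
nfkc_equal = (
    unicodedata.normalize("NFKC", ANGSTROM_SIGN)
    == unicodedata.normalize("NFKC", A_RING)
)

# Other letterlike symbols fold to plain Latin letters in the same way.
c_folded = unicodedata.normalize("NFKC", DOUBLE_STRUCK_C)

print(upper_equal, nfkc_equal, c_folded)
```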

>
> From a user standpoint, they are asking for a case blind comparison.
> What characters do they want to be equal?
>
> For example, we have:
>
> U+0053 LATIN CAPITAL LETTER S
> U+0073 LATIN SMALL LETTER S
> U+017F LATIN SMALL LETTER LONG S

What about U+02E2 MODIFIER LETTER SMALL S ? Once you think you
are going to be running into data like the long s, you'd be better
off generalizing your concept of the equivalence class behind the
loose comparison, rather than relying on an uppercase transform
to get the right answer for you.

A generalized approach allows the case-blind comparison to be a
particular instance of the loose comparison, comparable to an
accent-blind comparison or a case-and-accent-blind comparison, etc.
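
One way to picture such a generalized approach is a single key function
with switches for what to ignore. The Python sketch below is only an
illustration: the name loose_key, its flags, and the use of compatibility
decomposition to define the base equivalence classes are all assumptions,
not something prescribed by the standard or by this message.

```python
import unicodedata


def loose_key(s: str, *, ignore_case: bool = False,
              ignore_accents: bool = False) -> str:
    """Hypothetical comparison key; equal keys mean 'loosely equal'."""
    # Assumed base classes: compatibility decomposition folds variants such
    # as long s and modifier letter small s onto plain "s".
    key = unicodedata.normalize("NFKD", s)
    if ignore_accents:
        # Drop combining marks exposed by the decomposition.
        key = "".join(ch for ch in key if not unicodedata.combining(ch))
    if ignore_case:
        key = key.casefold()
    # Renormalize so the key does not depend on the order of the steps above.
    return unicodedata.normalize("NFKC", key)


# Case-blind comparison is just one setting of the same machinery.
print(loose_key("S", ignore_case=True) == loose_key("\u017f", ignore_case=True))
print(loose_key("\u02e2") == loose_key("s"))
print(loose_key("\u00c9", ignore_case=True, ignore_accents=True))
```

With this shape, case-blind, accent-blind, and case-and-accent-blind
comparisons differ only in the flags passed, not in the code path.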

>
> 1. If I map to upper case, then these all map to U+0053, and are therefore
> equal.
>
> 2. If I map to lower case, then U+0053 is equal to U+0073, but these
> are different from U+017F.
>
> My intuitive understanding of case blind comparison agrees with 1,
> and would be surprised by 2.
>
> (One could argue that I should have already mapped U+017F to U+0073
> before ever considering case, but I am rather reluctant to do this,
> because case sensitive comparison is often used for exact matching.)

I think you need to distinguish between:

    1. exact binary matching ( a-acute != a + combining acute )
    2. exact match on canonical equivalence ( a-acute = a + combining acute)
    3. case sensitive match on equivalence classes for particular collation
         (where, for example, s = long s = modifier letter small s)
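
The three levels above can be sketched in Python as follows. The function
names are invented for illustration, and standing in NFKC normalization
for "equivalence classes for a particular collation" is an assumption; a
real collation would define its own classes.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Level 1: exact code point match."""
    return a == b


def canonical_equal(a: str, b: str) -> bool:
    """Level 2: match on canonical equivalence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def class_equal(a: str, b: str) -> bool:
    """Level 3 (assumed classes): case-sensitive match after NFKC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


A_ACUTE = "\u00e1"        # precomposed a-acute
A_PLUS_ACUTE = "a\u0301"  # a + COMBINING ACUTE ACCENT
LONG_S = "\u017f"         # LATIN SMALL LETTER LONG S
MODIFIER_S = "\u02e2"     # MODIFIER LETTER SMALL S

print(binary_equal(A_ACUTE, A_PLUS_ACUTE))     # level 1 keeps them apart
print(canonical_equal(A_ACUTE, A_PLUS_ACUTE))  # level 2 equates them
print(class_equal("s", LONG_S), class_equal("s", MODIFIER_S))
print(class_equal("s", "S"))                   # still case sensitive
```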

--Ken Whistler

>
> So, mapping to upper case seems to provide what I would expect users
> to want (should any of our users ever have U+017F in their database
> application).



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:36 EDT