Re: Case blind comparison (for example in search engines)

From: Alain LaBonté SCT (alb@sct.gouv.qc.ca)
Date: Mon Aug 11 1997 - 16:25:00 EDT


At 17:55 97-07-29 -0700, Kenneth Whistler wrote:

[Author not specified]:
>> Thanks for the information, but I don't understand why this is
>> important for a `case-folded' `loose comparison'.

[Kenneth]:
>Ah, 'loose comparison' -- there's the rub. Once you move beyond
>simply a case normalization, you're dealing with an instance of the
>more generic collation and equivalence problem. Essentially a
>loose comparison sets up equivalence classes whose members are
>treated as equal, even though they are distinct characters (or
>even sequences of characters).
>
>Given the fact that there are a few instances of alternate forms
>for lowercase letters, which share a single uppercase form, you
>could use uppercasing as a quick-and-dirty way of making an
>equivalence class for those lowercase letters and their shared
>uppercase form. This would also work, for instance, for the
>Greek sigma and final sigma. But it doesn't work if the alternative
>form happens to be the uppercase. Note the Greek alternative forms
>for the uppercase upsilon, which match only a single lowercase
>form for the upsilon.
>
>But this only opens the door to plenty of other oddball cases.
>Scan unidata2.txt for other instances of compatibility equivalences
>to Latin letters. Most of the letterlike symbols (U+2100 ff) cause
>a problem. Should the Angstrom symbol be equated to the letter a-ring
>or not? Uppercasing alone won't cover you there.
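
A minimal Python sketch of the point above, assuming the case mappings that ship with the interpreter's Unicode database (the helper name `same_when_uppercased` is made up for illustration). It shows uppercasing merging the two lowercase sigmas, and failing for the upsilon alternative form and the Angstrom sign:

```python
# Uppercasing as a quick-and-dirty equivalence class, and where it breaks down.
SIGMA = "\u03C3"         # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"   # GREEK SMALL LETTER FINAL SIGMA
UPSILON = "\u03C5"       # GREEK SMALL LETTER UPSILON
UPSILON_HOOK = "\u03D2"  # GREEK UPSILON WITH HOOK SYMBOL (an uppercase alternative form)
A_RING = "\u00E5"        # LATIN SMALL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN


def same_when_uppercased(a, b):
    """Hypothetical helper: treat two strings as equal if their uppercase forms match."""
    return a.upper() == b.upper()


# Two lowercase forms sharing one uppercase form: uppercasing merges them.
works = [same_when_uppercased(SIGMA, FINAL_SIGMA)]

# The alternative form is itself uppercase, so uppercasing leaves it distinct.
fails = [
    same_when_uppercased(UPSILON, UPSILON_HOOK),
    same_when_uppercased(A_RING, ANGSTROM),
]

print(works, fails)
```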
 
[Author not specified]:
>> From a user standpoint, they are asking for a case blind comparison.
>> What characters do they want to be equal?

[Author not specified]:
>> For example, we have:
>>
>> U+0053 LATIN CAPITAL LETTER S
>> U+0073 LATIN SMALL LETTER S
>> U+017F LATIN SMALL LETTER LONG S

[Kenneth]:
>What about U+02E2 MODIFIER LETTER SMALL S ? Once you think you
>are going to be running into data like the long s, you'd be better
>off generalizing your concept of the equivalence class behind the
>loose comparison, rather than relying on an uppercase transform
>to get the right answer for you.
>
>A generalized approach allows the case-blind comparison to be a
>particular instance of the loose comparison, comparable to an
>accent-blind comparison or a case-and-accent-blind comparison, etc.

[Author not specified]:
>> 1. If I map to upper case, then these all map to U+0053, and are therefore
>> equal.
>>
>> 2. If I map to lower case, then U+0053 is equal to U+0073, but these
>> are different from U+017F.
>>
>> My intuitive understanding of case blind comparison agrees with 1,
>> and would be surprised by 2.
>>
>> (One could argue that I should have already mapped U+017F to U+0073
>> before ever considering case, but I am rather reluctant to do this,
>> because case sensitive comparison is often used for exact matching.)
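
The difference between options 1 and 2 can be checked directly. This is only a sketch using Python's built-in `str.upper` and `str.lower`; the variable names are illustrative:

```python
# The three characters from the example: S, s, and long s.
S_FORMS = ["\u0053", "\u0073", "\u017F"]

# Option 1: map to upper case. All three collapse to a single key.
upper_keys = {ch.upper() for ch in S_FORMS}

# Option 2: map to lower case. Long s has no lowercase mapping to U+0073,
# so it stays in a class of its own.
lower_keys = {ch.lower() for ch in S_FORMS}

print(len(upper_keys), len(lower_keys))
```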

[Kenneth]:
>I think you need to distinguish between:
>
> 1. exact binary matching ( a-acute != a + combining acute )
> 2. exact match on canonical equivalence ( a-acute = a + combining acute)
> 3. case sensitive match on equivalence classes for particular collation
> (where, for example, s = long s = modifier letter small s)
>
>--Ken Whistler
>
>>
>> So, mapping to upper case seems to provide what I would expect users
>> to want (should any of our users ever have U+017F in their database
>> application).
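
Ken's three kinds of matching might be sketched as follows. The first two functions rely on Python's standard `unicodedata.normalize`; the third uses a tiny hand-written table (`S_CLASS`, a hypothetical name) standing in for the equivalence classes of a real collation, which would be far larger:

```python
import unicodedata


def binary_equal(a, b):
    """1. Exact binary match: code point for code point."""
    return a == b


def canonical_equal(a, b):
    """2. Exact match on canonical equivalence, via canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hypothetical equivalence class for illustration only:
# s = long s = modifier letter small s.
S_CLASS = {"s": "s", "\u017F": "s", "\u02E2": "s"}


def class_equal(a, b):
    """3. Case-sensitive match on equivalence classes (toy table above)."""
    def key(text):
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(S_CLASS.get(ch, ch) for ch in decomposed)
    return key(a) == key(b)


precomposed = "\u00E1"   # a-acute as one code point
combining = "a\u0301"    # a + combining acute
print(binary_equal(precomposed, combining), canonical_equal(precomposed, combining))
```

Note that `class_equal` stays case sensitive: "Mist" and "mist" remain different, while "mi\u017Ft" and "mist" compare equal.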

[Alain]:

In French, in searching with search engines, we have a need for different
levels of imprecise searches:

1. there is a need for totally equal matches:
    é=é=é
   (ideally, independently of coding if multiple coding is used)

2. there is in some cases a need for loose matches that are case
   insensitive, as in English:
    É=é but not equal to e=E (if you search for the word "dû" (i.e. "a due"),
    you should be able to avoid retrieving "du", the latter being the 22nd
    most used word in French (the definite article), unless you really
    mean it).

3. there is a general need for loose matches that are both case and accent
   insensitive (AltaVista does this marvelously well, though not all search
   engines do; many are poor at it):
    the words "clé" or "clef" (or "CLÉ" or "CLEF", etc.) should
    be retrievable at will with the parameter "cle"

4. there is a need for searching while ignoring some special characters
   (whose list is ideally determined by user preferences [see also
   Canadian standard CAN/CSA Z243.4.1]): if you search for the word
   "vice-versa", which can also be spelled "vice versa" in French, the
   special character should be ignored. In practice this situation also
   occurs for family names:
    "L'heureux" is equivalent in Québec to "Lheureux" in telephone books
    (the same with my name (; , "Labonté" is equivalent to "La Bonté")
    [don't ask me why I do not put the space in my signature (; !!! ,
     some schemes simply do not retrieve me - that would be helpful if I
     were a criminal]. Search engines are poor at this so far... that
     should at least be an option.
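
The four kinds of search listed above could be approximated with one key function per level. This is a rough Python sketch, not how any particular search engine works: the function names and the `IGNORABLE` set are assumptions for illustration, and the spelling variant "clef" is not handled, since it is not an accent difference.

```python
import unicodedata


def exact_key(text):
    """Level 1: equal regardless of coding (precomposed or combining)."""
    return unicodedata.normalize("NFC", text)


def case_blind_key(text):
    """Level 2: case insensitive, accents still significant."""
    return unicodedata.normalize("NFC", text.casefold())


def case_accent_blind_key(text):
    """Level 3: case and accent insensitive (combining marks dropped)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical default list; ideally this comes from user preferences.
IGNORABLE = frozenset("-' ")


def loose_key(text, ignorable=IGNORABLE):
    """Level 4: additionally ignore the chosen special characters."""
    return "".join(ch for ch in case_accent_blind_key(text) if ch not in ignorable)


print(case_accent_blind_key("CLÉ"))
print(loose_key("La Bonté") == loose_key("Labonté"))
```

Two strings match at a given level when their keys at that level are equal, so an index could store one key per level and let the user pick the precision at query time.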

In this vein, ISO/IEC 14651 defines a notion of equivalence that is
systematic, the data being structured a bit like floating point numbers are
structured in machines, from the most significant elements to the least
(if you truncate a floating point number from the right, at the limit
leaving only the sign bit, you can still do comparisons: the sign bit
tells you whether a number is negative or positive, the exponent gives
you its order of magnitude, and the mantissa gives more precision, bit after bit):

level 1: base letter, whatever it is for a given language
level 2: diacritical marks applying to level 1
level 3: case or shape variant applying to level 1
level 4: special characters

And we define an API that allows you to ask for a comparison result whose
precision will be based on these levels... For details see the latest draft
or drafts to come.
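
To make the truncation idea concrete, here is a toy Python sketch of a four-level key. It is not the ISO/IEC 14651 table or API: `level_keys`, `compare`, and `SPECIALS` are hypothetical names, the ordering within each level is simply code point order, and language-specific rules are ignored. It only shows how comparing a truncated key gives a coarser but still consistent result.

```python
import unicodedata

SPECIALS = frozenset("-' ")  # hypothetical set of level-4 special characters


def level_keys(text):
    """Return a 4-tuple of keys, most significant level first."""
    base, marks, case, special = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch in SPECIALS:
            special.append((len(base), ch))    # level 4: where the special sits
        elif unicodedata.combining(ch):
            marks.append((len(base) - 1, ch))  # level 2: mark on the previous base letter
        else:
            base.append(ch.lower())            # level 1: base letter
            case.append(ch.isupper())          # level 3: case of that letter
    return (tuple(base), tuple(marks), tuple(case), tuple(special))


def compare(a, b, level):
    """Compare using only the first `level` levels; returns -1, 0, or 1."""
    ka, kb = level_keys(a)[:level], level_keys(b)[:level]
    return (ka > kb) - (ka < kb)


print(compare("clé", "CLE", 1))                # equal on base letters
print(compare("vice-versa", "vice versa", 3))  # equal until level 4 is considered
```

Asking for `level=1` corresponds to a case-and-accent-blind comparison, `level=3` to one that only ignores special characters, and `level=4` to the full-precision comparison.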

Alain LaBonté
Québec


