At 17:55 97-07-29 -0700, Kenneth Whistler wrote:
[Author not specified]:
>> Thanks for the information, but I don't understand why this is
>> important for a `case-folded' `loose comparison'.
[Kenneth]:
>Ah, 'loose comparison' -- there's the rub. Once you move beyond
>simply a case normalization, you're dealing with an instance of the
>more generic collation and equivalence problem. Essentially a
>loose comparison sets up equivalence classes whose members are
>treated as equal, even though they are distinct characters (or
>even sequences of characters).
>
>Given the fact that there are a few instances of alternate forms
>for lowercase letters, which share a single uppercase form, you
>could use uppercasing as a quick-and-dirty way of making an
>equivalence class for those lowercase letters and their shared
>uppercase form. This would also work, for instance, for the
>Greek sigma and final sigma. But it doesn't work if the alternative
>form happens to be the uppercase. Note the Greek alternative forms
>for the uppercase upsilon, which match only a single lowercase
>form for the upsilon.
>
>But this only opens the door to plenty of other oddball cases.
>Scan unidata2.txt for other instances of compatibility equivalences
>to Latin letters. Most of the letterlike symbols (U+2100 ff) cause
>a problem. Should the Angstrom symbol be equated to the letter a-ring
>or not? Uppercasing alone won't cover you there.
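A minimal sketch of the point above, assuming Python's built-in `str.upper()` as the uppercase transform (the helper name `upper_key` is made up for illustration). It shows uppercasing merging the two lowercase sigmas while leaving the upsilon-with-hook and Angstrom cases unmerged:

```python
def upper_key(s: str) -> str:
    """Quick-and-dirty loose key: simply uppercase the string."""
    return s.upper()

# Works: small sigma (U+03C3) and final sigma (U+03C2) share one
# uppercase form, so they land in the same class.
sigmas_match = upper_key("\u03c3") == upper_key("\u03c2")

# Does not work: U+03D2 (upsilon with hook symbol) is an alternative
# form on the uppercase side, so uppercasing never folds it onto the
# uppercase of small upsilon (U+03C5).
upsilons_match = upper_key("\u03d2") == upper_key("\u03c5")

# Does not work: the Angstrom sign (U+212B) is left unchanged by
# uppercasing, so it stays distinct from uppercased a-ring (U+00E5).
angstrom_match = upper_key("\u212b") == upper_key("\u00e5")

print(sigmas_match, upsilons_match, angstrom_match)
```

The results depend on the Unicode character data shipped with the Python version in use.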
[Author not specified]:
>> From a user standpoint, they are asking for a case blind comparison.
>> What characters do they want to be equal?
[Author not specified]:
>> For example, we have:
>>
>> U+0053 LATIN CAPITAL LETTER S
>> U+0073 LATIN SMALL LETTER S
>> U+017F LATIN SMALL LETTER LONG S
[Kenneth]:
>What about U+02E2 MODIFIER LETTER SMALL S ? Once you think you
>are going to be running into data like the long s, you'd be better
>off generalizing your concept of the equivalence class behind the
>loose comparison, rather than relying on an uppercase transform
>to get the right answer for you.
>
>A generalized approach allows the case-blind comparison to be a
>particular instance of the loose comparison, comparable to an
>accent-blind comparison or a case-and-accent-blind comparison, etc.
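One way to sketch such a generalized equivalence class, assuming Unicode compatibility decomposition (NFKD) and `str.casefold()` are acceptable stand-ins for the class definition. The function `loose_key` and its parameters are hypothetical, not part of any standard API:

```python
import unicodedata

def loose_key(s: str, ignore_case: bool = True,
              ignore_accents: bool = False) -> str:
    """Map a string to a representative of its equivalence class."""
    # Compatibility decomposition folds variants such as long s (U+017F)
    # and modifier letter small s (U+02E2) onto the base letter.
    s = unicodedata.normalize("NFKD", s)
    if ignore_accents:
        # Drop combining marks left behind by the decomposition.
        s = "".join(ch for ch in s if not unicodedata.combining(ch))
    if ignore_case:
        # Case folding may introduce new characters, so decompose again.
        s = unicodedata.normalize("NFKD", s.casefold())
    return unicodedata.normalize("NFC", s)

s_family = ["S", "s", "\u017f", "\u02e2"]
case_blind = {loose_key(c) for c in s_family}
case_sensitive = {loose_key(c, ignore_case=False) for c in s_family}
print(case_blind, case_sensitive)
```

With both options off except the decomposition, the small-letter variants still fall together while the capital stays apart, so case-blind and accent-blind comparison become switches on the same key rather than separate mechanisms.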
[Author not specified]:
>> 1. If I map to upper case, then these all map to U+0053, and are therefore
>> equal.
>>
>> 2. If I map to lower case, then U+0053 is equal to U+0073, but these
>> are different from U+017F.
>>
>> My intuitive understanding of case blind comparison agrees with 1,
>> and would be surprised by 2.
>>
>> (One could argue that I should have already mapped U+017F to U+0073
>> before ever considering case, but I am rather reluctant to do this,
>> because case sensitive comparison is often used for exact matching.)
[Kenneth]:
>I think you need to distinguish between:
>
> 1. exact binary matching ( a-acute != a + combining acute )
> 2. exact match on canonical equivalence ( a-acute = a + combining acute)
> 3. case sensitive match on equivalence classes for particular collation
> (where, for example, s = long s = modifier letter small s)
>
>--Ken Whistler
>
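The three kinds of matching listed above can be sketched as follows. This assumes NFD for canonical equivalence and uses NFKD as a rough stand-in for "equivalence classes for a particular collation"; a real collation would define its own classes. The function names are made up for illustration:

```python
import unicodedata

def binary_equal(a: str, b: str) -> bool:
    # 1. Exact match on the code point sequence.
    return a == b

def canonical_equal(a: str, b: str) -> bool:
    # 2. Exact match after canonical decomposition.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def class_equal(a: str, b: str) -> bool:
    # 3. Case-sensitive match on (assumed) compatibility classes.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

a_acute = "\u00e1"          # precomposed a-acute
a_plus_acute = "a\u0301"    # a followed by combining acute

print(binary_equal(a_acute, a_plus_acute))      # kind 1
print(canonical_equal(a_acute, a_plus_acute))   # kind 2
print(class_equal("s", "\u017f"), class_equal("s", "S"))  # kind 3
```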
>>
>> So, mapping to upper case seems to provide what I would expect users
>> to want (should any of our users ever have U+017F in their database
>> application).
[Alain]:
In French, in searching with search engines, we have a need for different
levels of imprecise searches:
1. there is a need for totally equal matches:
é=é=é
(ideally, independently of coding if multiple coding is used)
2. there is in some cases a need for loose matches that are case
insensitive, as in English:
é=É but not equal to e=E (if you search for the word "dû" (i.e. "due"),
you should be able to avoid retrieving "du", the latter being the 22nd most
used word in French (the definite article), unless you really mean
it).
3. there is a general need for loose matches that are both case and accent
insensitive (Altavista does this marvelously well, though not all search
engines do):
the words "clé" or "clef" (or "CLÉ" or "CLEF", etc.) should
be retrieved at will with parameter "cle"
4. there is a need for searching while ignoring some special characters
(whose list is ideally determined by user preferences [see also
Canadian standard CAN/CSA Z243.4.1]): if you search for the word
"vice-versa", which can also be spelled "vice versa" in French, the
special character should be ignored. In practice this situation also
occurs for family names:
"L'heureux" is equivalent in Québec to "Lheureux" in telephone books
(the same with my name (; , "Labonté" is equivalent to "La Bonté")
[don't ask me why I do not put the space in my signature (; !!! ,
some schemes simply do not retrieve me - that would be helpful if I
were a criminal]. Search engines are poor at this so far... that
should at least be an option.
In this vein, ISO/IEC 14651 defines a notion of equivalence that is
systematic, data being structured a bit like floating point numbers are
structured in machines for computing, from the most significant elements to
the least (if you truncate a floating point number from the right, at the
limit leaving only one bit for the sign, you can still do comparisons: the
sign bit tells you if a number is negative or positive, the exponent gives
you its order of magnitude, the mantissa gives more precision, bit after bit):
level 1: base letter, whatever it is for a given language
level 2: diacritical marks applying to level 1
level 3: case or shape variant applying to level 1
level 4: special characters
And we define an API that allows you to ask for a comparison result whose
precision will be based on these levels... For details see the latest draft
or drafts to come.
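As a rough illustration only, the four levels can be sketched as a tuple of keys compared up to a chosen precision. This is not the API of ISO/IEC 14651: the names `level_keys` and `equal_at` are hypothetical, the sketch tests equality rather than ordering, and it assumes that "special character" means anything that is neither alphanumeric nor a combining mark:

```python
import unicodedata

def level_keys(s: str) -> tuple:
    """Split a string into (base letters, accents, case, specials)."""
    base, accents, case, specials = [], [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch          # level 2: mark on previous letter
        elif ch.isalnum():
            base.append(ch.lower())        # level 1: base letter
            accents.append("")
            case.append(ch.isupper())      # level 3: case
        else:
            specials.append((len(base), ch))  # level 4: special characters
    return "".join(base), tuple(accents), tuple(case), tuple(specials)

def equal_at(a: str, b: str, level: int) -> bool:
    """Compare two strings using only the first `level` levels."""
    return level_keys(a)[:level] == level_keys(b)[:level]

print(equal_at("cl\u00e9", "CLE", 1))             # base letters only
print(equal_at("d\u00fb", "du", 2))               # accents now count
print(equal_at("vice-versa", "vice versa", 3))    # specials still ignored
print(equal_at("vice-versa", "vice versa", 4))    # specials now count
```

Truncating the tuple plays the role of truncating the floating point number in the analogy above: fewer levels give a coarser but still consistent comparison.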
Alain LaBonté
Québec