Re: Unicode Search Engines

From: Stefan Probst (stefan.probst@opticom.v-nam.net)
Date: Mon Jan 28 2002 - 10:17:33 EST


On Wed Jan 16 23:49:29 2002 +0400 Aman Chawla wrote:
>Are there any search engines at all at present which allow one to search
>sites encoded in UTF-8? If not, are there plans to build such search
>engines? For example, is Google going to implement such an engine?

I would like to add:
How do they handle normalization?
In Vietnamese, many characters can be represented in several different ways:
(1) fully precomposed (NFC)
(2) base character and modifier precomposed, tonal mark combining
(3) base character, then combining modifier, then combining tonal mark
(4) like (3), but with modifier and tonal mark in canonical order (NFD)
Do the search engines do any normalization before indexing a page?
Are queries normalized before the search is run?

In other words:
If the page is written in NFC, but the query is entered in NFD, will the
search find anything?
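
To make the problem concrete, here is a minimal sketch in Python, using only
the standard unicodedata module. It spells one Vietnamese letter in the four
ways listed above and compares a raw substring match with a match done after
normalizing both sides. The helper names (naive_find, normalized_find) and
the sample page text are my own, for illustration only; I am not claiming
that any existing search engine works this way.

```python
import unicodedata

# Four spellings of Vietnamese "e with circumflex and dot below",
# corresponding to cases (1)-(4) above.
FORMS = {
    1: "\u1ec7",         # fully precomposed (NFC)
    2: "\u00ea\u0323",   # e-circumflex precomposed + combining dot below
    3: "e\u0302\u0323",  # e + combining circumflex + combining dot below
    4: "e\u0323\u0302",  # as (3), marks in canonical order (NFD)
}

def normalize(text, form="NFC"):
    """Return text in the requested Unicode normalization form."""
    return unicodedata.normalize(form, text)

def naive_find(page, query):
    """Code-point substring match, no normalization (hypothetical helper)."""
    return query in page

def normalized_find(page, query):
    """Substring match after normalizing both sides to NFC (hypothetical helper)."""
    return normalize(query) in normalize(page)

# Page stored in NFC, query typed in NFD.
page = "Vi" + FORMS[1] + "t Nam"
query = "Vi" + FORMS[4] + "t"

print(naive_find(page, query))       # raw comparison of code points
print(normalized_find(page, query))  # comparison after NFC on both sides
```

With this sketch the raw comparison misses the match, while the normalized
one finds it. The same holds for any of the four spellings on either side,
as long as page and query are both brought to the same form.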

Rgds,
Stefan


