Re: UTF-8 syntax

From: Kenneth Whistler (kenw@sybase.com)
Date: Sat Jun 09 2001 - 14:17:20 EDT


Lars M. responded:

> |
> | A *lenient* search engine could also search for the irregular
> | pattern, i.e., it could consider <F0 90 80 80> and <ED A0 80 ED B0
> | 80> both to be matches for U-00010000, but that would slow it down.
>
> It seems to me that a lenient search engine, since it searches in an
> index it has built for itself, would turn the UTF-8 it indexes into a
> canonical form (say <F0 90 80 80>). It would then canonicalize any
> strings it is asked to search for into the same form, regardless of
> what form they arrived in.

Certainly a "lenient" search engine facility would normalize any
data that it stores in its indexes. After all, it is faced with
the same problem that the databases have -- if you don't normalize
when you build your indexes, you risk misinsertions and misses on
queries, because you have two *different* representations of the
same data.
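
To make that concrete, here is a minimal Python sketch (illustrative only;
the function name and the byte handling are my own, not anything a particular
engine actually does) of folding the irregular six-byte surrogate-pair form
into the regular four-byte form before the bytes ever reach the index:

    def canonicalize_utf8(data: bytes) -> bytes:
        """Collapse <ED A0 80 ED B0 80>-style pairs into the regular 4-byte form."""
        out = bytearray()
        i = 0
        while i < len(data):
            # A 3-byte sequence decoding to a high surrogate (U+D800..U+DBFF)
            # immediately followed by one decoding to a low surrogate (U+DC00..U+DFFF).
            if (i + 5 < len(data)
                    and data[i] == 0xED and 0xA0 <= data[i+1] <= 0xAF
                    and 0x80 <= data[i+2] <= 0xBF
                    and data[i+3] == 0xED and 0xB0 <= data[i+4] <= 0xBF
                    and 0x80 <= data[i+5] <= 0xBF):
                hi = 0xD800 + ((data[i+1] & 0x0F) << 6) + (data[i+2] & 0x3F)
                lo = 0xDC00 + ((data[i+4] & 0x0F) << 6) + (data[i+5] & 0x3F)
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out += chr(cp).encode("utf-8")      # regular 4-byte UTF-8
                i += 6
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    # Both representations of U-00010000 now index identically:
    assert canonicalize_utf8(bytes.fromhex("EDA080EDB080")) == bytes.fromhex("F0908080")
    assert canonicalize_utf8(bytes.fromhex("F0908080")) == bytes.fromhex("F0908080")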

My point, however, was not about the time visible to the end user
for the search engine facility to complete a query on its own
constructed index, but rather the time for the search engine to acquire
and process data when updating its indices. After all, in order to
do what you suggest, it has to "canonicalize" the data for comparison
before indexing it. So if the web crawler for the search engine
previously took 1879 megamoons to process X UTF-8 pages, it will now
take 1879 + k megamoons. Whether that is significant in the context
of all the other operations involved, I guess the web spider people
would have to answer. But I think it is undeniable that any increase
in the use of alternate representations of the same content must also
increase the burden of canonicalization and normalization that everyone
who uses the data has to bear.

Also, there is still the problem of representation of the
data, if the data itself is stored *in* the search engine
facility. Some of the search engines just keep pointers out
to the web pages, but others make text-only digests of the
contents, which are stored in the search engine databases. For
UTF-8 data, the question then arises as to whether or not
to canonicalize the two different forms of supplementary-character
representation in those text-only digests. You then risk a
mismatch between the representation of the text in the digest
and the representation of the text in the original -- with the
possibility that a *client* search will hit in one but miss in
the other.
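
Concretely (again just an illustration, with made-up bytes, in Python),
the mismatch can be as blunt as:

    # Original page, as crawled: U-00010000 in the irregular six-byte form.
    original = bytes.fromhex("EDA080EDB080")
    # Text-only digest, as stored by the engine: canonicalized to the regular form.
    digest = bytes.fromhex("F0908080")

    # A client query that arrives already in the regular four-byte form:
    query = "\U00010000".encode("utf-8")    # b'\xf0\x90\x80\x80'

    print(query in digest)      # True  -- hit in the digest
    print(query in original)    # False -- miss against the original page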

--Ken


