Re: UTF-8 syntax

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Sun Jun 10 2001 - 09:21:23 EDT

Next message: B: "Lenient search engine"
Previous message: Edward Cherlin: "UTF8 is not UTF-8 (was Re: UTF8 vs AL32UTF8)"
In reply to: Kenneth Whistler: "Re: UTF-8 syntax"

* Lars Marius Garshol
|
| It seems to me that a lenient search engine, since it searches in an
| index it has built for itself, would turn the UTF-8 it indexes into
| a canonical form (say <F0 90 80 80>). It would then canonicalize any
| strings it is asked to search for into the same form, regardless of
| what form they arrived in.
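
The canonicalization described above could be sketched roughly as follows in Python. This is only an illustration, not how any particular search engine does it: the function name canonicalize_utf8 is made up for the example, and it assumes the input is either proper UTF-8 or UTF-8S (supplementary characters encoded as two 3-byte surrogate sequences). An unpaired surrogate makes the final decode step raise UnicodeDecodeError.

```python
def canonicalize_utf8(data: bytes) -> bytes:
    """Return `data` re-encoded as canonical UTF-8.

    Hypothetical helper: accepts both UTF-8 and UTF-8S input.
    """
    # 'surrogatepass' lets the UTF-8 decoder accept 3-byte encoded
    # surrogates (the UTF-8S form) instead of rejecting them.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each surrogate pair into a
    # single supplementary code point.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # A strict encode now produces the 4-byte form for supplementary
    # characters.
    return text.encode("utf-8")


# U+10000 in UTF-8S form (surrogates D800 DC00, three bytes each)
utf8s_form = b"\xed\xa0\x80\xed\xb0\x80"
# U+10000 in canonical UTF-8 form
utf8_form = b"\xf0\x90\x80\x80"

print(canonicalize_utf8(utf8s_form) == utf8_form)
print(canonicalize_utf8(utf8_form) == utf8_form)
```

The same function would be applied both to text being indexed and to incoming query strings, so that both sides end up in the <F0 90 80 80> style form.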

* Kenneth Whistler
|
| My point, however, was not about the time visible to the end user
| for the search engine facility to complete a query on its own
| constructed index, but the time for the search engine to acquire
| and process data for updating its indices.

Ah, I see what you mean. You are of course right that normalization
affects this, but as you pointed out earlier, the introduction of
UTF-8S does not affect this at all. Data labelled as, and intended to
be, UTF-8, but that is actually UTF-8S, already exists out there. In
fact, there is also WML data out there that mixes ISO 8859-1 and
UTF-8.

A *really* good search engine should normalize its data to some
Unicode normalization form as well.
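
As a small sketch of what that could look like in Python, using the standard unicodedata module. The choice of NFC here is an assumption made for the example (any one form would do, as long as it is used consistently), and the function name normalize_for_index is hypothetical.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Normalize text before indexing it or searching for it.

    Hypothetical helper; NFC is an assumed choice of form.
    """
    return unicodedata.normalize("NFC", text)


# "é" as a single precomposed code point, and as "e" followed by a
# combining acute accent
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw strings differ, but their normalized forms are identical,
# so a query in one form finds text indexed in the other.
print(precomposed == decomposed)
print(normalize_for_index(precomposed) == normalize_for_index(decomposed))
```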

| Also, there is still the problem of representation of the data, if
| the data itself is stored *in* the search engine facility. Some of
| the search engines just keep pointers out to the web pages, but
| others make text-only digests of the contents, which are stored in
| the search engine databases. For UTF-8 data, the question then
| arises as to whether or not to canonicalize two different forms of
| supplementary character representations in those text-only
| digests. And you then risk a mismatch between the representation of
| text in the digest versus the representation of text in the original
| -- with the possibility then that a *client* search in one will hit
| but miss in the other.

This is true, but I don't think UTF-8S (or the internal workings of
search engines) affects this. The invalid UTF-8 data is already there,
and there is little the search engines can do to help dumb client
applications.

Not matching resources which dumb clients will miss isn't very
helpful, especially since there may also be smart clients out there.

--Lars M.

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT