Re: UTF-8 syntax

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Fri Jun 08 2001 - 18:08:36 EDT


* Kenneth Whistler
|
| The problem comes when someone, contrary to the conformance
| requirements of the standard, has emitted irregular UTF-8 for the
| character in question, so that instead of <F0 90 80 80>, the string
| has <ED A0 80 ED B0 80> in it.
|
| A *lenient* search engine could also search for the irregular
| pattern, i.e., it could consider <F0 90 80 80> and <ED A0 80 ED B0
| 80> both to be matches for U-00010000, but that would slow it down.

It seems to me that a lenient search engine, since it searches in an
index it has built for itself, would turn the UTF-8 it indexes into a
canonical form (say <F0 90 80 80>) at indexing time. It would then
canonicalize any strings it is asked to search for into the same form,
regardless of what form they arrived in, so the lookup itself would
not need to try both patterns.
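
To make the idea concrete, here is a rough sketch of such a
canonicalization step in Python. The function name canonicalize_utf8
is made up for illustration, and the sketch assumes the input is
otherwise well-formed UTF-8 in which the only irregularity is a
supplementary character written as two three-byte surrogate
sequences. It is not meant as a full validator.

```python
def canonicalize_utf8(data: bytes) -> bytes:
    """Rewrite each six-byte surrogate-pair sequence as the four-byte form.

    Hypothetical helper: assumes the input is otherwise well-formed UTF-8.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        chunk = data[i:i + 6]
        # ED A0..AF xx encodes a high surrogate (U+D800..U+DBFF),
        # ED B0..BF xx encodes a low surrogate (U+DC00..U+DFFF).
        if (len(chunk) == 6
                and chunk[0] == 0xED and 0xA0 <= chunk[1] <= 0xAF
                and 0x80 <= chunk[2] <= 0xBF
                and chunk[3] == 0xED and 0xB0 <= chunk[4] <= 0xBF
                and 0x80 <= chunk[5] <= 0xBF):
            high = 0xD000 | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
            low = 0xD000 | ((chunk[4] & 0x3F) << 6) | (chunk[5] & 0x3F)
            # Combine the pair into one supplementary code point.
            cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 6
        else:
            # Everything else is copied through unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


irregular = bytes.fromhex("EDA080EDB080")
regular = bytes.fromhex("F0908080")
print(canonicalize_utf8(irregular) == regular)
print(canonicalize_utf8(regular) == regular)
```

The engine would run the same function over documents when indexing
and over query strings when searching, so both sides end up in the
four-byte form before any comparison takes place.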

--Lars M.


