Re: UTF-8 syntax

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Fri Jun 08 2001 - 18:08:36 EDT

Next message: Kenneth Whistler: "Re: UTF-8 syntax"
Previous message: Ayers, Mike: "RE: UTF-8 syntax"
In reply to: Kenneth Whistler: "Re: UTF-8 syntax"
Next in thread: Peter_Constable@sil.org: "Re: UTF-8 syntax"

* Kenneth Whistler
|
| The problem comes when someone, contrary to the conformance
| requirements of the standard, has emitted irregular UTF-8 for the
| character in question, so that instead of <F0 90 80 80>, the string
| has <ED A0 80 ED B0 80> in it.
|
| A *lenient* search engine could also search for the irregular
| pattern, i.e., it could consider <F0 90 80 80> and <ED A0 80 ED B0
| 80> both to be matches for U-00010000, but that would slow it down.

It seems to me that a lenient search engine, since it searches in an
index it has built for itself, would turn the UTF-8 it indexes into a
canonical form (say <F0 90 80 80>). It would then canonicalize any
strings it is asked to search for into the same form, regardless of
what form they arrived in.
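
A minimal sketch of that canonicalization step, in Python. This is only an
illustration under stated assumptions, not how any particular search engine
does it: the function name canonicalize_utf8 and the helper _join are
hypothetical, and the input is assumed to be otherwise well-formed UTF-8 in
which the only irregularity is a supplementary character written as two
3-byte surrogate sequences (<ED A0..AF xx> followed by <ED B0..BF xx>).
Each such pair is rewritten into the regular 4-byte form; everything else
passes through untouched.

```python
import re

# A high surrogate (U+D800..U+DBFF) encoded as 3 bytes, immediately followed
# by a low surrogate (U+DC00..U+DFFF) encoded as 3 bytes.
_PAIR = re.compile(rb"\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]")


def _join(match):
    """Turn one 6-byte surrogate pair into the 4-byte UTF-8 form."""
    b = match.group(0)
    # Decode each 3-byte sequence; the lead byte ED contributes the bits 0xD000.
    hi = 0xD000 | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    lo = 0xD000 | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
    # Standard surrogate-pair arithmetic gives the supplementary code point.
    code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return chr(code_point).encode("utf-8")


def canonicalize_utf8(data: bytes) -> bytes:
    """Rewrite irregular surrogate-pair sequences; leave other bytes as-is."""
    return _PAIR.sub(_join, data)


# Both spellings of U-00010000 end up as the same index key.
irregular = bytes.fromhex("EDA080EDB080")
regular = bytes.fromhex("F0908080")
assert canonicalize_utf8(irregular) == canonicalize_utf8(regular) == regular
```

Applying the same function to documents at indexing time and to query
strings at search time means the lookup itself only ever has to compare one
form, so the per-search cost of leniency is limited to canonicalizing the
query.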

--Lars M.

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT