Re: UTF-8 syntax

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Mon Jun 11 2001 - 12:12:56 EDT


Jianping Yang wrote:
>
> [UTF-8S] will fix the following problem, for example:
> A search engine searching for the character U-00010000 in a UTF-8 string could
> not find it. But when the UTF-8 is converted into UTF-16, it can find it there,
> because <ED A0 80> and <ED B0 80> are converted into U-00010000 in UTF-16.

I am not sure I follow your argument. I am very much under the impression
that for any search in any Unicode stream which is complex enough to have to deal
with surrogates (less than a few ppm of real inputs), it is much more important
to pre-cook the data being searched, at least to bring it into some canonical form.
And once you are doing some form of pre-cooking, I am sure your search engine will
"eat" the irregular forms and normalize them to the correct form, so the search
for <F0 90 80 80> will match your <ED A0 80 ED B0 80> easily.
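
To make the pre-cooking step concrete, here is a minimal sketch in Python. The
language choice and the helper name normalize_utf8 are assumptions made for
illustration only; a real search engine would probably fold this into its
general normalization pass. It relies on the standard "surrogatepass" error
handler to accept the irregular 3-byte surrogate sequences, then round-trips
through UTF-16 so that each high/low pair is joined into one code point.

```python
def normalize_utf8(data: bytes) -> bytes:
    """Hypothetical helper: rewrite surrogate pairs that were encoded as two
    3-byte sequences into the regular 4-byte UTF-8 form."""
    # Accept <ED A0..BF xx> sequences instead of rejecting them; each one
    # becomes a lone surrogate code point in the resulting string.
    text = data.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16: adjacent high/low surrogates become a proper
    # surrogate pair, which decodes to a single supplementary code point.
    text = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    # Re-encode as regular UTF-8. An unpaired surrogate raises an error at
    # the UTF-16 decode step above, so it never reaches this point.
    return text.encode("utf-8")


irregular = bytes.fromhex("EDA080EDB080")  # <ED A0 80 ED B0 80>
regular = bytes.fromhex("F0908080")        # <F0 90 80 80>, i.e. U-00010000

haystack = b"abc" + irregular + b"def"
found_raw = regular in haystack                        # search without pre-cooking
found_cooked = regular in normalize_utf8(haystack)     # search after pre-cooking
```

In this sketch the raw byte search misses the character, while the same search
over the pre-cooked data finds it, with no change to the search key itself.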

YMMV.

Antoine


