Jianping Yang wrote:
>
> [UTF-8S] will fix the following problem for example:
> For a search engine searching for the character U-00010000 in a UTF-8 string, it
> could not be found. But when the UTF-8 is converted into UTF-16, it can be found
> there, because <ED A0 80> and <ED B0 80> are converted into U-00010000 in UTF-16.
I am not sure I follow your argument. I am under the strong impression that for
any search in a Unicode stream complex enough to have to deal with surrogates
(fewer than a few ppm of real inputs), it is much more important to pre-cook the
data being searched, at least to get it into some canonical form. And once you are
doing some form of pre-cooking, I am sure your search engine will "eat" the
irregular forms and normalize them to the correct form, so the search
for <F0 90 80 80> will easily match your <ED A0 80 ED B0 80>.
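
To make the "pre-cooking" idea concrete, here is a minimal Python sketch of one
way such a normalization pass might look. The function name
normalize_surrogate_pairs is purely illustrative and is not taken from any real
search engine; it assumes the input is otherwise ordinary UTF-8 and only rewrites
a high surrogate immediately followed by a low surrogate.

```python
def normalize_surrogate_pairs(data: bytes) -> bytes:
    """Rewrite 6-byte surrogate-pair encodings as 4-byte UTF-8 (illustrative)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        # A high surrogate (D800..DBFF) encodes as ED A0..AF 80..BF;
        # a low surrogate (DC00..DFFF) encodes as ED B0..BF 80..BF.
        if (i + 5 < n
                and data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xAF
                and 0x80 <= data[i + 2] <= 0xBF
                and data[i + 3] == 0xED and 0xB0 <= data[i + 4] <= 0xBF
                and 0x80 <= data[i + 5] <= 0xBF):
            # Decode each 3-byte sequence back to its 16-bit surrogate value.
            high = 0xD000 | ((data[i + 1] & 0x3F) << 6) | (data[i + 2] & 0x3F)
            low = 0xD000 | ((data[i + 4] & 0x3F) << 6) | (data[i + 5] & 0x3F)
            # Combine the pair into one supplementary code point.
            cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 6
        else:
            # Everything else is copied through unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


irregular = bytes.fromhex("EDA080EDB080")   # <ED A0 80 ED B0 80>
needle = bytes.fromhex("F0908080")          # <F0 90 80 80>, i.e. U-00010000
haystack = normalize_surrogate_pairs(b"abc" + irregular + b"def")
found = needle in haystack
```

After this pass, an ordinary byte-wise search for <F0 90 80 80> matches text that
originally contained <ED A0 80 ED B0 80>. Unpaired surrogates are left as they
are; what to do with those is a separate decision.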
YMMV.
Antoine
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT