RE: UTF-8 syntax

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Fri Jun 08 2001 - 20:50:14 EDT


> From: Jianping Yang [mailto:Jianping.Yang@oracle.com]

> The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80>
> can be in UTF-8 and your search for <F0 90 80 80> (which is Unicode
> scalar value U-00010000) cannot find it.

        This is good, because <ED A0 80> is U-0000d800 and <ED B0 80> is
U-0000dc00, so they should not match as U-00010000.
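
A minimal sketch of the point above, in Python. It assumes a lenient decoder:
Python's "surrogatepass" error handler stands in for one here, since a strict
UTF-8 decoder rejects <ED A0 80> outright. The helper name code_points is
only illustrative.

```python
# Byte sequences quoted in this thread.
HIGH = bytes.fromhex("EDA080")    # 3-byte form of U+D800
LOW = bytes.fromhex("EDB080")     # 3-byte form of U+DC00
SUPP = bytes.fromhex("F0908080")  # 4-byte form of U+10000

def code_points(data):
    """Decode leniently (assumption: surrogates are let through) and list code points."""
    text = data.decode("utf-8", errors="surrogatepass")
    return ["U+%04X" % ord(ch) for ch in text]

print(code_points(HIGH + LOW))  # two separate surrogate code points
print(code_points(SUPP))        # one supplementary code point
print(SUPP in HIGH + LOW)       # a byte-level search finds no match
```

The six-byte sequence and the four-byte sequence share no bytes, so a search
for one cannot turn up the other.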

> But, however, when the UTF-8 string converted into UTF-16,
> <ED A0 80> and <ED B0 80> will become <D800 DC00>, and you can find
> the same character by searching <D800 DC00> in UTF-16.

        So this solves the problem of not matching the wrong data? I'm even
more baffled than when we started!
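
For reference, the quoted conversion can be sketched as follows. This assumes
a converter that passes surrogates through on both sides, which is a guess at
the behavior being described, not a conforming conversion; the function name
utf8_to_utf16_lenient is hypothetical.

```python
def utf8_to_utf16_lenient(data):
    """Convert UTF-8 bytes to big-endian UTF-16, letting surrogates through.

    Assumption: neither the decoder nor the encoder rejects surrogate
    code points (Python's "surrogatepass" handler models that).
    """
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-be", errors="surrogatepass")

# <ED A0 80> <ED B0 80>: two 3-byte sequences, one per surrogate.
pair_form = utf8_to_utf16_lenient(bytes.fromhex("EDA080EDB080"))
# <F0 90 80 80>: the 4-byte form of U+10000.
supp_form = utf8_to_utf16_lenient(bytes.fromhex("F0908080"))

print(pair_form.hex(), supp_form.hex())
print(pair_form == supp_form)
```

Under that assumption, two different UTF-8 inputs come out as identical
UTF-16 code units, so a search that fails on the UTF-8 side succeeds after
conversion.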

> Unless this unpaired surrogate will be totally eliminated from UTF
> forms, this issue could be hit.

        I still don't know what the issue is.

/|/|ike


