Re: UTF-8 syntax

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Fri Jun 08 2001 - 20:26:54 EDT


The issue comes from unpaired surrogates: <ED A0 80> and <ED B0 80> can appear
in UTF-8, and your search for <F0 90 80 80> (the encoding of Unicode scalar
value U-00010000) will not find them. However, when the UTF-8 string is
converted into UTF-16, <ED A0 80> and <ED B0 80> become <D800 DC00>, and you
can find the same character by searching for <D800 DC00> in UTF-16.
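
As a rough illustration of the mismatch described above, here is a minimal
Python sketch. The variable names are made up for this example, and Python's
"surrogatepass" error handler stands in for a lenient converter that accepts
surrogates encoded as three-byte sequences; a strict UTF-8 decoder would
reject this input instead.

```python
# Hypothetical data for illustration: the two surrogates D800 and DC00,
# each encoded separately as a three-byte sequence.
three_byte_form = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# The four-byte UTF-8 encoding of the scalar value U-00010000.
four_byte_form = bytes([0xF0, 0x90, 0x80, 0x80])

# Byte-level search in the UTF-8 data: the four-byte pattern is not there.
found_in_utf8 = four_byte_form in three_byte_form

# Lenient conversion to UTF-16 (big-endian, no BOM). "surrogatepass" lets
# the surrogate code points through; this mimics a permissive converter.
as_text = three_byte_form.decode("utf-8", "surrogatepass")
utf16_from_three_byte = as_text.encode("utf-16-be", "surrogatepass")

# The same character converted from its four-byte UTF-8 form.
utf16_from_four_byte = four_byte_form.decode("utf-8").encode("utf-16-be")

# After conversion both inputs are <D800 DC00>, so the search now succeeds.
found_in_utf16 = utf16_from_four_byte in utf16_from_three_byte

print(found_in_utf8, utf16_from_three_byte.hex(), found_in_utf16)
```

The search result changes only because the converter accepted the three-byte
surrogate sequences; with a strict decoder the conversion step would fail
rather than produce <D800 DC00>.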

Unless unpaired surrogates are totally eliminated from the UTF forms, this
issue can still be hit.

Regards,
Jianping.

"Ayers, Mike" wrote:

> > From: Jianping Yang [mailto:Jianping.Yang@oracle.com]
>
> > This will fix the following problem, for example:
> > A search engine searching for the character U-00010000 in a
> > UTF-8 string cannot find it. But when the UTF-8 is converted
> > into UTF-16, it can find it there, because <ED A0 80> and
> > <ED B0 80> are converted into U-00010000 in UTF-16.
>
> (scratches head)
>
> HUH?
>
> To find U-00010000 in UTF-8, just search for <F0 90 80 80>[1] and
> find it. If you convert to UTF-16, you will need to search for something
> else[2], which will not be <00010000>[4], which is the UTF-32
> representation. So I fail to see how anything gets "fixed" here.
>
> I am getting more convinced as this goes along that there is not a
> single technical reason for UTF-8s.
>
> /|/|ike
>
> [1] - Byte conversion courtesy of Cima's UTF-8 Magic Pocket Encoder[3].
>
> [2] - I can't convert UTF-16 ... Marco? Please? How about a UTF-16 Magic
> Pocket Encoder?
>
> [3] - Which is NOT used to encode magic pockets.
>
> [4] - Magic Pocket Encoder not necessary for this one.
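
On footnote [2] in the quoted message: the UTF-16 form of a supplementary
code point can be worked out with a little arithmetic. The sketch below is
only an illustration; the function name utf16_surrogates is made up for this
example and does not come from any library.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary
    code point in the range U+10000..U+10FFFF.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x10000)
print("<%04X %04X>" % (high, low))
```

For U-00010000 the offset is zero, so the pair is simply the first high
surrogate followed by the first low surrogate.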




