Re: UTF-8 syntax

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 08 2001 - 21:21:09 EDT


Jianping said:

> The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80>

These are not *unpaired* surrogates -- they are *paired* surrogates.
Else your equating them to <F0 90 80 80> or U-00010000 would make no sense.
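
To make the pairing concrete, here is a minimal sketch of the arithmetic. The helper names (decode_3byte, combine_surrogates) are made up for illustration and come from no library; decode_3byte deliberately skips the surrogate check that a conformant decoder would apply.

```python
def decode_3byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8-style sequence, deliberately without
    rejecting surrogate code points (illustration only)."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the scalar value it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high = decode_3byte(bytes([0xED, 0xA0, 0x80]))  # 0xD800, a high surrogate
low = decode_3byte(bytes([0xED, 0xB0, 0x80]))   # 0xDC00, a low surrogate
scalar = combine_surrogates(high, low)          # 0x10000

print(f"U-{scalar:08X}")
```

A high surrogate followed by a low surrogate is exactly what combine_surrogates requires, which is why the two sequences taken together can be equated with U-00010000 at all.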

> can be
> in UTF-8

They cannot be in well-formed UTF-8. They can only be in ill-formed
UTF-8 of the irregular subtype.
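
As a rough check, a strict decoder can stand in for a well-formedness test. The sketch below uses Python's built-in "utf-8" codec in its default strict mode, on the assumption that it rejects surrogate code points encoded as three-byte sequences; is_well_formed_utf8 is a made-up helper name.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


irregular = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # surrogate pair, 3 bytes each
well_formed = bytes([0xF0, 0x90, 0x80, 0x80])            # U-00010000 as 4 bytes

print(is_well_formed_utf8(irregular))
print(is_well_formed_utf8(well_formed))
```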

> and your search for <F0 90 80 80> (which is Unicode scalar value
> U-00010000) cannot find it. However, when the UTF-8 string is converted into
> UTF-16, <ED A0 80> and <ED B0 80> will become
> <D800 DC00>, and you can find the same character by searching for <D800 DC00> in
> UTF-16.
>
> Unless this unpaired surrogate will be totally eliminated from UTF forms, this
> issue could be hit.

*PAIRED* surrogates.
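
The search scenario described in the quoted text can be reproduced with a lenient converter. This is only a sketch: it assumes Python's "surrogatepass" error handler as the stand-in for a converter that lets the irregular sequences through, and the variable names are illustrative.

```python
irregular = b"\xed\xa0\x80\xed\xb0\x80"          # the ill-formed (irregular) input
needle_utf8 = "\U00010000".encode("utf-8")       # <F0 90 80 80>

# A byte-level search in the UTF-8 data misses the character.
found_in_utf8 = needle_utf8 in irregular

# A lenient conversion passes each surrogate through unchanged, so the
# UTF-16 result holds the code units D800 DC00, i.e. a paired surrogate.
as_utf16 = (
    irregular.decode("utf-8", "surrogatepass")
    .encode("utf-16-be", "surrogatepass")
)
needle_utf16 = "\U00010000".encode("utf-16-be")  # <D800 DC00>
found_in_utf16 = needle_utf16 in as_utf16

print(found_in_utf8, found_in_utf16)
```

The UTF-16 search only succeeds because the two code units form a pair; neither one alone would match anything.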

--Ken


