Jianping said:
> The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80>
These are not *unpaired* surrogates -- they are *paired* surrogates.
Otherwise, equating them with <F0 90 80 80>, i.e. U-00010000, would make no sense.
> can be in UTF-8
They cannot be in well-formed UTF-8. They can only be in ill-formed
UTF-8 of the irregular subtype.
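
As a rough illustration of the well-formed versus ill-formed distinction, here is a minimal Python sketch. It assumes a strict decoder (Python 3's built-in "utf-8" codec with default error handling) is an acceptable stand-in for a conformance check. The helper name is_well_formed_utf8 is hypothetical and not part of any library.

```python
# Byte sequences quoted in the thread.
IRREGULAR = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # <ED A0 80> <ED B0 80>
WELL_FORMED = bytes([0xF0, 0x90, 0x80, 0x80])            # <F0 90 80 80>


def is_well_formed_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


irregular_ok = is_well_formed_utf8(IRREGULAR)      # False: surrogates are rejected
well_formed_ok = is_well_formed_utf8(WELL_FORMED)  # True: four-byte form of U-00010000
```

A decoder that behaves this way never produces the six-byte form in the first place, which is why the search problem only arises with ill-formed input.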
> and your search for <F0 90 80 80> (which is Unicode scalar value
> U-00010000) cannot find it. But however, when the UTF-8 string converted into
> UTF-16, <ED A0 80> and <ED B0 80> will become
> <D800 DC00>, and you can find the same character by searching <D800 DC00> in
> UTF-16.
>
> Unless this unpaired surrogate will be totally eliminated from UTF forms, this
> issue could be hit.
*PAIRED* surrogates.
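
To make the pairing concrete, the following sketch decodes the six-byte form leniently and combines the two surrogates with the usual UTF-16 arithmetic. It relies on Python 3's "surrogatepass" error handler purely for demonstration. The helper combine_surrogates and the variable names are illustrative assumptions, not anything from the thread.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Hypothetical helper: map a UTF-16 surrogate pair to its scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


six_byte_form = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
four_byte_form = bytes([0xF0, 0x90, 0x80, 0x80])

# "surrogatepass" decodes each three-byte sequence to a surrogate code point
# instead of raising, so the pair can be inspected.
high, low = (ord(c) for c in six_byte_form.decode("utf-8", errors="surrogatepass"))

scalar = combine_surrogates(high, low)  # 0x10000
same_character = scalar == ord(four_byte_form.decode("utf-8"))

# A raw byte search for the well-formed form misses the irregular form.
found_by_byte_search = four_byte_form in six_byte_form
```

Here high and low come out as D800 and DC00, a high surrogate followed by a low surrogate, so the two sequences together denote the same character as <F0 90 80 80> even though a byte-level search does not match.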
--Ken
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT