Re: UTF-8 syntax

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Fri Jun 08 2001 - 21:40:07 EDT


Ken,

Thanks. Your comment could close this argument against the UTF-8S syntax, since the
attack is now groundless: there is no need to encode <ED A0 80> and <ED B0 80> as
separate *paired* surrogates in UTF-8S, and they will always be converted into
0x10000 in UTF-32 or <F0 90 80 80> in UTF-8. So there is no longer any ambiguity
in UTF-8S.
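
The conversion described above can be sketched in Python as follows. This is only
an illustration under stated assumptions: the function name
decode_surrogate_pair_bytes is hypothetical and is not taken from any UTF-8S
proposal or library. It decodes the two three-byte sequences to the surrogate
code units D800 and DC00, then combines them into one scalar value.

```python
def decode_surrogate_pair_bytes(data: bytes) -> int:
    """Combine two 3-byte-encoded surrogates into one Unicode scalar value.

    Hypothetical helper for illustration only.
    """
    if len(data) != 6:
        raise ValueError("expected exactly six bytes")

    def decode3(b: bytes) -> int:
        # Three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
        if (b[0] & 0xF0) != 0xE0 or (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80:
            raise ValueError("not a three-byte sequence")
        return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

    high = decode3(data[:3])
    low = decode3(data[3:])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")

    # Standard UTF-16 surrogate pair arithmetic.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


scalar = decode_surrogate_pair_bytes(bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]))
print(hex(scalar))                        # 0x10000
print(chr(scalar).encode("utf-8").hex())  # f0908080
```
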

Regards,
Jianping.

Kenneth Whistler wrote:

> Jianping said:
>
> > The issue comes from unpaired surrogates such as <ED A0 80> and <ED B0 80>
>
> These are not *unpaired* surrogates -- they are *paired* surrogates.
> Else your equating them to <F0 90 80 80> or U-00010000 would make no sense.
>
> > can be
> > in UTF-8
>
> They cannot be in well-formed UTF-8. They can only be in ill-formed
> UTF-8 of the irregular subtype.
>
> > and your search for <F0 90 80 80> (which is Unicode scalar value
> > U-00010000) cannot find it. However, when the UTF-8 string is converted into
> > UTF-16, <ED A0 80> and <ED B0 80> will become
> > <D800 DC00>, and you can find the same character by searching for <D800 DC00>
> > in UTF-16.
> >
> > Unless this unpaired surrogate is totally eliminated from the UTF forms, this
> > issue can be hit.
>
> *PAIRED* surrogates.
>
> --Ken
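
The search mismatch discussed in the quoted text can be reproduced with a short
Python sketch. It assumes a lenient decoder that accepts the six-byte form;
Python's "surrogatepass" error handler stands in for such a decoder here, and the
variable names are illustrative only.

```python
# Six-byte form: each surrogate of the pair encoded as its own 3-byte sequence.
six_byte_form = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# Well-formed UTF-8 for U-00010000.
needle_utf8 = "\U00010000".encode("utf-8")  # F0 90 80 80

# A byte-level search in the UTF-8 data does not find the character.
found_in_utf8 = needle_utf8 in six_byte_form

# Convert the six-byte form to UTF-16 with a lenient decoder and encoder.
as_utf16 = six_byte_form.decode("utf-8", "surrogatepass").encode(
    "utf-16-be", "surrogatepass"
)
needle_utf16 = "\U00010000".encode("utf-16-be")  # D800 DC00

# After conversion, the same character is found.
found_in_utf16 = needle_utf16 in as_utf16

print(found_in_utf8, found_in_utf16)  # False True
```
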

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT