Re: UTF-8 syntax

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Thu Jun 07 2001 - 21:50:37 EDT


I don't see the point of this argument. UTF-8S maps exactly onto UTF-16 at the
code unit level: each UTF-16 code unit is mapped to one, two, or three bytes
in UTF-8S. So if you are saying that UTF-8S is ambiguous, the same would have
to apply to UTF-16, which does not make sense to me.

Regards,
Jianping.

Peter_Constable@sil.org wrote:

> On 06/07/2001 10:38:15 AM DougEwell2 wrote:
>
> >The ambiguity comes from the fact that, if I am using UTF-8s and I want to
> >represent the sequence of (invalid) scalar values <D800 DC00>, I must use
> >the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the
> >(valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely good
> >with tarot cards, you have no way, upon reverse-mapping the UTF-8s sequence
> ><ED A0 80 ED B0 80>, to know whether it is supposed to be mapped back to
> ><D800 DC00> or to <10000>.
>
> This brings out a good point. We can't yet say that UTF-8s is ambiguous
> since it is not formally defined. What this does highlight, though, is a
> gap in the proposal that must be addressed before it could be considered: a
> well-formed definition for UTF-8 must (by D29) provide a *unique*
> representation for *all* USVs, and unless the proposal is amended to remove
> D800 - DFFF from the codespace, it must be amended to provide some unique
> means of representing things like U+D800. What it is *not allowed* to be is
> ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean U+10000, then it
> must provide some sequence other than <ED A0 80> to mean U+D800.
>
> >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > I think we agree on this.
>
> Yes.
>
> >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > I tried to prove this above.
> >
> >Conclusion: Unicode should not define UTF-8s.
>
> I definitely agree with the idea you're getting at, but am just looking from
> a very slightly different angle. The conclusion does not necessarily follow
> because UTF-8s is only a proposal that potentially can be modified. If you
> say, "UTF-8s as has been currently proposed would be inconsistent with
> D29", then I agree. The proposed definition for UTF-8s *could* potentially
> be revised, though, and so the argument that a UTF-8s cannot be added to
> Unicode doesn't hold.
>
> UTF-8s definitely is not tenable as currently proposed, given the current
> definitions. I think we agree on that.
>
> - Peter
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
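
The collision discussed in the quoted message can be reproduced with a short
sketch. It assumes, as the thread does, that UTF-8s is applied to scalar values
by first converting them to UTF-16 and then encoding each code unit as up to
three bytes. The helper names (to_utf16, encode_unit, utf8s_from_scalars) are
hypothetical and exist only for this illustration.

```python
def to_utf16(scalar: int) -> list:
    """Convert one scalar value to UTF-16 code units (surrogates pass through)."""
    if scalar < 0x10000:
        return [scalar]
    v = scalar - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def encode_unit(unit: int) -> bytes:
    """Encode one 16-bit code unit with the UTF-8 bit layout (1 to 3 bytes)."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def utf8s_from_scalars(scalars) -> bytes:
    """Assumed UTF-8s path: scalar values -> UTF-16 code units -> bytes."""
    out = bytearray()
    for s in scalars:
        for u in to_utf16(s):
            out += encode_unit(u)
    return bytes(out)


supplementary = utf8s_from_scalars([0x10000])            # one valid scalar value
lone_surrogates = utf8s_from_scalars([0xD800, 0xDC00])   # two surrogate values

# Both inputs yield <ED A0 80 ED B0 80>, so a decoder that maps bytes back to
# scalar values cannot tell which input was intended.
print(supplementary.hex(), lone_surrogates.hex())
print(supplementary == lone_surrogates)  # True
```

Seen as a mapping from UTF-16 code units the encoding is reversible, which is
Jianping's point; seen as a mapping from scalar values it is not, which is the
conflict with D29 raised in the quoted message.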




