Re: UTF-8 syntax

From: Peter_Constable@sil.org
Date: Thu Jun 07 2001 - 13:22:46 EDT


On 06/07/2001 10:38:15 AM DougEwell2 wrote:

>The ambiguity comes from the fact that, if I am using UTF-8s and I want to
>represent the sequence of (invalid) scalar values <D800 DC00>, I must use
>the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the
>(valid) scalar value <10000>, I must *also* use the UTF-8s sequence
><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely
>good with tarot cards,
>you have no way, upon reverse-mapping the UTF-8s sequence <ED A0 80 ED B0
>80>, to know whether it is supposed to be mapped back to <D800 DC00> or to
><10000>.

This brings out a good point. We can't yet say that UTF-8s is ambiguous
since it is not formally defined. What this does highlight, though, is a
gap in the proposal that must be addressed before it could be considered: a
well-formed definition for UTF-8s must (by D29) provide a *unique*
representation for *all* USVs, and unless the proposal is amended to remove
D800 - DFFF from the codespace, it must be amended to provide some unique
means of representing things like U+D800. What it is *not allowed* to be is
ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean U+10000, then it
must provide some sequence other than <ED A0 80> to mean U+D800.
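
To make the collision concrete, here is a minimal Python sketch. It assumes
UTF-8s behaves the way this thread describes it: a supplementary code point
is first split into a UTF-16 surrogate pair, and each surrogate is then
written with the ordinary three-byte pattern. UTF-8s has no formal
definition, so the function names (encode_3byte, utf8s_encode) and the exact
behaviour are illustrative assumptions, not a specification.

```python
def encode_3byte(value):
    """Encode a 16-bit value (0x0800..0xFFFF) with the generic 3-byte pattern."""
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def utf8s_encode(values):
    """Hypothetical UTF-8s encoder: supplementary code points go out as two
    3-byte sequences (one per surrogate), and lone surrogates are NOT rejected."""
    out = bytearray()
    for v in values:
        if v < 0x80:
            out.append(v)
        elif v < 0x800:
            out += bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
        elif v < 0x10000:
            # Includes D800..DFFF: this is where the ambiguity creeps in.
            out += encode_3byte(v)
        else:
            offset = v - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            out += encode_3byte(high) + encode_3byte(low)
    return bytes(out)


as_scalar = utf8s_encode([0x10000])        # the valid scalar value U+10000
as_pair = utf8s_encode([0xD800, 0xDC00])   # two surrogate code points
print(as_scalar.hex(" "), "|", as_pair.hex(" "))
```

Both calls produce the bytes ED A0 80 ED B0 80, so a decoder seeing that
sequence cannot tell which input was meant. Standard UTF-8 avoids the
collision by encoding U+10000 as F0 90 80 80 and treating ED A0 80 as
ill-formed.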

>Premise: Unicode should not, and does not, define ambiguous UTFs.
> I think we agree on this.

Yes.

>Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> I tried to prove this above.
>
>Conclusion: Unicode should not define UTF-8s.

I definitely agree with the idea you're getting at, but am just looking at
it from a very slightly different angle. The conclusion does not necessarily
follow, because UTF-8s is only a proposal that can potentially be modified.
If you say, "UTF-8s as currently proposed would be inconsistent with D29",
then I agree. The proposed definition for UTF-8s *could* potentially be
revised, though, and so the argument that a UTF-8s cannot be added to
Unicode doesn't hold.

UTF-8s definitely is not tenable as currently proposed, given the current
definitions. I think we agree on that.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


