Re: UTF-8 syntax

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Fri Jun 08 2001 - 13:36:18 EDT


Ken,

Your analysis makes me believe even more strongly that we need UTF-8S, not only for
the binary order but also because of this ambiguity, which applies to both UTF-8S and
UTF-16. Since the proposed UTF-8S encoding is logically equivalent to UTF-16, the two
share the same property, which differs from that of UTF-8 and UTF-32. We need either
to fix UTF-16 so that it has the same property as UTF-8, or to define another form,
UTF-8S.

This would fix the following problem, for example:
A search engine looking for the character U-00010000 in a UTF-8 string that holds
<ED A0 80> <ED B0 80> will not find it. But once that UTF-8 is converted into UTF-16,
the search does find it, because <ED A0 80> and <ED B0 80> are converted into
<D800 DC00>, which is U-00010000 in UTF-16.
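
The search scenario above can be reproduced with a minimal Python sketch. It is only
an illustration under stated assumptions: Python's "surrogatepass" error handler
stands in for a lenient converter that accepts the irregular sequences, and the names
IRREGULAR, TARGET, found_in_utf8 and found_after_utf16 are hypothetical, not taken
from any product discussed in this thread.

```python
# Sketch only: "surrogatepass" plays the role of a lenient converter that
# accepts the irregular three-byte surrogate sequences.
IRREGULAR = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # <ED A0 80> <ED B0 80>
TARGET = "\U00010000"                                     # U-00010000


def found_in_utf8(data: bytes) -> bool:
    """Byte-level search for the regular UTF-8 form <F0 90 80 80>."""
    return TARGET.encode("utf-8") in data


def found_after_utf16(data: bytes) -> bool:
    """Convert the bytes to UTF-16 first, then search for the character."""
    code_points = data.decode("utf-8", errors="surrogatepass")
    utf16 = code_points.encode("utf-16-be", errors="surrogatepass")  # <D800 DC00>
    return TARGET in utf16.decode("utf-16-be")


print(found_in_utf8(IRREGULAR))      # False
print(found_after_utf16(IRREGULAR))  # True
```

The two results differ only because the conversion step pairs the two surrogate code
units; a strict UTF-8 decoder would reject the input instead.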

Regards,
Jianping.

Kenneth Whistler wrote:

> Jianping,
>
> > I don't see the point of this argument, as UTF-8S maps exactly to UTF-16 at the
> > level of the UTF-16 code unit, which means each UTF-16 code unit is mapped to one,
> > two, or three bytes in UTF-8S. So if you are saying there is an ambiguity in
> > UTF-8S, it should also apply to UTF-16, which does not make sense to me.
>
> I think the reason you are not following the argument that Doug and Peter
> have been presenting is that you are thinking in terms of a UTF-8s to
> UTF-16 converter, instead of thinking of the UTF's as they are defined
> in relation to scalar values. I.e.,
>
> UTF-8s <==> UTF-16
>
> instead of:
>        |==> UTF-8
> USV <==|==> UTF-16
>        |==> UTF-32
>
> Let me represent the Unicode Scalar Values (USV) in the 10646 *long*
> notation, so you can't confuse them with UTF-16 code unit values.
>
>               |==> <F0 90 80 80>
> U-00010000 <==|==> <D800 DC00>
>               |==> <00010000>
>
> That is the current situation for UTF-8, UTF-16, and UTF-32 as
> defined in the standard. You want to introduce a UTF-8s, which
> would put us in the following situation:
>
>               |==> <ED A0 80 ED B0 80>  UTF-8s
>               |==> <F0 90 80 80>        UTF-8
> U-00010000 <==|==> <D800 DC00>          UTF-16
>               |==> <00010000>           UTF-32
>
> Then for interworking, you would choose UTF-8s and UTF-16, since
> they have the identical binary ordering properties you want,
> and simplify your conversion and allocation handling as well.
>
> Now the conundrum that Doug and Peter are putting out to you is
> what do you do about the handling of isolated surrogates, which
> the standard also requires you to have a unique sequence for
> (if we consider them to be Unicode scalar values)? Thus:
>
>               |==> <ED A0 80>   UTF-8s
>               |==> <ED A0 80>   UTF-8
> U-0000D800 <==|==> <D800>       UTF-16
>               |==> <0000D800>   UTF-32
>
> Now let's put two of those isolated surrogate code points
> together in sequence:
>
>                             |==> <ED A0 80 ED B0 80>  UTF-8s
>                             |==> <ED A0 80 ED B0 80>  UTF-8
> <U-0000D800, U-0000DC00> <==|==> <D800 DC00>          UTF-16
>                             |==> <0000D800 0000DC00>  UTF-32
>
> Here, arguably, both UTF-32 and UTF-8 would maintain a unique,
> roundtrippable distinction between two isolated surrogate
> code points (i.e. Unicode scalar values) in sequence, and
> an ordinary supplemental code point. However, UTF-16 and
> UTF-8s would not. For UTF-16 this is understandable, since
> it was *designed* that way. It cannot really represent sequences of
> isolated surrogate code points, since it uses surrogate code
> *units* as part of the transformation. But by making UTF-8s
> mimic UTF-16, the problem gets worse. The UTF-8s sequence
> cannot distinguish the two either, so it fails
> the "unique sequence" requirement. But what is worse, the
> supposedly regular UTF-8s sequence cannot be distinguished from
> the *irregular* UTF-8 sequence for the same thing.
>
> Personally, I think there are other conundrums in the last two
> examples, as applied to UTF-16, that would lead me to prefer
> restricting "Unicode scalar value" itself to non-surrogate
> code points for the purposes of the definition of the UTF's,
> and then leave the last two examples to the error-handling
> exceptions. But in any case, the introduction of UTF-8s
> doesn't make the situation better for these definitions --
> it just creates more points of confusion and inconsistency
> in the definitions.
>
> --Ken
>
> > Peter_Constable@sil.org wrote:
> >
> > > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> > >
> > > >The ambiguity comes from the fact that, if I am using UTF-8s and I want
> > > >to represent the sequence of (invalid) scalar values <D800 DC00>, I must
> > > >use the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent
> > > >the (valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely
> > > >good with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> > > >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be mapped
> > > >back to <D800 DC00> or to <10000>.





This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT