Re: UTF-8 syntax

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 07 2001 - 23:22:15 EDT


Jianping,

> I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in
> UTF-16 code unit which means one UTF-16 code unit will be mapped to either one,
> two, or three bytes in UTF-8S. So if you are saying there is ambiguous in
> UTF-8S, it should also apply to UTF-16, which does not make sense to me.

I think the reason you are not following the argument that Doug and Peter
have been presenting is that you are thinking in terms of a UTF-8s to
UTF-16 converter, instead of thinking of the UTF's as they are defined
in relation to scalar values. I.e.,

                UTF-8s <==> UTF-16

instead of:
                        |==> UTF-8
                USV <==|==> UTF-16
                        |==> UTF-32

Let me represent the Unicode Scalar Values (USV) in the 10646 *long*
notation, so you can't confuse them with UTF-16 code unit values.

                               |==> <F0 90 80 80>
                U-00010000 <==|==> <D800 DC00>
                               |==> <00010000>

That is the current situation for UTF-8, UTF-16, and UTF-32 as
defined in the standard. You want to introduce a UTF-8s, which
would put us in the following situation:

                               |==> <ED A0 80 ED B0 80> UTF-8s
                               |==> <F0 90 80 80> UTF-8
                U-00010000 <==|==> <D800 DC00> UTF-16
                               |==> <00010000> UTF-32

Then for interworking, you would choose UTF-8s and UTF-16, since
they have the identical binary ordering properties you want,
and simplify your conversion and allocation handling as well.
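
To make the binary-ordering point concrete, here is a minimal Python
sketch. It is an illustration only, not part of the proposal: the helper
name utf8s_encode is hypothetical, and it assumes UTF-8s means "apply the
UTF-8 bit pattern to each UTF-16 code unit separately". It compares how a
BMP character above the surrogate range (U+E000) sorts against U-00010000
in each form.

```python
def utf8s_encode(text):
    """Hypothetical UTF-8s encoder (an assumption for illustration):
    emit one UTF-8-style byte sequence per UTF-16 code unit."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python apply the UTF-8 bit pattern
        # to a surrogate value instead of raising an error.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

bmp = "\ue000"        # BMP character above the surrogate range
supp = "\U00010000"   # first supplementary code point

# True means "bmp sorts before supp" in plain binary comparison.
utf16_order = bmp.encode("utf-16-be") < supp.encode("utf-16-be")
utf8_order = bmp.encode("utf-8") < supp.encode("utf-8")
utf8s_order = utf8s_encode(bmp) < utf8s_encode(supp)

print(utf16_order, utf8_order, utf8s_order)
```

Under these assumptions the UTF-16 and UTF-8s comparisons agree with
each other and disagree with UTF-8, which is the property being sought.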

Now the conundrum that Doug and Peter are putting to you is
what to do about isolated surrogates, for which the standard
also requires a unique sequence (if we consider them to be
Unicode scalar values). Thus:

                               |==> <ED A0 80> UTF-8s
                               |==> <ED A0 80> UTF-8
                U-0000D800 <==|==> <D800> UTF-16
                               |==> <0000D800> UTF-32

Now let's put two of those isolated surrogate code points
together in sequence:
                               |==> <ED A0 80 ED B0 80> UTF-8s
                               |==> <ED A0 80 ED B0 80> UTF-8
  <U-0000D800, U-0000DC00> <==|==> <D800 DC00> UTF-16
                               |==> <0000D800 0000DC00> UTF-32

Here, arguably, both UTF-32 and UTF-8 would maintain a unique,
roundtrippable distinction between two isolated surrogate
code points (i.e. Unicode scalar values) in sequence, and
an ordinary supplementary code point. However, UTF-16 and
UTF-8s would not. For UTF-16 this is understandable, since
it was *designed* that way. It cannot really represent sequences of
isolated surrogate code points, since it uses surrogate code
*units* as part of the transformation. But by making UTF-8s
mimic UTF-16, the problem gets worse. The UTF-8s sequence
cannot distinguish the two either, so it fails
the "unique sequence" requirement. But what is worse, the
supposedly regular UTF-8s sequence cannot be distinguished from
the *irregular* UTF-8 sequence for the same thing.
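
The collision can be checked mechanically. The following Python sketch
is only an illustration under stated assumptions: encode_all is a
hypothetical helper, "UTF-8" here means the UTF-8 bit pattern applied
per code point with surrogates allowed through (Python's "surrogatepass"
error handler), and "UTF-8s" means the same bit pattern applied per
UTF-16 code unit. It encodes the two isolated surrogates and U-00010000
in all four forms and lists the forms where the results coincide.

```python
def encode_all(text):
    """Encode text in four forms (hypothetical helper for illustration).
    "surrogatepass" permits isolated surrogate code points."""
    utf16 = text.encode("utf-16-be", "surrogatepass")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        # Assumed UTF-8s: UTF-8 bit pattern per UTF-16 code unit.
        "UTF-8s": b"".join(chr(u).encode("utf-8", "surrogatepass")
                           for u in units),
        # UTF-8 bit pattern per code point, surrogates allowed.
        "UTF-8": text.encode("utf-8", "surrogatepass"),
        "UTF-16": utf16,
        "UTF-32": text.encode("utf-32-be", "surrogatepass"),
    }

pair = encode_all("\ud800\udc00")          # two isolated surrogates
supplementary = encode_all("\U00010000")   # one supplementary code point

# Forms in which the two different inputs yield identical output.
collisions = sorted(name for name in pair
                    if pair[name] == supplementary[name])
print(collisions)

# The UTF-8s form of U-00010000 is byte-for-byte the UTF-8 form
# of the surrogate pair.
print(supplementary["UTF-8s"] == pair["UTF-8"])
```

With these definitions, only UTF-16 and UTF-8s collide, while UTF-8 and
UTF-32 keep the two inputs distinct.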

Personally, I think there are other conundrums in the last two
examples, as applied to UTF-16, that would lead me to prefer
restricting "Unicode scalar value" itself to non-surrogate
code points for the purposes of the definition of the UTF's,
and then leave the last two examples to the error-handling
exceptions. But in any case, the introduction of UTF-8s
doesn't make the situation better for these definitions --
it just creates more points of confusion and inconsistency
in the definitions.

--Ken

> Peter_Constable@sil.org wrote:
>
> > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> >
> > >The ambiguity comes from the fact that, if I am using UTF-8s and I want to
> > >represent the sequence of (invalid) scalar values <D800 DC00>, I must use
> > >the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the
> > >(valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely good
> > >with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> > >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be mapped
> > >back to <D800 DC00> or to <10000>.


