Re: And Visions of Sugar Plum UTF-8's Dance in Their Heads

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 12 2001 - 21:59:12 EDT


Jianping responded:

> Kenneth Whistler wrote:
>
> > Jianping wrote:
> >
> > > One thing that needs clarifying here is that there is no four-byte
> > > encoding in the UTF-8S proposal; four-byte encoding is illegal, not
> > > irregular. As everything in UTF-8S is a perfect match to UTF-16, any
> > > blame for this proposal also applies to the UTF-16 encoding form.
> >
> > Well after a couple months arguing about this, it is nice to have
> > this little detail drop into place. Perhaps in another couple
> > of months we could have a complete specification, and then
> > restart the argument.
>
> That is not true. From the very beginning, we stated that supplementary
> characters would be encoded as pairs of three bytes. That's why we have a
> new proposal; otherwise the proposal would be groundless, as it would be
> the same as UTF-8.

You're missing the point. To date the specification for UTF-8s
has not been complete. People have openly speculated on the
list (before my message under this new topic) about what the
status of a four-byte supplementary character representation in
UTF-8s would be.

You did state that supplementary characters are encoded in UTF-8s
as pairs of three bytes. Everybody understood that. That was the
well-formedness condition.

What you didn't state were the ill-formedness conditions. Was
valid UTF-8 considered allowable or not under UTF-8s? If it was
ill-formed, how was it to be interpreted?

What you finally stated today is that <F0 90 80 80> is flat-out
*illegal* in UTF-8s. That was a missing piece of the puzzle for anyone
trying to interpret what you are proposing.
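To make the contrast concrete, here is a minimal sketch of the two encodings of U+10000: standard UTF-8's single four-byte sequence versus the pair-of-three-bytes scheme described for UTF-8s (each UTF-16 surrogate code unit encoded as if it were a BMP code point). The helper name below is illustrative, not from the proposal.

```python
def utf8s_encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary code point (>= 0x10000) as two 3-byte
    sequences, one per UTF-16 surrogate code unit (the UTF-8s scheme
    as described in this thread)."""
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)       # high surrogate code unit
    low = 0xDC00 | (v & 0x3FF)      # low surrogate code unit
    out = bytearray()
    for unit in (high, low):        # each unit gets the 3-byte UTF-8 pattern
        out.append(0xE0 | (unit >> 12))
        out.append(0x80 | ((unit >> 6) & 0x3F))
        out.append(0x80 | (unit & 0x3F))
    return bytes(out)

print(chr(0x10000).encode("utf-8").hex(" "))         # f0 90 80 80
print(utf8s_encode_supplementary(0x10000).hex(" "))  # ed a0 80 ed b0 80
```

The first form is what UTF-8 defines; the second is the six-byte form that UTF-8s mandates, and the point above is that under UTF-8s the first form is now declared flat-out illegal.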

> > =============================================================
> >
> > What Jianping is saying now is that F0..F4 are illegal as
> > initiators in UTF-8s. (They are legal initiators in UTF-8.)
> >
> > Also, judging from his statement that "everything in UTF-8S is
> > perfect match to UTF-16", it is quite clear that UTF-8s does
> > *not* meet the Unicode Standard's definition of a UTF. To be
> > a UTF, it has to be a reversible transform of code points (or
> > Unicode scalar values -- there is some argument about which).
> >
>
> Does UTF-16 meet it? If UTF-16 does, UTF-8S should.

UTF-16 is a reversible transform of Unicode scalar values
to 16-bit code units.

By contrast, you have conceived and defined UTF-8s as a
reversible transform of UTF-16 16-bit code units to 8-bit
code units.

Not the same thing.
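The distinction can be sketched in a few lines. UTF-16 is a round-trippable map between Unicode scalar values and 16-bit code units (the function names here are illustrative):

```python
def utf16_encode(cp):
    """Map one Unicode scalar value to its UTF-16 code unit sequence."""
    if 0xD800 <= cp <= 0xDFFF or not (0 <= cp <= 0x10FFFF):
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit
    v = cp - 0x10000
    return [0xD800 | (v >> 10),          # high surrogate
            0xDC00 | (v & 0x3FF)]        # low surrogate

def utf16_decode(units):
    """Reverse the transform: code units back to one scalar value."""
    if len(units) == 1:
        return units[0]
    high, low = units
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

for cp in (0x41, 0xFFFD, 0x10000, 0x10FFFF):
    assert utf16_decode(utf16_encode(cp)) == cp
```

UTF-8s, as presented, is instead a map whose *domain* is the output of `utf16_encode` -- the 16-bit code units themselves -- which is why it is a code unit transform rather than a UTF in the standard's sense.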

Presumably UTF-8s could also be defined as a UTF, but
that isn't how you have been presenting or (apparently)
conceiving it.

> > But UTF-8s is designed and conceived as a CODE UNIT TRANSFORM
> > of UTF-16. (A "CUT", not a "UTF".)
> >
> > Basically, instead of starting with the code points, and deriving
> > the three UTF's, for UTF-8s you start with UTF-16 and derive
> > UTF-8s directly from it. (This is why I have been pounding on
> > the point that in order to understand the Oracle proposal, you
> > have to think in terms of the UTF-16 <==> UTF-8 convertors,
> > rather than in terms of the definitional UTF's.)
> >
>
> This is your perception.

Yep, and I stand by it.

> > In other words, while others are seeing:
> >
> > U-00010000 ==> ED A0 80 ED B0 80 in UTF-8s
> > ==> D800 DC00 in UTF-16
> >
> > Oracle is seeing:
> >
> > (D800)(DC00) <==> (ED A0 80)(ED B0 80)
> >
>
> That's also your perception, but not Oracle's, as we already support
> standard UTF-8 encoding in 9i.

How is Oracle's support for standard UTF-8 relevant to the conceptual
definition of UTF-8s?

> > and pointing out the tremendous simplicity of the fact that
> > a code point, err... code unit in UTF-16 always corresponds
> > *exactly* to a code point, errr... well a 1-, 2-, or 3- code
> > unit sequence in UTF-8s that always corresponds to a, umm..
> > character, well, sort of.
> >
>
> It is meaningless to examine each byte of the UTF-8S encoding, and this
> also applies to UTF-8. A code unit in UTF-8S should be a 1-, 2-, or 3-byte
> unit, and one or two code units will form one code point. If we still look
> at each byte of UTF-8S/UTF-8 and make random truncations, we will get
> meaningless bytes. The best practice here is to treat this 1-, 2-, or
> 3-byte encoding as one unit.

Beautifully put. I think you have just confirmed my argument.
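The truncation point above holds for any multi-byte form and is easy to demonstrate; a minimal sketch using standard UTF-8 (the same hazard applies to the 3-byte units of UTF-8s):

```python
# b"\xf0\x90\x80\x80" is U+10000 in standard UTF-8; cutting it after two
# bytes leaves a lead byte with too few trailing bytes -- an ill-formed
# sequence, exactly the "meaningless bytes" described above.
data = "\U00010000".encode("utf-8")      # f0 90 80 80
try:
    data[:2].decode("utf-8")
except UnicodeDecodeError:
    print("truncated mid-sequence: ill-formed")
```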

> > Now, perhaps Jianping will care to step in and clarify how UTF-32
> > fits in this picture. How, for example, are the irregular UTF-32
> > sequences in k and l above to be treated? As I have indicated?
> > (in which case, as Peter points out, there is an ambiguity in
> > the interpretation of any 6-byte UTF-8s representation) Or in
> > some other manner? And if so, how so?
> >
>
> Before answering these questions, just replace UTF-8S with UTF-16: can you
> give me good answers there? If there is any ambiguity for UTF-8S, the same
> applies to UTF-16.

Certainly I can give you a good answer. Return to Case I of my
original document.

In that formulation, the code point U-0000D800 and the code point
U-0000DC00 are not mapped by the UTF's at all. <U-0000D800, U-0000DC00>
is just an illegal representation.

The UTF-16 code unit sequence <D800 DC00> *always* corresponds to U+10000.
It also always corresponds to the UTF-32 code unit sequence <00010000>
and the UTF-8 code unit sequence <F0 90 80 80>.

No ambiguities, no mapping issues.
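These fixed correspondences can be checked directly; a small sketch using Python's built-in codecs (which implement the Case I behavior):

```python
# The UTF-16 code unit sequence <D800 DC00>, decoded, yields exactly the
# scalar value U+10000, which in turn has exactly one UTF-32 form and one
# UTF-8 form -- no ambiguity at any step.
s = b"\x00\xd8\x00\xdc".decode("utf-16-le")  # code units <D800 DC00>
assert ord(s) == 0x10000
print(s.encode("utf-32-be").hex())           # 00010000
print(s.encode("utf-8").hex(" "))            # f0 90 80 80
```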

Now please answer the question for UTF-32 under your formulation of
UTF-8s.

> I don't think there is ground here to argue this syntax or semantics issue,
> as UTF-8S meets the standard's requirements in exactly the same way as
> UTF-16. I think the key issue here is its benefit and its implications for
> implementors, and I think we should strike the best balance between the two.

No comment.

--Ken

>
> Regards,
> Jianping.



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT