Re: UTF-8 syntax

From: DougEwell2@cs.com
Date: Thu Jun 07 2001 - 11:38:15 EDT


In a message dated 2001-06-07 1:03:04 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> >But definition D29 says that a UTF must round-trip these invalid code
> >points, so we have no choice but to interpret them as <D800 DC00>. That is
> >why UTF-8s is ambiguous. The sequence <ED A0 80 ED B0 80> could be mapped
> >as either <D800 DC00>, because D29 says you have to allow for that, or as
> ><10000>, because that is the real intent.
>
> Well, I don't find round-trip implied in D29, but it does say that the
> mapping from the CCS to 8-bit code sequences is unique:

The (unnumbered) paragraph immediately following D29 is what I was referring
to:

<quote emphasis=original>
Because every Unicode coded character sequence maps to a unique sequence of
code values in a given UTF, a reverse mapping can be derived. Thus every UTF
supports *lossless round-trip transcoding*: mapping from any Unicode coded
character sequence S to a sequence of code values and back will produce S
again. To ensure that round-trip transcoding is possible, a UTF mapping
*must also* map invalid Unicode scalar values to unique code value sequences.
 These invalid scalar values include FFFE, FFFF, and unpaired surrogates.
</quote>

I assume this paragraph, although unnumbered, is intended to supplement and
clarify D29, and so in a sense it is part of D29. (What other reason could
it have for being there?)

(N.B. The list of invalid scalar values also includes *all* values of the
form U+xxFFFE and U+xxFFFF, as well as U+FDD0 through U+FDEF.)
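A minimal sketch of the list in the N.B. above, written as a range check. The
function name is_listed_invalid is hypothetical (it appears nowhere in the
standard), and the check deliberately leaves out unpaired surrogates, which
the quoted paragraph mentions separately.

```python
def is_listed_invalid(cp):
    """True for U+FDD0..U+FDEF and for any U+xxFFFE / U+xxFFFF (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The low 16 bits are FFFE or FFFF exactly when masking with FFFE yields FFFE.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# A few spot checks: the last two plane-1 values differ only in the low bits.
for cp in (0xFFFE, 0xFDD0, 0x1FFFF, 0x10000):
    print(f"U+{cp:04X}", is_listed_invalid(cp))
```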

> >But definition D29 says that a UTF must round-trip these invalid code
> >points, so we have no choice but to interpret them as <D800 DC00>. That is
> >why UTF-8s is ambiguous.
>
> Not so. All that D29 imposes on UTF-8s is that its mapping from codepoints
> to code units must be injective; i.e. there can be only one sequence for
> any given codepoint. It does not make any further requirements as to the
> nature of the mapping. Therefore, it is possible for UTF-8s to specify that
> the representation of U+10000 is <ED A0 80 ED B0 80> (or anything else, for
> that matter), but it can only specify one representation. D29 requires that
> any UTF-8s, if it were to be defined in Unicode, could *not* be ambiguous.

The ambiguity comes from the fact that, if I am using UTF-8s and I want to
represent the sequence of (invalid) scalar values <D800 DC00>, I must use the
UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the (valid)
scalar value <10000>, I must *also* use the UTF-8s sequence <ED A0 80 ED B0
80>. Unless you have a crystal ball or are extremely good with tarot cards,
you have no way, upon reverse-mapping the UTF-8s sequence <ED A0 80 ED B0
80>, to know whether it is supposed to be mapped back to <D800 DC00> or to
<10000>.
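
The collision can be reproduced with a rough sketch. It rests on two
assumptions, neither of which comes from a published specification: that
UTF-8s encodes a supplementary value as its two surrogates, each in the
ordinary three-byte UTF-8 bit pattern, and that it also encodes unpaired
surrogates with that same pattern, as the paragraph following D29 would
require. The names utf8_3byte and utf8s_encode are hypothetical.

```python
def utf8_3byte(value):
    """Apply the generic three-byte UTF-8 bit pattern to a value in 0x0800..0xFFFF."""
    return bytes([0xE0 | (value >> 12),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])

def utf8s_encode(scalars):
    """Hypothetical UTF-8s encoder (an assumption, not a published algorithm)."""
    out = bytearray()
    for v in scalars:
        if v >= 0x10000:
            # Supplementary value: split into surrogates, then encode each one.
            v -= 0x10000
            out += utf8_3byte(0xD800 | (v >> 10))
            out += utf8_3byte(0xDC00 | (v & 0x3FF))
        elif v >= 0x800:
            # Includes surrogate values passed in directly.
            out += utf8_3byte(v)
        elif v >= 0x80:
            out += bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
        else:
            out.append(v)
    return bytes(out)

pair = utf8s_encode([0xD800, 0xDC00])   # two (invalid) surrogate scalar values
supp = utf8s_encode([0x10000])          # one (valid) supplementary scalar value
print(pair.hex(), supp.hex())           # both are eda080edb080
```

Two different inputs produce the same six bytes, so under these assumptions
no decoder can recover which one was meant from the bytes alone.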

I mean, yes, you do have a way. The great *likelihood* is that you want to
represent the valid Unicode code point, not a sequence of two lonely
surrogate code points that just coincidentally happen to appear together.
But this heuristic does not satisfy the requirement of the paragraph following
D29 that a UTF must map code points to code units unambiguously.

> - Contrary to Doug, a UTF-8s could not be made ambiguous if it were defined
> in Unicode. No argument on this basis against a proposed UTF-8s has been
> made.

Premise: Unicode should not, and does not, define ambiguous UTFs.
    I think we agree on this.

Premise: UTF-8s is ambiguous in its handling of surrogate code points.
    I tried to prove this above.

Conclusion: Unicode should not define UTF-8s.

-Doug Ewell
 Fullerton, California


