Re: UTF-8 syntax

From: Peter_Constable@sil.org
Date: Thu Jun 07 2001 - 03:32:45 EDT


[ copying to unicore, as I think there are relevant concerns here regarding
poor handling of the definitions in TUS and, more importantly, some problems
with the definitions themselves ]

On 06/07/2001 12:34:49 AM DougEwell2 wrote:

>But definition D29 says that a UTF must round-trip these invalid code
>points, so we have no choice but to interpret them as <D800 DC00>. That is
>why UTF-8s is ambiguous. The sequence <ED A0 80 ED B0 80> could be mapped
>as either <D800 DC00>, because D29 says you have to allow for that, or as
><10000>, because that is the real intent.

Well, I don't find a round-trip requirement implied in D29, but it does say
that the mapping from the CCS to 8-bit code sequences is unique:

<quote>
D29 A [Unicode encoding form] transforms each Unicode scalar value into a
unique sequence of code values.
</quote>

Thus, U+10000 can be encoded in *only one way* in UTF-8 (or in UTF-8s, or
any other encoding form): D29 simply does not allow ambiguity.

Also, D36 indicates that codepoints are encoded into code units as specified
in Table 3.1:

<quote>
D36 UTF-8 is the [encoding form] that serializes a Unicode scalar value as
a sequence of one to four bytes, as specified in Table 3.1.
</quote>

That table clearly requires that U+10000 be encoded as <F0 90 80 80> (and
D29 tells us that this can be the *only* way, not to mention that D36
clearly limits UTF-8 to sequences of at most four bytes). Also, since no
limitation is placed on the range of code points for which this is defined,
and since D800 is not excluded from the codespace, U+D800 must be encoded as
<ED A0 80>, and so <ED A0 80> must be interpreted as U+D800. Similarly in
the case of DC00. But the crucial point is that ***this is talking about the
codepoint U+D800 in the Unicode coded character set***, and NOT about a
UTF-16 code unit. Ditto for DC00. Thus, the definitions as they pertain to
UTF-8 simply ***do NOT make ANY allowance*** for <ED A0 80 ED B0 80> to be
interpreted as U+10000. That is pure *fantasy* that some have tried to
conventionalise. (Cf. the comments on UTF-8s below.)
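To make the Table 3.1 arithmetic concrete, here is a minimal sketch in
Python (purely illustrative; the function name utf8_bytes is my own) of the
bit distribution that the table specifies. Note that nothing in the
distribution itself singles out the surrogate range:

    def utf8_bytes(scalar):
        # Bit distribution per Table 3.1: the scalar value's bits are
        # spread across one to four bytes according to its magnitude.
        if scalar < 0x80:
            return [scalar]
        if scalar < 0x800:
            return [0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)]
        if scalar < 0x10000:
            # U+D800 falls here, since D800 is not excluded from the
            # codespace.
            return [0xE0 | (scalar >> 12),
                    0x80 | ((scalar >> 6) & 0x3F),
                    0x80 | (scalar & 0x3F)]
        return [0xF0 | (scalar >> 18),
                0x80 | ((scalar >> 12) & 0x3F),
                0x80 | ((scalar >> 6) & 0x3F),
                0x80 | (scalar & 0x3F)]

    # utf8_bytes(0xD800)  -> [0xED, 0xA0, 0x80]
    # utf8_bytes(0x10000) -> [0xF0, 0x90, 0x80, 0x80]

Each scalar value comes out as exactly one sequence, which is precisely the
uniqueness that D29 demands.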

>Note that UTF-8 is not ambiguous in this regard, unless you permit these
>so-called "lenient" processors, which I thought were made non-conformant
>by the Corrigendum.

The *definitions* before the Corrigendum were ambiguous as to whether the
unique representation of e.g. U+0020 was supposed to be <20> or <C0 A0> etc.
The prose note that followed stated "the shortest form that can represent
those values shall be used", but that wasn't clearly part of the definition
proper. The Corrigendum left no doubt.

However, the definitions before the Corrigendum were ***not in any way***
ambiguous with regard to supplementary-plane characters. The ***only***
representation sanctioned by those definitions was the 4-byte one. The
Corrigendum did not change that one iota.
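A decoder can detect such non-shortest ("overlong") forms mechanically. Here
is an illustrative check of my own (reusing the utf8_bytes sketch above):
reassemble the payload bits without any checking, then compare against the
shortest form:

    def naive_scalar(byte_seq):
        # Reassemble the payload bits from the lead and trailing bytes,
        # *without* checking for shortest form.
        lead = byte_seq[0]
        if lead < 0x80:
            scalar = lead
        elif lead < 0xE0:
            scalar = lead & 0x1F
        elif lead < 0xF0:
            scalar = lead & 0x0F
        else:
            scalar = lead & 0x07
        for b in byte_seq[1:]:
            scalar = (scalar << 6) | (b & 0x3F)
        return scalar

    def is_overlong(byte_seq):
        # Overlong if longer than the shortest form that Table 3.1
        # assigns to the same scalar value.
        return len(byte_seq) > len(utf8_bytes(naive_scalar(byte_seq)))

    # naive_scalar([0xC0, 0xA0]) -> 0x20, and utf8_bytes(0x20) == [0x20],
    # so is_overlong([0xC0, 0xA0]) -> True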

As I mentioned in an earlier message, the definitions in Unicode are less
explicit when it comes to interpretation than they are with regard to
encoding. For example, D31 says that illegal code value sequences are those
"that cannot be mapped back to any sequence of Unicode scalar values". The
problem with this is that the meaning of "mapped back" is nowhere defined.
The result is that "illegal code value sequence" and "irregular code value
sequence" are, strictly speaking, not well defined. We are left to infer
that "mapped back" means the exact inverse of the mapping defined (in the
case of UTF-8) in D36. But note: making that inference assumes that the
mapping in D36 is invertible, which requires that it be injective, i.e.
one-to-one, exactly as D29 requires. This reinforces that a 6-byte sequence
cannot be used to represent a supplementary-plane character. But note the
corollary: 6-byte sequences cannot be mapped back to a Unicode scalar value,
and therefore *are illegal*. This is in spite of the fact that D36(c) in the
Corrigendum defines these as "irregular", which itself is defined in D32 as
"ill-formed [but] not illegal". Thus, if we make the inference regarding the
meaning of "mapped back" in D31 that seems likely, then D36(c) is logically
inconsistent with D32.

Again, this entire business of thinking that a 6-byte UTF-8 sequence can
mean a supplementary-plane character is absolute hogwash that treats the
definitions in an incredibly sloppy manner. The only way to maintain the
notion that <ED A0 80 ED B0 80> means U+10000 is to make a different
inference regarding the meaning of "mapped back": something other than the
inverse of the injective mapping in Table 3.1. But that assumption is
absolutely wide open: we would be left with no limitation as to what "mapped
back" actually means. Thus, we could map any choice of <A0> or <97> or
<C0 80> back to U+10000 if we wanted to, which is absolutely ludicrous.
Having ruled out that alternative, the notion that a 6-byte sequence can be
mapped back into a supplementary-plane character, or any other character, is
equally ludicrous.
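To make the point concrete, here is what the strict inverse of the Table 3.1
mapping looks like as a decoder (again my own sketch, reusing utf8_bytes
from above). Each lead byte determines the length of exactly one sequence,
so <ED A0 80 ED B0 80> can only come out as the two codepoints U+D800,
U+DC00, never as U+10000:

    def decode_strict(byte_list):
        # Strict inverse of the Table 3.1 mapping: parse one sequence
        # per lead byte, reassemble the scalar value, and reject any
        # sequence that is not the unique encoding of that scalar.
        scalars, i = [], 0
        while i < len(byte_list):
            b = byte_list[i]
            if b < 0x80:
                length, scalar = 1, b
            elif 0xC0 <= b < 0xE0:
                length, scalar = 2, b & 0x1F
            elif 0xE0 <= b < 0xF0:
                length, scalar = 3, b & 0x0F
            elif 0xF0 <= b < 0xF8:
                length, scalar = 4, b & 0x07
            else:
                raise ValueError("illegal lead byte")
            for trail in byte_list[i + 1:i + length]:
                scalar = (scalar << 6) | (trail & 0x3F)
            if utf8_bytes(scalar) != byte_list[i:i + length]:
                raise ValueError("not the unique encoding of any scalar")
            scalars.append(scalar)
            i += length
        return scalars

    # decode_strict([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
    #   -> [0xD800, 0xDC00]  (never [0x10000])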

> The sequence <ED A0 80 ED B0 80> is every bit as much
>"overlong" as is <C0 80>.

Absolutely.

So much for UTF-8. Now, coming back to UTF-8s:

>But definition D29 says that a UTF must round-trip these invalid code
>points, so we have no choice but to interpret them as <D800 DC00>. That is
>why UTF-8s is ambiguous.

Not so. All that D29 imposes on UTF-8s is that its mapping from codepoints
to code units must be injective; i.e. there can be only one sequence for any
given codepoint. It does not make any further requirements as to the nature
of the mapping. Therefore, it is possible for UTF-8s to specify that the
representation of U+10000 is <ED A0 80 ED B0 80> (or anything else, for that
matter), but it can specify only one representation. D29 requires that any
UTF-8s, if it were to be defined in Unicode, could *not* be ambiguous.
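For illustration only, here is what such a (hypothetical) UTF-8s mapping
might look like, as a Python sketch of my own reusing utf8_bytes from above:
supplementary-plane codepoints are first split into a UTF-16-style surrogate
pair, and each surrogate then gets the 3-byte treatment. The mapping is
still injective, so it satisfies D29; it is simply a different encoding form
from UTF-8:

    def utf8s_bytes(codepoint):
        # Hypothetical UTF-8s: BMP codepoints encode as in UTF-8, but
        # supplementary-plane codepoints go through a surrogate pair,
        # yielding two 3-byte sequences (6 bytes in total).
        if codepoint < 0x10000:
            return utf8_bytes(codepoint)
        v = codepoint - 0x10000
        high = 0xD800 | (v >> 10)     # high surrogate
        low = 0xDC00 | (v & 0x3FF)    # low surrogate
        return utf8_bytes(high) + utf8_bytes(low)

    # utf8s_bytes(0x10000) -> [0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80],
    # and that is the *only* representation this mapping assigns to
    # U+10000.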

To the extent that anybody is making use of 6-byte sequences to represent
supplementary-plane characters today, they are already implementing the
non-standard UTF-8s. Let's be clear on one thing: they are not implementing
a variation on UTF-8. (The definitions for UTF-8 do not allow for these
variations, as I demonstrate above.)

But that does *not* require UTC to reify this as a standard. There are a
whole lot of people out there using non-standard character encodings. For
example, there's a bunch of users out there with data in which the code unit
<80> represents a Devanagari DDHA (or something comparable). Does that mean
that, if a group of users can afford $10,000 and create an organisation to
represent them, they can then become full members of the Consortium, attend
UTC meetings and start convincing the committee that there should be a
UTF-8x in which the code unit <80> represents U+0922? Of course not. The
mere existence of implementations should not compel UTC to create a new
standard encoding form.

The only things that should compel them to do so are (i) implementations so
widespread as to be a de facto standard, such that ignoring them would
amount to becoming irrelevant, or (ii) compelling technical reasons why it
would be A Good Thing. In the case of UTF-8s, the technical reasons for UTC
to make it a standard encoding form have not been shown to be compelling. On
the contrary, the reasons *not* to do so are several, and at least as
compelling if not much more so. As for existing implementations of UTF-8s,
they are decidedly not widespread at this time. Moreover, it will make more
sense for UTC to oppose their spread precisely because of the lack of
compelling technical reasons for UTF-8s and, more to the point, because of
the technical reasons against it. Key among them is the point Rick McGowan
has made: a multiplicity of encoding forms does not benefit us, but only
recreates the confusion from which Unicode was intended to extricate us.

In summary:

- As Doug pointed out, the definitions require that e.g. the code unit
sequence <ED A0 80 ED B0 80> be interpreted as the sequence of Unicode
codepoints < U+D800, U+DC00 >, and that it cannot be interpreted as U+10000.

- The definitions in Unicode have never been ambiguous as to the
representation of supplementary-plane characters in UTF-8 and have never
allowed for 6-byte sequences. Thus, the entire notion that <ED A0 80 ED B0
80> can be construed as a UTF-8 sequence meaning U+10000 is grounded in a
disregard for and violation of the definitions of the Unicode standard. I
maintain that it never should have been, and should not now be, tolerated.

- D36(c) is logically inconsistent with D32. (Either that, or the
definitions make the rules of UTF-8 encoding tight but leave the
interpretation wide open.)

- Contrary to Doug, a UTF-8s could not be ambiguous if it were defined in
Unicode. Thus, no argument against a proposed UTF-8s can be made on this
basis.

- There are (presumably) some existing implementations using a private
encoding form, UTF-8s (the 6-byte "non-shortest" way of representing
supplementary-plane characters, which some have considered deviant UTF-8 but
which by the definitions cannot in any way be considered UTF-8). The
existence of such implementations does not alone constitute a reason for UTC
to sanction UTF-8s.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


