RE: UTF-8 syntax

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Jun 07 2001 - 06:17:37 EDT


Peter Constable wrote:
> As I mentioned in an earlier message, the definitions in
> Unicode are less explicit when it comes to interpretation
> than they are with regard to encoding.

Perhaps the interpretation rules seem unclear NOW because this discussion
has been mixing up UTF-8 and UTF-8s.

The reason for this confusion, I think, is that someone tried to say that
UTF-8s will change nothing because the premises for it are already in
existing UTF-8.

This was such good news that, for a moment, we all wanted to believe it.
A pity that it was not true.

> We are left to infer that "mapped back" means the exact inverse
> of the mapping defined (in the case of UTF-8) in D36. But note:
> making that inference assumes that the mapping in D36 is invertible.
> That requires that the mapping in D36 is injective; i.e. one-to-one,
> as D29 requires. This reinforces that a 6-byte sequence cannot be
> used to represent a supplementary plane character.
> But not the corollary: 6-byte sequences cannot be mapped back to
> a Unicode scalar value, and therefore *are illegal*.

Perhaps the whole matter becomes much simpler if you put it in these terms:
6-byte sequences NEVER existed in UTF-8!

So, <ED A0 80 ED B0 80> is NOT a six-byte sequence: it is two adjacent
THREE-byte sequences: <ED A0 80> and <ED B0 80>, and the meaning of these
sequences is already clear enough by the rules (Table 3.1B): the first one
means U+D800 and the second one means U+DC00.

That's why conformance rules say nothing explicit about "6-byte sequences".
It is for the same reason that they say nothing explicit about a "4-byte
sequence" like <63 69 61 6F>...
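To make the point concrete, here is a minimal sketch (my own illustration,
not anything from the standard's text) of a decoder that only looks at the
bit patterns of the lead and trail bytes. The function names
sequence_length and split_sequences are hypothetical, and the sketch
deliberately omits other checks such as rejecting overlong forms. It shows
that such a decoder never sees a "six-byte sequence": it sees two
three-byte sequences, just as it sees four one-byte sequences in
<63 69 61 6F>.

```python
from typing import List, Tuple


def sequence_length(lead: int) -> int:
    """Length implied by the lead byte's bit pattern (1 to 4 bytes)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if 0xC0 <= lead < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4  # 11110xxx
    raise ValueError(f"invalid lead byte {lead:#04x}")


def split_sequences(data: bytes) -> List[Tuple[bytes, int]]:
    """Split data into (sequence, value) pairs, one per lead byte.

    Assumption: bit-pattern decoding only; overlong forms are not rejected.
    """
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        seq = data[i:i + n]
        # Every trail byte must look like 10xxxxxx.
        if len(seq) != n or any(b & 0xC0 != 0x80 for b in seq[1:]):
            raise ValueError("truncated or malformed sequence")
        if n == 1:
            value = seq[0]
        else:
            # Keep the payload bits of the lead byte, then append
            # six payload bits from each trail byte.
            value = seq[0] & (0xFF >> (n + 1))
            for b in seq[1:]:
                value = (value << 6) | (b & 0x3F)
        out.append((seq, value))
        i += n
    return out


if __name__ == "__main__":
    for seq, value in split_sequences(bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])):
        print(seq.hex(" ").upper(), f"U+{value:04X}")
    for seq, value in split_sequences(bytes([0x63, 0x69, 0x61, 0x6F])):
        print(seq.hex(" ").upper(), f"U+{value:04X}")
```

Nothing in this loop ever groups the two surrogate values together; any
pairing would have to be an extra step added on top.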

Possible lesson: if UTF-8s is approved, six-byte sequences will start to
exist in *that* UTF only. Implementers who decide to support UTF-8s will
have to write a new parser for it: they will not be able to use their
existing UTF-8 parser and pretend that there is no difference.
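As a rough illustration of the extra work, here is a sketch of the pairing
step that a UTF-8s parser would need and a plain UTF-8 parser does not
have. It assumes, as I understand the proposal, that a supplementary
character is carried as a high surrogate followed by a low surrogate, each
in its own three-byte form. The name combine_surrogates is hypothetical,
and the input is a list of values already decoded from the individual byte
sequences.

```python
def combine_surrogates(values):
    """Merge each high/low surrogate pair into one supplementary value.

    Assumption: this models only the pairing step of a UTF-8s-style
    parser; unpaired surrogates are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(values):
        v = values[i]
        nxt = values[i + 1] if i + 1 < len(values) else None
        is_high = 0xD800 <= v <= 0xDBFF
        is_low = nxt is not None and 0xDC00 <= nxt <= 0xDFFF
        if is_high and is_low:
            # Standard surrogate arithmetic: 10 bits from each half.
            out.append(0x10000 + ((v - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            out.append(v)
            i += 1
    return out


if __name__ == "__main__":
    print([f"U+{v:04X}" for v in combine_surrogates([0xD800, 0xDC00])])
```

With this step, the values U+D800 and U+DC00 become the single value
U+10000; without it, they stay two separate values.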

_ Marco


