RE: UTF-8 syntax

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Thu Jun 07 2001 - 11:54:36 EDT


Marco,

The biggest problem with UTF-8s is that most things will work. For example,
a converter that does not thoroughly validate its UTF-8 input will usually
convert UTF-8s to UTF-16 properly. (UTF-8 to UTF-32 converters might
produce bad results.) Now we convert back from UTF-16 to UTF-8. The text
that had 6-byte sequences now has 4-byte sequences. Now I put it back in
the database, which does not thoroughly validate the string, because
validation is added overhead that would significantly impair transaction
performance. The new record does not match the original, so it is added
rather than replacing the old one. You end up with a database anomaly that
no one catches.
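
The round trip described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the helper names
and the dict standing in for the database are hypothetical, and Python's
"surrogatepass" error handler is used to imitate a converter that does not
thoroughly validate its input.

```python
# Sketch only: "surrogatepass" stands in for a converter that skips
# thorough validation; the dict stands in for a database keyed on raw bytes.

def lenient_utf8_to_utf16(data: bytes) -> bytes:
    """Decode without rejecting three-byte surrogate forms, emit UTF-16BE."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-be", "surrogatepass")

def utf16_to_utf8(data: bytes) -> bytes:
    """Ordinary UTF-16BE to UTF-8 conversion."""
    return data.decode("utf-16-be").encode("utf-8")

# U+10000 stored the UTF-8s way: two three-byte surrogate sequences.
original = bytes.fromhex("EDA080EDB080")

utf16 = lenient_utf8_to_utf16(original)   # the surrogate pair D800 DC00
round_tripped = utf16_to_utf8(utf16)      # the four-byte form F0 90 80 80

database = {original: "old record"}
database[round_tripped] = "new record"    # the key differs, so nothing is replaced

print(len(original), len(round_tripped), len(database))
```

Nothing fails along the way: both conversions succeed, and the only
symptom is that the store now holds two records where one was intended.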

The worst errors are the ones that almost work. Blatant errors are easy to
find, but the insidious errors are the nasty ones. This reminds me of an IBM
360 OS error where the default processing for tape errors was to accept the
bad data when you got a write error. We found the bug because we had bad
data on our payroll master tape. Because the errors were not obvious, the
bad data was also on the previous master file, the grandfather file, etc.
The error went back to before our archive cycle.

I am willing to bet my earnings for the next 10 years that if approved,
UTF-8s will bite someone with an undetected problem.

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Marco Cimarosti
Sent: Thursday, June 07, 2001 3:18 AM
To: unicode@unicode.org; 'Peter_Constable@sil.org'
Subject: RE: UTF-8 syntax

Peter Constable wrote:
> As I mentioned in an earlier message, the definitions in
> Unicode are less explicit when it comes to interpretation
> than they are with regard to encoding.

Perhaps the interpretation rules seem unclear NOW because this discussion
has been mixing up UTF-8 and UTF-8s.

The reason for this confusion, I think, is that someone tried to say that
UTF-8s will change nothing because the premises for it are already in
existing UTF-8.

This was such good news that, for a moment, we all wanted to believe it.
Pity that it was not true.

> We are left to infer that "mapped back" means the exact inverse
> of the mapping defined (in the case of UTF-8) in D36. But note:
> making that inference assumes that the mapping in D36 is invertible.
> That requires that the mapping in D36 is injective; i.e. one-to-one,
> as D29 requires. This reinforces that a 6-byte sequence cannot be
> used to represent a supplementary plane character.
> But note the corollary: 6-byte sequences cannot be mapped back to
> a Unicode scalar value, and therefore *are illegal*.

Perhaps the whole matter becomes much simpler if you put it in these terms:
6-byte sequences NEVER existed in UTF-8!

So, <ED A0 80 ED B0 80> is NOT a six-byte sequence: it is two adjacent
THREE-byte sequences: <ED A0 80> and <ED B0 80>, and the meaning of these
sequences is already clear enough by the rules (Table 3.1B): the first one
means U+D800 and the second one means U+DC00.

That's why conformance rules say nothing explicit about "6-byte sequences".
It is for the same reason that they say nothing explicit about a "4-byte
sequence" like <63 69 61 6F>...
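
The point about sequence boundaries can be checked mechanically. The
following is a minimal Python sketch that applies only the bit
distribution of the UTF-8 table; the function names are hypothetical, and
it deliberately omits the overlong and range checks that a conformant
decoder needs.

```python
def sequence_length(lead: int) -> int:
    """Length of the sequence announced by a lead byte (1 to 4 bytes)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid lead byte 0x{lead:02X}")

def split_sequences(data: bytes) -> list:
    """Split data into (sequence, decoded value) pairs, one per lead byte."""
    result = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        seq = data[i:i + n]
        if len(seq) != n or any((b & 0xC0) != 0x80 for b in seq[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        # Payload bits of the lead byte: 7 for n=1, else 5, 4, 3 for n=2, 3, 4.
        value = seq[0] if n == 1 else seq[0] & (0xFF >> (n + 1))
        for b in seq[1:]:
            value = (value << 6) | (b & 0x3F)  # six payload bits per trail byte
        result.append((seq, value))
        i += n
    return result

for seq, value in split_sequences(bytes.fromhex("EDA080EDB080")):
    print(seq.hex(" ").upper(), f"U+{value:04X}")
for seq, value in split_sequences(bytes.fromhex("6369616F")):
    print(seq.hex(" ").upper(), f"U+{value:04X}")
```

The first input splits into two three-byte sequences with values U+D800
and U+DC00, and the second into four one-byte sequences. In neither case
does the parser ever see a unit longer than the lead byte announces.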

Possible lesson: if UTF-8s is approved, six-byte sequences will start
to exist in *that* UTF only. Implementers who decide to support UTF-8s
will have to write a new parser for it: they will not be able to use their
existing UTF-8 parser and pretend that there is no difference.
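
To illustrate why a separate parser is needed, here is a rough Python
sketch of what a UTF-8s decoder might look like next to a strict UTF-8
decoder. The function names are hypothetical. It is an assumption of this
sketch, not something stated in the thread, that UTF-8s would reject
four-byte sequences and unpaired surrogates.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Strict UTF-8: Python's default decoder rejects encoded surrogates."""
    return data.decode("utf-8")

def decode_utf8s(data: bytes) -> str:
    """Hypothetical UTF-8s decoder: each pair of three-byte surrogate
    sequences is combined into one supplementary code point."""
    # Let the three-byte surrogate forms through as lone surrogates first.
    units = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(units):
        cp = ord(units[i])
        if cp >= 0x10000:
            # Assumption: supplementary characters must use the six-byte form.
            raise ValueError("four-byte sequence not allowed in UTF-8s")
        if 0xD800 <= cp <= 0xDBFF and i + 1 < len(units):
            low = ord(units[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Standard surrogate pair arithmetic.
                out.append(chr(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)))
                i += 2
                continue
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("unpaired surrogate")
        out.append(units[i])
        i += 1
    return "".join(out)

six_bytes = bytes.fromhex("EDA080EDB080")
print(hex(ord(decode_utf8s(six_bytes))))

try:
    decode_utf8_strict(six_bytes)
except UnicodeDecodeError as exc:
    print("strict UTF-8 rejects it:", exc.reason)
```

The two decoders disagree on the same six bytes, which is the difference
an implementation cannot ignore.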

_ Marco


