Re: UTF-8 syntax

From: DougEwell2@cs.com
Date: Thu Jun 07 2001 - 01:34:49 EDT


In a message dated 2001-06-06 9:35:45 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> we see that Unicode does not *exclude* D800 and DC00 from the
> codespace for the CCS, and therefore it would seem that that UTF-8 sequence
> would have to be interpreted (in the encoding form level of interpretation)
> as the code points < D800 DC00 >, which have *no* meaning *as codepoints*!!

But definition D29 says that a UTF must round-trip these invalid code points,
so we have no choice but to interpret them as <D800 DC00>. That is why
UTF-8s is ambiguous. The sequence <ED A0 80 ED B0 80> could be mapped as
either <D800 DC00>, because D29 says you have to allow for that, or as
<10000>, because that is the real intent.
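
To make the two readings concrete, here is a rough sketch (mine, not anything from the standard) that unpacks the six bytes by hand. The helper names decode_three_byte and combine_surrogates are made up for illustration; the first applies the ordinary 3-byte bit layout with no surrogate check, and the second applies the usual UTF-16 surrogate arithmetic.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack one 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx).

    Deliberately performs no surrogate check, so it behaves like the
    "round-trip everything" reading of D29.
    """
    if len(seq) != 3 or (seq[0] & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte lead sequence")
    if (seq[1] & 0xC0) != 0x80 or (seq[2] & 0xC0) != 0x80:
        raise ValueError("bad trail byte")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into one scalar value (UTF-16 rule)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


data = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# Reading 1: two separate code points, <D800 DC00>.
high = decode_three_byte(data[0:3])
low = decode_three_byte(data[3:6])

# Reading 2: one supplementary character, <10000>.
scalar = combine_surrogates(high, low)

print(f"as code points: <{high:04X} {low:04X}>")
print(f"as one scalar:  <{scalar:X}>")
```

Same six bytes, two defensible answers, which is exactly the ambiguity.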

Note that UTF-8 is not ambiguous in this regard, unless you permit these
so-called "lenient" processors, which I thought were made non-conformant by
the Corrigendum. The sequence <ED A0 80 ED B0 80> is every bit as much
"overlong" as is <C0 80>.
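
As a quick illustration of the strict reading, here is a small sketch using Python's built-in UTF-8 codec, which I am using only as a convenient example of a non-lenient decoder (it is obviously not what anyone had in mind in this thread). The helper name strictly_decodes is made up. The "surrogatepass" error handler is a real Python option and stands in here for a "lenient" processor.

```python
def strictly_decodes(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


surrogate_pair_form = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
overlong_nul = bytes([0xC0, 0x80])
shortest_form = bytes([0xF0, 0x90, 0x80, 0x80])  # U+10000 in four bytes

# A strict decoder rejects both non-shortest forms alike...
print(strictly_decodes(surrogate_pair_form))
print(strictly_decodes(overlong_nul))

# ...and accepts only the four-byte form for U+10000.
print(strictly_decodes(shortest_form))

# A lenient decoder (here: the "surrogatepass" handler) lets the
# six-byte form through as two lone surrogate code points.
lenient = surrogate_pair_form.decode("utf-8", "surrogatepass")
print([f"{ord(ch):04X}" for ch in lenient])
```

With the strict decoder there is only one byte sequence for U+10000, so nothing is ambiguous; the ambiguity only appears once the lenient path is allowed.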

-Doug Ewell
 Fullerton, California


