Re: UTF-8 syntax

From: DougEwell2@cs.com
Date: Thu Jun 07 2001 - 01:34:49 EDT

In a message dated 2001-06-06 9:35:45 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> we see that Unicode does not *exclude* D800 and DC00 from the
> codespace for the CCS, and therefore it would seem that that UTF-8 sequence
> would have to be interpreted (in the encoding form level of interpretation)
> as the code points < D800 DC00 >, which have *no* meaning *as codepoints*!!

But definition D29 says that a UTF must round-trip these invalid code points,
so we have no choice but to interpret them as <D800 DC00>. That is why
UTF-8S is ambiguous. The sequence <ED A0 80 ED B0 80> could be mapped either
to <D800 DC00>, because D29 says you have to allow for that, or to
<10000>, because that is the real intent.
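
To make the two readings concrete, here is a rough Python sketch. The helper
names (decode_3byte, reading_as_code_points, reading_as_supplementary) are
made up for illustration and are not taken from any specification; the
decoder deliberately performs no range checks, which is the whole problem.

```python
SEQ = bytes.fromhex("EDA080EDB080")

def decode_3byte(b):
    """Decode one 3-byte UTF-8-style group with no range checks (illustrative only)."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

def reading_as_code_points(data):
    # Reading 1: every 3-byte group stands for its own code point.
    return [decode_3byte(data[i:i + 3]) for i in range(0, len(data), 3)]

def reading_as_supplementary(data):
    # Reading 2: the two decoded values are treated as a surrogate pair
    # and combined into a single supplementary code point.
    hi, lo = reading_as_code_points(data)
    return [0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)]

print([hex(c) for c in reading_as_code_points(SEQ)])    # ['0xd800', '0xdc00']
print([hex(c) for c in reading_as_supplementary(SEQ)])  # ['0x10000']
```

Both functions accept the same six bytes and return different results, and
nothing in the bytes themselves says which one was meant.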

Note that UTF-8 is not ambiguous in this regard, unless you permit these
so-called "lenient" processors, which I thought were made non-conformant by
the Corrigendum. The sequence <ED A0 80 ED B0 80> is every bit as much
"overlong" as is <C0 80>.

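As a quick check of the non-lenient behavior, the sketch below uses Python's
built-in "utf-8" codec as a stand-in for a strict processor. That choice is
my assumption for illustration, not something the Corrigendum names, and the
helper strict_utf8_rejects is hypothetical.

```python
def strict_utf8_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 codec refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# The surrogate-pair encoding and the two-byte NUL are both refused.
print(strict_utf8_rejects(bytes.fromhex("EDA080EDB080")))  # True
print(strict_utf8_rejects(bytes.fromhex("C080")))          # True

# The four-byte form is the only one accepted for U+10000.
print(bytes.fromhex("F0908080").decode("utf-8") == "\U00010000")  # True
```

With a processor like this, <F0 90 80 80> is the only byte sequence that
yields <10000>, so the question of which reading was intended never arises.
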
-Doug Ewell
Fullerton, California

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT