On 06/12/2001 09:21:06 PM Jianping Yang wrote:
>Nobody except you thought that 4-byte is allowed in UTF-8S.
Not so! That was under discussion just a few days ago:
<quote>
On 06/07/2001 12:34:49 AM DougEwell2 wrote:
[snip]
>But definition D29 says that a UTF must round-trip these invalid code points,
>so we have no choice but to interpret them as <D800 DC00>. That is why
>UTF-8s is ambiguous. The sequence <ED A0 80 ED B0 80> could be mapped as
>either <D800 DC00>, because D29 says you have to allow for that, or as
><10000>, because that is the real intent.
>
>Note that UTF-8 is not ambiguous in this regard, unless you permit these
>so-called "lenient" processors, which I thought were made non-conformant by
>the Corrigendum. The sequence <ED A0 80 ED B0 80> is every bit as much
>"overlong" as is <C0 80>.
[snip]
</quote>
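
To make the ambiguity in the quoted passage concrete, here is a minimal Python sketch. It is an illustration only, not part of any UTF-8S proposal; the helper names (decode_3byte_units, combine_surrogates, strict_utf8_accepts) are hypothetical. It decodes <ED A0 80 ED B0 80> as two independent 3-byte sequences, which gives <D800 DC00>, then combines that surrogate pair into the scalar value 10000, and compares the result with the 4-byte form that standard UTF-8 uses for the same scalar.

```python
# The byte sequence under discussion: <ED A0 80 ED B0 80>
RAW = bytes.fromhex("EDA080EDB080")


def decode_3byte_units(data):
    """Decode each 3-byte sequence on its own, returning 16-bit values.

    Assumes the input consists only of well-formed 3-byte sequences
    (lead byte 1110xxxx, two trail bytes 10xxxxxx); no validation is done.
    """
    units = []
    for i in range(0, len(data), 3):
        b0, b1, b2 = data[i:i + 3]
        units.append(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F))
    return units


def combine_surrogates(hi, lo):
    """Map a high/low surrogate pair to the supplementary scalar value."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def strict_utf8_accepts(data):
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Reading 1: two separate code points, <D800 DC00>
units = decode_3byte_units(RAW)

# Reading 2: the pair taken together as the single scalar <10000>
scalar = combine_surrogates(*units)

# Standard UTF-8 encodes that scalar as a 4-byte sequence instead
standard = chr(scalar).encode("utf-8")

print([hex(u) for u in units])
print(hex(scalar))
print(standard.hex(" "))
print(strict_utf8_accepts(RAW))
```

The same six bytes support both readings, which is the ambiguity Doug describes. A strict UTF-8 decoder such as Python 3's sidesteps it by rejecting the 6-byte form and accepting only the 4-byte one.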
I believe Ken is quite correct that this is the first time one of the
proponents has given an unambiguous statement on this point. Maybe you've
always thought that UTF-8S didn't allow 4-byte sequences; a lot of us were
trying to figure out what you guys were meaning, and you weren't telling us
until now. Thank you for making that point clear.
- Peter
---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>