Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: DougEwell2@cs.com
Date: Tue Jun 05 2001 - 22:18:27 EDT


In a message dated 2001-06-05 14:24:38 Pacific Daylight Time,
markdavis34@home.com writes:

> For me, that would be the one positive for defining UTF-8S: we could then
> tighten up the definition of UTF-8 to require it to exclude 6-byte forms on
> input. You could then have:
>
> UTF-8: only emits 4byte, only reads 4byte
> UTF-8S: only emits 6byte, only reads 6byte

But there is still a problem, because of definition D29. All UTFs have to be
able to encode every code point, including non-characters and the surrogate
code points 0xD800 through 0xDFFF. That means -- as unlikely as it is in the
real world -- you could have a UTF-8 byte sequence that represents an unpaired
surrogate, and you have to consider it valid, strict UTF-8 (although you can
reject the unpaired surrogate itself).
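
To make that concrete, here is a minimal Python sketch of the 3-byte UTF-8 bit
pattern applied to the lone surrogate 0xD800. The helper name encode_3byte is
my own, purely for illustration; it is not taken from any specification or
library.

```python
def encode_3byte(cp: int) -> bytes:
    """Apply the 3-byte UTF-8 bit pattern (1110xxxx 10xxxxxx 10xxxxxx).

    Hypothetical helper for illustration. It deliberately does NOT
    exclude the surrogate range 0xD800..0xDFFF.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("the 3-byte form covers 0x0800..0xFFFF only")
    return bytes([
        0xE0 | (cp >> 12),           # lead byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # continuation: middle 6 bits
        0x80 | (cp & 0x3F),          # continuation: low 6 bits
    ])


lone_surrogate = encode_3byte(0xD800)
print(lone_surrogate.hex(" "))  # ed a0 80
```

As a side note on one present-day implementation: Python's built-in "utf-8"
codec rejects these three bytes in strict mode and only round-trips them when
the "surrogatepass" error handler is requested.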

I don't like definition D29 personally, but the experts (in particular Mark)
have assured me that it is necessary and justified. In my view, D29 just
throws another monkey wrench into UTF-8S.

Remember that, to handle characters above U+FFFF, a UTF-8S processor would
not actually emit and read 6-byte sequences per se. It would emit and read
*pairs of 3-byte sequences*. The processor then has to put the two together,
using the UTF-16 rules.
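
A rough Python sketch of that round trip, assuming the behavior described
above: split the character into a UTF-16 surrogate pair, write each surrogate
as its own 3-byte sequence, and reverse the process on input. All function
names here are hypothetical, and this is only an illustration of the idea, not
a definitive UTF-8S implementation.

```python
def encode_3byte(cp: int) -> bytes:
    """3-byte UTF-8 bit pattern, surrogates allowed (illustrative helper)."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("the 3-byte form covers 0x0800..0xFFFF only")
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_3byte(b: bytes) -> int:
    """Inverse of encode_3byte for a single 3-byte sequence."""
    if (len(b) != 3 or (b[0] & 0xF0) != 0xE0
            or (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte sequence")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def utf8s_encode_supplementary(cp: int) -> bytes:
    """Encode a character above U+FFFF as a pair of 3-byte sequences."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point above U+FFFF")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)     # UTF-16 high surrogate
    low = 0xDC00 | (v & 0x3FF)    # UTF-16 low surrogate
    return encode_3byte(high) + encode_3byte(low)


def utf8s_decode_supplementary(data: bytes) -> int:
    """Read two 3-byte sequences and join them using the UTF-16 rules."""
    if len(data) != 6:
        raise ValueError("expected exactly two 3-byte sequences")
    high = decode_3byte(data[:3])
    low = decode_3byte(data[3:])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


six = utf8s_encode_supplementary(0x10000)
print(six.hex(" "))                             # ed a0 80 ed b0 80
print(chr(0x10000).encode("utf-8").hex(" "))    # f0 90 80 80
print(hex(utf8s_decode_supplementary(six)))     # 0x10000
```

The same character is four bytes in ordinary UTF-8 and six bytes here, and the
six bytes are really two independent 3-byte units that only acquire meaning
once the processor pairs them up.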

-Doug Ewell
 Fullerton, California


