Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Peter_Constable@sil.org
Date: Tue Jun 05 2001 - 14:05:10 EDT


On 06/05/2001 09:30:00 AM Mark Davis wrote:

>I put samples on:
>
>http://www.macchiato.com/utc/samples_of_utf8.htm

One thing doesn't make sense here: you have "strict" under UTF-8s. Strict
in relation to what? Strict in relation to UTF-8 had to do with XML. Under
XML, *all* the entries under the strict heading for UTF-8s would have to be
errors: the first three because XML doesn't allow them, and the latter
three because UTF-8s (presumably) doesn't allow them.

That brings up one more issue for UTF-8s: what would the status be of a
four-byte sequence in UTF-8s? In UTF-8, they are legal but irregular, and
conformant software is allowed to interpret (but may also choose to reject)
but is not allowed to generate. Would a UTF-8s spec say that it's illegal
to generate four-byte sequences? That could be problematic unless it will
always be clear whether something is UTF-8 or UTF-8s (but yesterday Mark
was suggesting that that shouldn't matter). Would it say that it's illegal
to interpret a four-byte sequence, making that spec stricter than the UTF-8
spec? Again, that would only be possible if it's always clear whether
something is UTF-8 or UTF-8s, but I think we agree that already isn't the
case. So, it would have to say that it's legal to interpret four-byte
sequences. Would it say that a process is allowed to reject four-byte
sequences?
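
For concreteness, here is a minimal Python sketch of the two byte forms at
issue for a supplementary character. It assumes, as the proposal in this
thread appears to, that UTF-8s encodes each UTF-16 surrogate as its own
three-byte sequence, giving six bytes in total. The function names are
hypothetical and are not taken from any spec or library.

```python
def utf8_bytes(cp: int) -> bytes:
    """Standard UTF-8: a supplementary code point takes four bytes."""
    return chr(cp).encode("utf-8")


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def three_byte_form(unit: int) -> bytes:
    """Apply the three-byte UTF-8 bit pattern to one 16-bit code unit."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def utf8s_bytes(cp: int) -> bytes:
    """Assumed UTF-8s form: BMP characters as in UTF-8, supplementary
    characters as two three-byte sequences, one per surrogate."""
    if cp < 0x10000:
        return utf8_bytes(cp)
    high, low = surrogate_pair(cp)
    return three_byte_form(high) + three_byte_form(low)


if __name__ == "__main__":
    cp = 0x10000
    print(utf8_bytes(cp).hex(" "))   # four-byte form
    print(utf8s_bytes(cp).hex(" "))  # six-byte form
```

The questions above come down to what a UTF-8s process should do when it
meets the output of utf8_bytes() rather than that of utf8s_bytes().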

It's unfortunate that some things can't be undone. I think it might have
saved us a bunch of trouble if it had been stated back in 1996 that *all*
non-shortest-form UTF-8 sequences, including 6-byte
supplementary-chars-via-surrogates sequences, were *illegal*. We certainly
wouldn't be having this discussion today.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


