Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (markdavis34@home.com)
Date: Tue Jun 05 2001 - 17:12:59 EDT


I am not an advocate of UTF-8s -- I am just trying to dispel some of the
noise here. I have some specific answers below, but in general:

1. Strict means according to the Unicode definition. See tr27. XML and IETF
do use a stricter definition, which is perfectly acceptable according to the
Unicode Standard, but that is not the meaning of "strict" that I was using.

2. It would, of course, be illegal in XML to supply a document in UTF-8S,
and give no encoding declaration, and have it read 6-byte surrogates. That's
a strawman. It is also illegal in XML to supply a document in Latin-1, and
give no encoding declaration, and have it read the byte A0.

If "UTF-8S" were registered, it would be *perfectly legal* XML to supply the
encoding declaration with "UTF-8S", then include a 6-byte supplementary
character.

Actually, it is legal even if it were not registered. The XML spec
(http://www.w3.org/TR/2000/REC-xml-20001006#charencoding) says "It is
recommended that character encodings registered (as charsets) with the
Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
listed, be referred to using their registered names; other encodings should
use names starting with an "x-" prefix.".

So even now one could use "x-UTF-8S", and be within the recommendations.

3. I am not the one promoting UTF-8S, so I simply guessed that they would
want to have it be parallel to the Unicode definition for UTF-8. They could,
if they wanted to, be only strict -- and not accept 4-byte forms.

For me, that would be the one positive for defining UTF-8S: we could then
tighten up the definition of UTF-8 to require it to exclude 6-byte forms on
input. You could then have:

UTF-8: only emits 4-byte, only reads 4-byte
UTF-8S: only emits 6-byte, only reads 6-byte
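
To make the two forms concrete, here is a minimal Python sketch of what
each encoder would emit for one supplementary character. The function name
encode_utf8s is hypothetical, and the 6-byte layout (each surrogate code
unit written as its own 3-byte sequence) is my reading of the proposal, not
a normative definition.

```python
def encode_utf8s(text: str) -> bytes:
    """Encode text so that each supplementary character becomes 6 bytes.

    Assumption: a supplementary character is split into its UTF-16
    surrogate pair and each surrogate is written as a 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (leading) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
            for unit in (high, low):
                out += bytes([
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ])
        else:
            # BMP characters are encoded exactly as in ordinary UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010000"
print(ch.encode("utf-8").hex())  # f0908080      (4-byte form)
print(encode_utf8s(ch).hex())    # eda080edb080  (6-byte form)
```

Text with no supplementary characters comes out byte-for-byte identical
under both encoders, so the two only differ above U+FFFF.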

For reading text where the encoding was unknown, you could have an
autodetector, one that reads the stream and decides between UTF-8 and
UTF-8S. If you passed a data stream that mixed 4-byte and 6-byte
supplementaries, the autodetector would fail, since it is neither.
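
A rough sketch of such an autodetector in Python follows. The name
detect_utf8_variant and the return value "either" (for streams with no
supplementary characters, where both readings agree) are my own
assumptions. It checks lead and continuation bytes only; it is not a full
validator (it does not reject overlong 3- or 4-byte forms, for instance).

```python
def detect_utf8_variant(data: bytes) -> str:
    """Return "UTF-8", "UTF-8S", or "either"; raise ValueError otherwise."""
    saw4 = saw6 = False
    i, n = 0, len(data)

    def cont(start: int, count: int) -> bool:
        # True if `count` continuation bytes (0x80..0xBF) follow at `start`.
        chunk = data[start:start + count]
        return len(chunk) == count and all(0x80 <= c <= 0xBF for c in chunk)

    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1
        elif 0xC2 <= b <= 0xDF and cont(i + 1, 1):
            i += 2
        elif b == 0xED and i + 1 < n and 0xA0 <= data[i + 1] <= 0xBF:
            # ED A0..BF encodes a surrogate code unit. Accept only a high
            # surrogate immediately followed by a low surrogate (6 bytes).
            pair = data[i:i + 6]
            if (len(pair) == 6 and pair[1] <= 0xAF and cont(i + 2, 1)
                    and pair[3] == 0xED and 0xB0 <= pair[4] <= 0xBF
                    and cont(i + 5, 1)):
                saw6 = True
                i += 6
            else:
                raise ValueError(f"unpaired surrogate at byte {i}")
        elif 0xE0 <= b <= 0xEF and cont(i + 1, 2):
            i += 3
        elif 0xF0 <= b <= 0xF4 and cont(i + 1, 3):
            saw4 = True
            i += 4
        else:
            raise ValueError(f"ill-formed sequence at byte {i}")

    if saw4 and saw6:
        # The stream is neither UTF-8 nor UTF-8S, so detection fails.
        raise ValueError("mixed 4-byte and 6-byte supplementaries")
    if saw6:
        return "UTF-8S"
    if saw4:
        return "UTF-8"
    return "either"
```

For example, detect_utf8_variant(b"\xf0\x90\x80\x80") returns "UTF-8",
the 6-byte sequence b"\xed\xa0\x80\xed\xb0\x80" returns "UTF-8S", and the
two concatenated raise ValueError.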

Mark

----- Original Message -----
From: <Peter_Constable@sil.org>
To: <unicode@unicode.org>
Sent: Tuesday, June 05, 2001 11:05
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

>
>
> On 06/05/2001 09:30:00 AM Mark Davis wrote:
>
> >I put samples on:
> >
> >http://www.macchiato.com/utc/samples_of_utf8.htm
>
> One thing doesn't make sense here: you have "strict" under UTF-8s. Strict
> in relation to what? Strict in relation to UTF-8 had to do with XML. Under
no
> XML, *all* the entries under the strict heading for UTF-8s would have to be
no
> errors: the first three because XML doesn't allow them, and the latter
> three because UTF-8s (presumably) doesn't allow them.
no
>
> That brings up one more issue for UTF-8s: what would the status be of a
> four-byte sequence in UTF-8s? In UTF-8, they are legal but irregular, and
up to proponents
> conformant software is allowed to interpret (but may also choose to reject)
> but is not allowed to generate. Would a UTF-8s spec say that it's illegal
> to generate four-byte sequences? That could be problematic unless it will
It has to be illegal to generate (otherwise it is not determinate). It may
be legal to read; that is up to the proponents.
> always be clear whether something is UTF-8 or UTF-8s (but yesterday Mark
> was suggesting that that shouldn't matter). Would it say that it's illegal
no. What I said was that if you were trying to *autodetect* text, it doesn't
matter.
> to interpret a four-byte sequence, making that spec stricter than the UTF-8
up to proponents
> spec? Again, that would only be possible if it's always clear whether
> something is UTF-8 or UTF-8s, but I think we agree that already isn't the
no.
> case. So, it would have to say that it's legal to interpret four-byte
no
> sequences. Would it say that a process is allowed to reject four-byte
> sequences?
no
>
> It's unfortunate that some things can't be undone. I think it might have
> saved us a bunch of trouble if it had been stated back in 1996 that *all*
> non-shortest-form UTF-8 sequences, including 6-byte
> supplementary-chars-via-surrogates sequences, were *illegal*. We certainly
> wouldn't be having this discussion today.
If pigs could fly,...
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>
>


