Fw: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (mark@macchiato.com)
Date: Wed Jun 13 2001 - 11:16:51 EDT


[I had sent out a message (now gone) saying that I was guessing the
proponents of UTF-8S meant something like
http://www.macchiato.com/utc/samples_of_utf8.htm as their definition. It
would really help to get a complete proposal.]

----- Original Message -----
From: "Mark Davis" <markdavis34@home.com>
To: <Peter_Constable@sil.org>; <unicode@unicode.org>
Sent: Tuesday, June 05, 2001 14:12
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> I am not an advocate of UTF-8s -- I am just trying to dispel some of the
> noise here. I have some specific answers below, but in general:
>
> 1. Strict means according to the Unicode definition. See tr27. XML and IETF
> do use a strict definition, which is perfectly acceptable according to the
> Unicode Standard, but that is not the meaning of "strict" that I was using.
>
> 2. It would, of course, be illegal in XML to supply a document in UTF-8S,
> and give no encoding declaration, and have it read 6-byte surrogates. That's
> a strawman. It is also illegal in XML to supply a document in Latin-1, and
> give no encoding declaration, and have it read the byte A0.
>
> If "UTF-8S" were registered, it would be *perfectly legal* XML to supply the
> encoding declaration with "UTF-8S", then include a 6-byte supplementary
> character.
>
> Actually, it is legal even if it were not registered. The XML spec
> (http://www.w3.org/TR/2000/REC-xml-20001006#charencoding) says "It is
> recommended that character encodings registered (as charsets) with the
> Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> listed, be referred to using their registered names; other encodings should
> use names starting with an "x-" prefix.".
>
> So even now one could use "x-UTF-8S", and be within the recommendations.
>
> 3. I am not the one promoting UTF-8S, so I simply guessed that they would
> want to have it be parallel to the Unicode definition for UTF-8. They could,
> if they wanted to, be only strict -- and not accept 4-byte forms.
>
> For me, that would be the one positive for defining UTF-8S: we could then
> tighten up the definition of UTF-8 to require it to exclude 6-byte forms on
> input. You could then have:
>
> UTF-8: only emits 4-byte, only reads 4-byte
> UTF-8S: only emits 6-byte, only reads 6-byte
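
[Editorial illustration, not part of the original message. A minimal Python sketch of the two forms contrasted above, assuming "6-byte" means a UTF-16 surrogate pair with each surrogate written as a 3-byte sequence. No UTF-8S proposal is quoted in this thread, so the function names and the exact byte layout are assumptions.]

```python
def _three_bytes(unit: int) -> bytes:
    """Encode one 16-bit value (0x0800..0xFFFF) with the 3-byte UTF-8 bit pattern."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_utf8(cp: int) -> bytes:
    """Standard UTF-8: a supplementary code point becomes one 4-byte sequence."""
    return chr(cp).encode("utf-8")


def encode_utf8s(cp: int) -> bytes:
    """Assumed UTF-8S: a supplementary code point becomes 3 + 3 = 6 bytes."""
    if cp < 0x10000:
        # Below U+10000 the two forms are taken to be identical.
        return chr(cp).encode("utf-8")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return _three_bytes(high) + _three_bytes(low)


print(encode_utf8(0x10000).hex(" "))
print(encode_utf8s(0x10000).hex(" "))
```

[Under these assumptions U+10000 comes out as F0 90 80 80 in UTF-8 and as ED A0 80 ED B0 80 in the 6-byte form.]
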
>
> For reading text where the encoding was unknown, you could have an
> autodetector, one that reads the stream and decides between UTF-8 and UTF-8S. If
> you passed a data stream that mixed 4- and 6-byte supplementaries, the
> autodetector would fail, since it is neither.
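
[Editorial illustration, not part of the original message. A rough Python sketch of the autodetector idea described above. It assumes the input is otherwise well-formed and looks only at lead bytes: a lead byte of F0 or above marks a 4-byte supplementary, and ED followed by A0..BF marks an encoded surrogate, i.e. half of a 6-byte form. The name detect_form and the "either" result are hypothetical.]

```python
def detect_form(data: bytes) -> str:
    """Return 'UTF-8', 'UTF-8S', or 'either'; raise ValueError on a mixed stream."""
    saw_four_byte = False
    saw_surrogate = False
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0xF0:
            saw_four_byte = True      # 4-byte supplementary character
            i += 4
        elif lead >= 0xE0:
            # ED A0..BF encodes U+D800..U+DFFF, half of a 6-byte form.
            if lead == 0xED and i + 1 < len(data) and data[i + 1] >= 0xA0:
                saw_surrogate = True
            i += 3
        elif lead >= 0xC0:
            i += 2
        else:
            i += 1                    # ASCII (or a stray continuation byte)
    if saw_four_byte and saw_surrogate:
        raise ValueError("mixed 4-byte and 6-byte supplementaries: neither form")
    if saw_four_byte:
        return "UTF-8"
    if saw_surrogate:
        return "UTF-8S"
    return "either"                   # no supplementaries: the forms coincide
```

[Text with no supplementary characters is reported as "either", since the two forms would be byte-identical in that case. A real detector would also validate continuation bytes and surrogate pairing, which this sketch skips.]
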
>
> Mark
>
> ----- Original Message -----
> From: <Peter_Constable@sil.org>
> To: <unicode@unicode.org>
> Sent: Tuesday, June 05, 2001 11:05
> Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> >
> >
> > On 06/05/2001 09:30:00 AM Mark Davis wrote:
> >
> > >I put samples on:
> > >
> > >http://www.macchiato.com/utc/samples_of_utf8.htm
> >
> > One thing doesn't make sense here: you have "strict" under UTF-8s. Strict
> > in relation to what? Strict in relation to UTF-8 had to do with XML. Under
> no
> > XML, *all* the entries under the strict heading for UTF-8s would have to be
> no
> > errors: the first three because XML doesn't allow them, and the latter
> > three because UTF-8s (presumably) doesn't allow them.
> no
> >
> > That brings up one more issue for UTF-8s: what would the status be of a
> > four-byte sequence in UTF-8s? In UTF-8, they are legal but irregular, and
> up to proponents
> > conformant software is allowed to interpret (but may also choose to reject)
> > but is not allowed to generate. Would a UTF-8s spec say that it's illegal
> > to generate four-byte sequences? That could be problematic unless it will
> has to be illegal to generate (otherwise is not determinate). May be legal
> to read. Up to proponents.
> > always be clear whether something is UTF-8 or UTF-8s (but yesterday Mark
> > was suggesting that that shouldn't matter). Would it say that it's illegal
> no. What I said was that if you were trying to *autodetect* text, it doesn't
> matter.
> > to interpret a four-byte sequence, making that spec stricter than the UTF-8
> up to proponents
> > spec? Again, that would only be possible if it's always clear whether
> > something is UTF-8 or UTF-8s, but I think we agree that already isn't the
> no.
> > case. So, it would have to say that it's legal to interpret four-byte
> no
> > sequences. Would it say that a process is allowed to reject four-byte
> > sequences?
> no
> >
> > It's unfortunate that some things can't be undone. I think it might have
> > saved us a bunch of trouble if it had been stated back in 1996 that *all*
> > non-shortest-form UTF-8 sequences, including 6-byte
> > supplementary-chars-via-surrogates sequences, were *illegal*. We certainly
> > wouldn't be having this discussion today.
> If pigs could fly,...
> >
> >
> > - Peter
> >
> >
>
> > ---------------------------------------------------------------------------
> > Peter Constable
> >
> > Non-Roman Script Initiative, SIL International
> > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > Tel: +1 972 708 7485
> > E-mail: <peter_constable@sil.org>
> >
> >
> >
> >
>


