Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Mon Jan 31 2000 - 08:45:35 EST


On Sun, Jan 30, 2000 at 03:37:14AM -0800, Dan wrote:
>
> > > (Basically, we say: "just pass it on, and what
> > > happens at presentation is undefined." If the agent settles on
> > > presenting it as, say, Latin-1, that is fine with me.)
> >
> > Reasonable for Usenet news, where most content is read by human
> > beings.
>
> I agree.
>
>
> > Here's a more precise version:
> >
> > UTF8-xtra-head-2 = %d192-223
> > UTF8-xtra-head-3 = %d224-239
> > UTF8-xtra-head-4 = %d240-247
> > UTF8-xtra-tail = %d128-191
> > UTF8-xtra-char = UTF8-xtra-head-2 UTF8-xtra-tail
> > | UTF8-xtra-head-3 2*UTF8-xtra-tail
> > | UTF8-xtra-head-4 3*UTF8-xtra-tail
> >
> > Bytes %d248-253 are technically legal but will never be needed,
> > as Unicode/ISO 10646 will never grow beyond hex 0010FFFF except for
> > deprecated additional private-use zones that predate Unicode,
> > and bytes %d254-255 are outright illegal.
>
> ISO 10646 is 31 bits. All possible values should be allowed.
> I do not know why Unicode decided to grow beyond 16 bits
> but not to the full 31 bits of ISO 10646.
> But that is no reason not to allow the full 31 bits in UTF-8 encoded
> text.
>
> Specify UCS (ISO 10646) encoded in UTF-8 without character range limits.
> Do not restrict to current limits of Unicode.

I concur with Dan here. In fact, it is IETF policy that all
Internet protocols be able to process UTF-8, and UTF-8 here means
the encoding as specified in ISO 10646.
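
To make the difference concrete, here is a rough sketch (not a
normative decoder) of the byte ranges quoted above, extended to the
5- and 6-byte forms of the original 31-bit UTF-8 definition. The
function names sequence_length and decode_31bit are made up for
illustration, and the sketch assumes lead bytes 248-251 announce 5
bytes and 252-253 announce 6. It does not reject overlong encodings.

```python
def sequence_length(lead):
    """Total sequence length announced by a lead byte, or 0 if the
    byte can never start a sequence."""
    if lead < 128:
        return 1            # plain 7-bit ASCII
    if 192 <= lead <= 223:
        return 2            # head-2 range from the quoted grammar
    if 224 <= lead <= 239:
        return 3            # head-3
    if 240 <= lead <= 247:
        return 4            # head-4, enough for U+10FFFF
    if 248 <= lead <= 251:
        return 5            # assumption: 31-bit form only
    if 252 <= lead <= 253:
        return 6            # assumption: 31-bit form only
    return 0                # 128-191 are tails, 254-255 are illegal


def decode_31bit(data):
    """Decode bytes into a list of code values up to 0x7FFFFFFF.

    Raises ValueError on a stray tail, an illegal byte, or a
    truncated sequence. Overlong forms are NOT checked.
    """
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0:
            raise ValueError("byte %d cannot start a sequence" % lead)
        if n == 1:
            value = lead
        else:
            tails = data[i + 1:i + n]
            if len(tails) != n - 1 or any(not 128 <= b <= 191 for b in tails):
                raise ValueError("bad or missing tail after byte %d" % lead)
            # Keep only the payload bits of the lead byte.
            value = lead & (0xFF >> (n + 1))
            for b in tails:
                value = (value << 6) | (b & 0x3F)
        values.append(value)
        i += n
    return values
```

A lone Latin-1 byte such as 0xF8 fails here with a ValueError, since
it announces a 5-byte sequence that never arrives. That is the "8-bit
text which is supposed to be UTF-8 but isn't" case from the subject
line.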

Keld Simonsen


