Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Doug Ewell (dewell@compuserve.com)
Date: Sun Jan 30 2000 - 12:49:29 EST


Dan <Dan.Oscarsson@trab.se> wrote:

>> Bytes %d247-253 are technically legal but will never be needed,
>> as Unicode/ISO 10646 will never grow beyond hex 0010FFFF except for
>> deprecated additional private-use zones that predate Unicode,
>> and bytes %254-255 are outright illegal.
>
> ISO 10646 is 31 bits. All possible values should be allowed.
> I do not know why Unicode have decided to grow their bits to
> more than 16 bits, but not to all 31 bits of ISO 10646.
> But that is no reason to not allow full 31 bits in UTF-8 encoded
> text.

There IS a reason: to allow all of Unicode to be expressed in UTF-16,
whose surrogate pairs reach no further than U-0010FFFF.
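
For reference, the U-0010FFFF ceiling discussed in this thread is exactly the reach of UTF-16 surrogate pairs. A minimal Python sketch of that arithmetic follows; the names from_surrogates and MAX_UTF16 are illustrative only and do not come from any library.

```python
# Surrogate ranges as defined for UTF-16.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF     # low (trail) surrogates


def from_surrogates(high, low):
    """Combine one high and one low surrogate into a single code point."""
    if not (HIGH_FIRST <= high <= HIGH_LAST and LOW_FIRST <= low <= LOW_LAST):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)


# The largest value any surrogate pair can express.
MAX_UTF16 = from_surrogates(HIGH_LAST, LOW_LAST)
print(hex(MAX_UTF16))
```

Two 10-bit halves give 2^20 values above the first 0x10000, which is why nothing past U-0010FFFF fits.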

You may certainly write your code to understand all 31 bits, but no
values beyond U-0010FFFF will be assigned, so the extra code will be
unnecessary (although harmless).
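
To make that concrete, here is a rough Python sketch of a decoder that understands the original 1- to 6-byte UTF-8 forms (up to 31 bits) but rejects values above a configurable limit. The function name decode_one, its parameters, and the error messages are assumptions made for illustration; it does not check for surrogate code points and is not meant as a complete validator.

```python
UNICODE_MAX = 0x10FFFF      # highest value Unicode will assign
ISO_10646_MAX = 0x7FFFFFFF  # full 31-bit range of the original UTF-8 scheme

# (lead-byte mask, lead-byte pattern, sequence length, smallest legal value)
_FORMS = [
    (0xE0, 0xC0, 2, 0x80),
    (0xF0, 0xE0, 3, 0x800),
    (0xF8, 0xF0, 4, 0x10000),
    (0xFC, 0xF8, 5, 0x200000),
    (0xFE, 0xFC, 6, 0x4000000),
]


def decode_one(data, pos=0, limit=UNICODE_MAX):
    """Decode one sequence from a bytes object; return (value, next_pos)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1  # plain ASCII
    for mask, pattern, length, minimum in _FORMS:
        if (first & mask) == pattern:
            break
    else:
        # A stray continuation byte (0x80-0xBF), or 0xFE / 0xFF.
        raise ValueError("illegal lead byte 0x%02X" % first)
    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    value = first & ~mask & 0xFF  # payload bits of the lead byte
    for byte in tail:
        if (byte & 0xC0) != 0x80:
            raise ValueError("bad continuation byte 0x%02X" % byte)
        value = (value << 6) | (byte & 0x3F)
    if value < minimum:
        raise ValueError("overlong encoding")
    if value > limit:
        raise ValueError("value 0x%X exceeds limit 0x%X" % (value, limit))
    return value, pos + length
```

Calling decode_one(data, limit=ISO_10646_MAX) accepts the 5- and 6-byte forms; with the default limit they are parsed and then rejected, which is the "unnecessary but harmless" extra code path.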

Please see Unicode Technical Report #19 (UTF-32) for more information.

-Doug Ewell
 Fullerton, California


