Fw: 8-bit text which is supposed to be UTF-8 but isn't

From: Addison Phillips [GSC] (addison@globalsight.com)
Date: Sun Jan 30 2000 - 14:16:47 EST


> ISO 10646 is 31 bits. All possible values should be allowed.
> I do not know why Unicode have decided to grow their bits to
> more than 16 bits, but not to all 31 bits of ISO 10646.
> But that is no reason to not allow full 31 bits in UTF-8 encoded
> text.

 The reason Unicode had to grow was that there turned out to be more than
 2^16 characters to encode. By adding 16 additional planes of 2^16 code
 points each, there is more than enough space to encode everything that
 wouldn't fit into the BMP, and room left for some fantasy scripts to fill
 our idle hours [Cirth, anyone?].

 ISO 10646 has agreed, I thought, to follow Unicode's restriction and
 promised, I thought, not to encode anything "out of bounds".

 The reason for the restriction was the expansion mechanism chosen for
 traditional 16-bit Unicode, which is surrogate pairs: pairs of special
 16-bit values in the BMP that together represent one character in the
 upper planes. Unlike many "stateful" multi-byte character sets from the
 past, Unicode did programmers everywhere a huge favor. There is a
 restricted range of lead-characters (character in the Unicode sense of a
 two-octet, 16-bit character), D800 through DBFF, and a separate restricted
 range of trailing characters, DC00 through DFFF. A lead-character can
 never be anything BUT a lead-character. A trail-character can never be
 anything but a trail-character. This preserves the extremely critical
 Unicode premise that if you see a character value then that *is* the
 character. It may be combined with other characters, but it is never,
 ever, anything else.
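
 The arithmetic behind that mechanism can be sketched in a few lines of
 Python. This is only an illustration, not code from any particular
 library: the function names (is_lead, is_trail, to_surrogates,
 from_surrogates) are made up for the example. It shows that the two
 ranges never overlap, and that the largest value a pair can reach is
 exactly 10FFFF.

```python
# Illustrative sketch; all names here are hypothetical.
LEAD_START, LEAD_END = 0xD800, 0xDBFF    # 1024 lead (high) surrogates
TRAIL_START, TRAIL_END = 0xDC00, 0xDFFF  # 1024 trail (low) surrogates


def is_lead(unit):
    """True if a 16-bit unit can only ever be the first half of a pair."""
    return LEAD_START <= unit <= LEAD_END


def is_trail(unit):
    """True if a 16-bit unit can only ever be the second half of a pair."""
    return TRAIL_START <= unit <= TRAIL_END


def to_surrogates(cp):
    """Split a code point above the BMP into a (lead, trail) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the upper planes")
    offset = cp - 0x10000  # a 20-bit value: 10 bits go in each half
    return LEAD_START + (offset >> 10), TRAIL_START + (offset & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a (lead, trail) pair back into a single code point."""
    if not (is_lead(lead) and is_trail(trail)):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((lead - LEAD_START) << 10) + (trail - TRAIL_START)


# 1024 * 1024 pairs = 16 planes of 65536 code points each.
PLANES_ADDED = (1024 * 1024) // 0x10000

# The largest pair gives the upper bound of the whole code space.
MAX_CODE_POINT = from_surrogates(LEAD_END, TRAIL_END)
```

 Because is_lead and is_trail test disjoint ranges, a program looking at
 any single 16-bit unit can tell what it is without scanning backwards,
 which is the "no shift states" property described above.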

 The alternative was shift states and the re-creation of the whole multibyte
 world. Yuck.

 So: since Unicode has adopted an expansion mechanism that reaches only as
 far as 10FFFF, and since there will never, ever, be any data encoded
 outside that range (we have all been assured), it is IMHO a good idea to
 reflect that fact in your UTF-8 implementation. It is too late to levitate
 out of the corner we are painted into. A system that sees data outside the
 legal range may be dealing with a different encoding or with binary trash
 and should do "something intelligent" (other than reporting that this is
 valid UTF-8).

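 As a rough illustration of what reflecting the limit in a UTF-8
 implementation might look like, here is a minimal Python sketch. The
 function name decode_utf8_strict is hypothetical, and raising ValueError
 is just one possible "something intelligent"; a real system might
 instead substitute a replacement character or try another encoding.

```python
# Illustrative sketch; decode_utf8_strict is a hypothetical name.
MAX_CODE_POINT = 0x10FFFF


def decode_utf8_strict(data):
    """Decode bytes into a list of code points, rejecting illegal values."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick payload bits, continuation count, and the smallest value
        # this sequence length is allowed to encode.
        if b0 < 0x80:
            cp, need, min_cp = b0, 0, 0
        elif 0xC0 <= b0 <= 0xDF:
            cp, need, min_cp = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, min_cp = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            cp, need, min_cp = b0 & 0x07, 3, 0x10000
        else:
            # 0x80-0xBF is a stray continuation byte; 0xF8-0xFF would
            # start the old 5- and 6-byte forms used for 31-bit values.
            raise ValueError("illegal lead byte 0x%02X at offset %d" % (b0, i))
        if i + need >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, need + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            raise ValueError("overlong encoding at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate value encoded at offset %d" % i)
        if cp > MAX_CODE_POINT:
            raise ValueError("value above 10FFFF at offset %d" % i)
        out.append(cp)
        i += need + 1
    return out
```

 Note that a 4-byte sequence can still carry values up to 1FFFFF, so
 rejecting the 5- and 6-byte lead bytes is not enough on its own; the
 explicit comparison against 10FFFF is what enforces the limit.
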
 thanks,

 Addison

 Addison Phillips
 Sr. Globalization Consultant
 GlobalSight Corporation


