From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu May 13 2004 - 04:22:28 CDT
Peter Constable wrote:
> UTF-8 sequences, as originally defined, could be longer than four
> bytes, in order to address codepoints in the vast expanse of UCS-4
> at U+110000..U+FFFFFFFF. Since the accepted code space has been
> constrained to U+0000..U+10FFFF, only four bytes are needed. There
> are non-UTF-8s -- beasts that kind of look like UTF-8 but aren't --
> in which sequences of varying length represent the same character
> and sequences of more than four bytes appear, but they are not
> UTF-8; those byte sequences are considered illegal in UTF-8.
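For concreteness, a minimal lead-byte check reflecting that four-byte
limit could look like the C sketch below (the function name is made
up, and only the first byte of a sequence is checked; continuation
bytes need their own checks):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: accept lead bytes of well-formed UTF-8 as Unicode
       now defines it (at most 4 bytes). 0xC0/0xC1 would only start
       overlong sequences, 0xF5..0xFD would start sequences for values
       above U+10FFFF (the 5- and 6-byte forms included), and
       0xFE/0xFF never occur in UTF-8 at all. */
    static bool utf8_lead_byte_ok(uint8_t b)
    {
        return b <= 0x7F                    /* 1 byte: ASCII */
            || (b >= 0xC2 && b <= 0xDF)     /* 2-byte lead   */
            || (b >= 0xE0 && b <= 0xEF)     /* 3-byte lead   */
            || (b >= 0xF0 && b <= 0xF4);    /* 4-byte lead   */
    }

    int main(void)
    {
        /* 0xF4 is fine; 0xF8 (a 5-byte lead) and 0xFE are not. */
        printf("%d %d %d\n", utf8_lead_byte_ok(0xF4),
               utf8_lead_byte_ok(0xF8), utf8_lead_byte_ok(0xFE));
        return 0;
    }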
1. UCS-4, which is still defined by 10646 (but never by Unicode),
is limited to U-7FFF FFFF (nitpick: for some reason it's "U-",
not "U+"; don't ask me why). U-FFFF FFFF has always been out of
range, probably so that one could use "signed" 32-bit ints (not
all programming languages have unsigned integer types); see the
INT_MAX sketch below.
2. That "original" definition of UTF-8 (which was never in Unicode)
is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
not the same as UTF-8/10646. In practice it does not matter
very much, since there are no (and will never be) any characters
allocated above U+10FFFF, and the private use planes above
U+10FFFF (which were specified in 10646) have been removed.
3. Another nitpick: to reach up to (and above...) U-FFFF FFFF in a
UTF-8-like encoding would push the maximum number of bytes per
character to 7. There would be no data bits in the first byte of
a 7-byte sequence, though, as it would consist of exactly seven
1s and one 0 (see the data-bits sketch below). ;-)
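To put a number on point 1 (my own sketch, nothing from the standards
themselves): U-7FFF FFFF is exactly 2^31 - 1, the largest value a
two's-complement signed 32-bit integer can hold, so every UCS-4 value
fits in a plain signed 32-bit int:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0x7FFFFFFF = 2^31 - 1; 0xFFFFFFFF = 2^32 - 1 would need an
           unsigned type (or a wider signed one). */
        printf("UCS-4 ceiling : %ld\n", 0x7FFFFFFFL);
        printf("32-bit INT_MAX: %d\n", INT_MAX); /* same number where
                                                    int is 32 bits */
        return 0;
    }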
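And for points 2 and 3, a back-of-the-envelope calculation (again just
a sketch; the helper name data_bits is invented): an n-byte UTF-8-style
sequence carries 7 - n data bits in the lead byte (for n >= 2) plus 6
per continuation byte, so 4 bytes give 21 bits (enough for U+10FFFF,
where Unicode's UTF-8 stops), 6 bytes give 31 bits (U-7FFF FFFF, where
10646's UTF-8 stops), and only a 7-byte sequence, led by the
data-bit-free byte 0xFE = 1111 1110, reaches 32 bits and beyond:

    #include <stdio.h>

    /* Invented helper: data bits in an n-byte UTF-8-style sequence
       (lead byte: n ones, a zero, then 7 - n data bits for n >= 2;
       continuation bytes: 10xxxxxx, 6 data bits each). */
    static int data_bits(int n)
    {
        return (n == 1) ? 7 : (7 - n) + 6 * (n - 1);
    }

    int main(void)
    {
        for (int n = 1; n <= 7; n++)
            printf("%d byte(s): %2d data bits, max U-%llX\n",
                   n, data_bits(n), (1ULL << data_bits(n)) - 1);
        /* 4 bytes -> 21 bits (max U-1FFFFF, covers U+10FFFF)
           6 bytes -> 31 bits (max U-7FFFFFFF, the 10646 limit)
           7 bytes -> 36 bits (lead byte 0xFE, no data bits in it) */
        return 0;
    }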
/kent k