Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Feb 01 2000 - 10:45:44 EST


Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:

> My text was maybe unclear. UTF-8 should represent the characters
> of UCS in the code range 0-255 as themselves, just like UTF-16 does
> for UCS in the 16-bit range.
> As there are two sets of control spaces in the first 256 code points,
> and one of them is hardly used, they could be used to make it work.
> But it is too late to fix that now.

UTF-8 uses a range of 128 (actually 126) bytes to represent multi-byte
UCS-4 characters, and it is sometimes criticized for requiring 3 bytes
to cover the range U+0800 to U+FFFF. How long would the sequences be in
a UTF that was restricted to only 32 bytes for multi-byte characters
(and still avoided the lead byte/trail byte ambiguity)? Anyone care to
work that out?
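
A rough way to put numbers on that question is sketched below. It is only
an estimate under stated assumptions, not a worked-out encoding: the 32
available byte values are split into disjoint lead and trail sets (which is
what avoids the lead/trail ambiguity), and every sequence is one lead byte
followed by trail bytes. The function names best_capacity and min_length
are made up for this illustration. Because a real variable-length scheme
would also have to divide its lead bytes among the different lengths, the
result is an optimistic lower bound on the longest sequence needed.

```python
# Sketch: how many code points can n-byte sequences cover when only a
# limited set of byte values is available for multi-byte sequences?
# Assumptions (not from any real encoding):
#   - lead bytes and trail bytes are disjoint sets
#   - a sequence is 1 lead byte + (length - 1) trail bytes
#   - all lead bytes are spent on a single length (optimistic upper bound)

def best_capacity(total_bytes, length):
    """Largest count of distinct sequences of the given length, taken over
    every way to split total_bytes into a lead set and a trail set."""
    best = 0
    for trail in range(1, total_bytes):
        lead = total_bytes - trail
        best = max(best, lead * trail ** (length - 1))
    return best

def min_length(total_bytes, code_points):
    """Shortest sequence length whose best split covers code_points."""
    length = 2
    while best_capacity(total_bytes, length) < code_points:
        length += 1
    return length

if __name__ == "__main__":
    # The 224 code points that would stay single-byte are ignored here;
    # they do not change any of the answers.
    print(min_length(32, 0x10000))    # 16-bit range with 32 bytes
    print(min_length(32, 2 ** 31))    # full 31-bit UCS-4 with 32 bytes
    print(min_length(126, 0x10000))   # 16-bit range with UTF-8's 126 bytes
```

Under these assumptions the 32-byte design needs at least 4 bytes per
character for the 16-bit range and at least 8 for the full 31-bit UCS-4
space, while the same model with 126 bytes gives 3 for the 16-bit range,
which matches what UTF-8 actually does.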

You could not use the C0 control range (0x00 to 0x1F) because so many of
the characters in that range are in extremely common use. You would be
effectively limited to the C1 range (0x80 to 0x9F). Anything more would
require a complex algorithm to avoid CR, LF, FF, and such, and then you
would have something more closely resembling SCSU.

Now if Dan wanted to make the case that Jörg Knappen's UTF-7.5 was a
better design, because it retains some readability for some Latin-1
characters, that would at least be plausible. But UTF-7.5 gives up
some of the design features of UTF-8, such as avoidance of 0xFE and
0xFF, and requires 3-byte sequences beginning at U+0400 instead of
U+0800, which does not endear it to users of Cyrillic, Hebrew, and
Arabic.
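
For comparison, the two UTF-8 properties mentioned above can be checked
directly. The snippet below only exercises standard UTF-8 through Python's
built-in codec; it does not implement UTF-7.5, and the sample characters
are simply ones I picked from the Cyrillic, Hebrew, and Arabic blocks.

```python
# Byte lengths under standard UTF-8 at the boundaries discussed above.
samples = {
    "U+007F": "\u007f",
    "U+0080": "\u0080",
    "U+0416 CYRILLIC CAPITAL LETTER ZHE": "\u0416",
    "U+05D0 HEBREW LETTER ALEF": "\u05d0",
    "U+0628 ARABIC LETTER BEH": "\u0628",
    "U+07FF": "\u07ff",
    "U+0800": "\u0800",
    "U+FFFF": "\uffff",
}
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Check that 0xFE and 0xFF never occur in UTF-8 for the 16-bit range.
# Surrogate code points (U+D800..U+DFFF) are skipped because they are
# not encodable on their own.
no_fe_ff = all(
    0xFE not in encoded and 0xFF not in encoded
    for encoded in (
        chr(cp).encode("utf-8")
        for cp in range(0x10000)
        if not 0xD800 <= cp <= 0xDFFF
    )
)

if __name__ == "__main__":
    for name, n in lengths.items():
        print(f"{name}: {n} byte(s)")
    print("0xFE/0xFF absent:", no_fe_ff)
```

Everything from U+0080 through U+07FF, including the three sample letters,
comes out as 2 bytes, and the switch to 3 bytes happens at U+0800.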

-Doug Ewell
 Fullerton, California


