From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 21 2004 - 02:12:18 EST
Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:
> In developing such a format I have a couple of advantages:
>
> 1. Most C0 controls are forbidden, and will not appear in the data.
> That's already verified. If someone tries to pass in a C0 control
> other than tab, linefeed, or carriage return to setValue, an
> exception is thrown and the data is not stored. Potentially one or
> more of these characters could be used as markers in the stream.
Oooh. That could potentially be a problem with SCSU, since the SCU tag
(needed to switch from single-byte mode to so-called "Unicode mode") is
0x0F, and since characters in the range U+xx00 through U+xx1F (for any
non-zero value of xx) stored in "Unicode mode" would have their LSB
written directly as a byte in the range 0x00 through 0x1F, conflicting
with C0 controls.
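To make the conflict concrete, here is a minimal sketch in Python. It
assumes only that SCSU "Unicode mode" writes each BMP character as its
big-endian UTF-16 code unit; the helper names unicode_mode_bytes and
c0_bytes are hypothetical, and the sketch deliberately skips surrogates
and characters whose high byte falls in the Unicode-mode tag range.

```python
SCU = 0x0F  # SCSU tag that switches from single-byte mode to Unicode mode


def unicode_mode_bytes(ch):
    """Hypothetical helper: bytes for one BMP character in SCSU Unicode mode."""
    cp = ord(ch)
    hi, lo = cp >> 8, cp & 0xFF
    # Assumption: this sketch ignores surrogates and high bytes 0xE0..0xF2,
    # which Unicode mode treats as tags and would need special handling.
    if cp > 0xFFFF or 0xD8 <= hi <= 0xDF or 0xE0 <= hi <= 0xF2:
        raise ValueError("not covered by this sketch")
    return bytes([hi, lo])


def c0_bytes(data):
    """Return every byte in data that falls in the C0 range 0x00..0x1F."""
    return [b for b in data if b < 0x20]


# U+4E01 is an ordinary CJK ideograph, yet its low byte is 0x01.
encoded = bytes([SCU]) + unicode_mode_bytes("\u4e01")
print([hex(b) for b in c0_bytes(encoded)])
```

Both the SCU tag itself and the low byte of U+4E01 show up as bytes in
the C0 range, so neither value would be free to serve as a stream marker.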
BOCU-1 might solve this problem, but multiplying and dividing by 243
doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by the
claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
took only 45% as long as converting it between UTF-16 and UTF-8.)
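For comparison, the two kinds of arithmetic look roughly like this in
Python. This is only an illustration of the operations involved, not a
conforming BOCU-1 encoder: base243_digits is a hypothetical helper that
splits a non-negative value into base-243 digits and omits BOCU-1's
lead-byte ranges and its mapping of digits onto permitted byte values.

```python
def utf8_three_byte(cp):
    """Encode a code point in U+0800..U+FFFF as UTF-8 using shifts and masks."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("three-byte form covers U+0800..U+FFFF only")
    return bytes([
        0xE0 | (cp >> 12),          # top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
        0x80 | (cp & 0x3F),         # low 6 bits
    ])


def base243_digits(value):
    """Hypothetical helper: split a non-negative value into base-243 digits."""
    digits = []
    while True:
        value, digit = divmod(value, 243)  # one division per trail digit
        digits.append(digit)
        if value == 0:
            break
    return digits[::-1]  # most significant digit first


print(utf8_three_byte(0x0915).hex())  # Devanagari letter KA
print(base243_digits(1000))
```

The UTF-8 path needs only shifts, masks, and ORs per byte, while the
base-243 path needs a division and remainder per digit, which is why the
timing claim is surprising at first glance.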
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/