From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Mon Jan 22 2007 - 13:29:23 CST
Doug Ewell wrote:
> Conversion to and from UTF-8 is really quite simple. It may look like a
> lot of lines of code, but most of it is conditional -- only one of the
> branches runs for each lead byte.
I'm not claiming UTF-8 is complicated - I've implemented encoders/decoders for all standard Unicode encoding schemes and know how they work. But you'll have to admit that UTF-21A (as I refer to it in my original post) is still much simpler and more straightforward.
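For illustration, here is roughly what I mean - a minimal sketch in C, assuming the layout from my original post: the 21 bits of the code point split into three 7-bit groups, most significant group first, with the top bit of every octet left at 0 (the function names here are just for illustration):

    #include <stdint.h>

    /* Sketch only: pack a code point (<= U+10FFFF, i.e. at most 21 bits)
       into three octets, 7 bits per octet, most significant group first. */
    static void utf21a_encode(uint32_t cp, uint8_t out[3])
    {
        out[0] = (cp >> 14) & 0x7F;   /* bits 20..14 */
        out[1] = (cp >>  7) & 0x7F;   /* bits 13..7  */
        out[2] =  cp        & 0x7F;   /* bits  6..0  */
    }

    static uint32_t utf21a_decode(const uint8_t in[3])
    {
        return ((uint32_t)(in[0] & 0x7F) << 14)
             | ((uint32_t)(in[1] & 0x7F) <<  7)
             |  (uint32_t)(in[2] & 0x7F);
    }

Compare that with the branching a UTF-8 decoder needs for 1-, 2-, 3- and 4-byte sequences, plus the checks for overlong forms and surrogates.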
> Your ideas reminded me of the variable-length scheme Frank mentioned.
> (I thought I had invented that one too, based on Mark Crispin's
> mostly-whimsical UTF-9 RFC.)
Uh... actually my main point was to devise a fixed-length encoding for Unicode that wouldn't waste 1 octet out of 4 and could make some use of the remaining 3 spare bits.
> But for storage purposes, you don't want to use 3 bytes for each
> character -- not with the overwhelming prevalence of BMP characters in
> almost all text. There's a reason why almost nobody uses UTF-32;
> cutting the storage from four bytes to three won't change that.
I wouldn't say that cutting storage requirements by 25% is insignificant. And consider the convenience of a fixed-length format for many text-processing tasks - you can enumerate characters simply by stepping through octet triplets, without having to check for multibyte sequences or surrogate pairs each time. This could save a lot of computation on long texts. Again, if space is a big issue and fixed-length properties are not required, UTF-21A data can easily be converted to the variable-length format proposed by Frank Ellermann, and then just as easily converted back when fixed length is preferred over saving space.
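To illustrate the point - a minimal sketch in C, assuming the same 7+7+7 bit layout as in the sketch above (the function name is just for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Random access to the n-th character is a single multiplication -
       no scanning for lead bytes or surrogate pairs.  Assumes the same
       3-octet, 7-bits-per-octet layout as the encode/decode sketch above. */
    static uint32_t utf21a_char_at(const uint8_t *buf, size_t n)
    {
        const uint8_t *p = buf + 3 * n;
        return ((uint32_t)(p[0] & 0x7F) << 14)
             | ((uint32_t)(p[1] & 0x7F) <<  7)
             |  (uint32_t)(p[2] & 0x7F);
    }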
> And for interchange, you don't want the overhead of calculating or checking
> parity for each 3-byte series. It's not as computationally cheap as it
> seems, compared to decoding UTF-8 or even SCSU. (The complexity of
> decoding SCSU is vastly overstated, as I wrote in Unicode Technical Note
> #14.)
Well, UTF-24A should be regarded as an extension of UTF-21A that provides a built-in error detection mechanism where one is required. Once validated, UTF-24A data can be processed as UTF-21A by simply ignoring the most significant bit of each octet. After the text has been modified, inserted characters are easy to detect by checking the most significant bit of the most significant octet, so parity has to be recalculated only for those code units. Again, if data integrity and byte order are not a concern, the text can be converted to UTF-21A by simply resetting the 8th bit of every octet to 0.
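That last step is trivial - a sketch in C of the UTF-24A-to-UTF-21A conversion (the exact placement of the parity bits is as described in my original post and is not reproduced here; the function name is just for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Strip UTF-24A down to UTF-21A in place: once the parity bits have
       been checked (or if integrity and byte order are not a concern),
       simply reset the 8th bit of every octet. */
    static void utf24a_to_utf21a(uint8_t *buf, size_t octets)
    {
        size_t i;
        for (i = 0; i < octets; i++)
            buf[i] &= 0x7F;
    }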
> It's true that you don't need a Byte Order Mark per se with a byte-based
> encoding such as this, but you might still want to be able to use U+FEFF
> as an encoding signature. All Unicode encodings have this defined. The
> problem with U+FEFF is not so much its use as a byte order mark or
> signature, but rather its parallel and conflicting use as a zero-width
> no-break space (which was never widely used and which is now
> deprecated).
Well, the nice thing about UTF-24A is that its code units' pattern is quite distinct and easily detectable, so you don't need U+FEFF even as a signature. Admittedly, UTF-8 multibyte sequences are also quite distinct - but in a predominantly-ASCII text without a U+FEFF signature, you might have to analyze the content of the entire file in order to tell whether it is UTF-8, a legacy charset or plain 7-bit ASCII. Hence, a file originally encoded as UTF-8 can easily be mistaken for legacy charset data by the user and/or software and edited as such. As a result, a "bastard" UTF-8/charset-encoded file is produced, with all implications thereof... Which proves, BTW, that ASCII transparency is not always a good thing.
Ruszlán