From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 22 2007 - 15:59:39 CST
Ruszlan said:
> Uh... actually my main point was to devise a fixed-length encoding
> for Unicode that wouldn't waste 1 octet out of 4 and could make
> some use of the remaining 3 spare bits.
And I don't think anyone is disputing that you have done that.
But...
"Wasting" 1 octet out of 4 is a non-issue. If text storage is
the concern, as Mark already pointed out, then either UTF-8 or
UTF-16 is going to be more efficient than UTF-21A or UTF-24A.
Since Unicode text consists of the same or fewer bytes in either UTF-8
or UTF-16 for all but the most contrived of texts, it will also be more
efficient for interchange -- which is the real bottleneck, rather
than raw storage space per se.
UTF-21/24 is only a compression win when compared against UTF-32,
but implementers *already* have better options in UTF-8 or UTF-16,
if they are counting bytes.
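To put rough numbers on that, here is a small C sketch (the function
names are mine, purely for illustration) tabulating bytes per code point:
within the BMP neither UTF-8 nor UTF-16 ever needs more than 3 octets,
and UTF-16 needs only 2, so a flat 3-octet form only wins against UTF-32.

#include <stdio.h>
#include <stdint.h>

/* Bytes needed per code point in each encoding form. */
static int utf8_len(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

static int utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;   /* surrogate pair above the BMP */
}

int main(void)
{
    /* A, GREEK SMALL LETTER ALPHA, a CJK ideograph, MUSICAL SYMBOL G CLEF */
    const uint32_t samples[] = { 0x41, 0x3B1, 0x4E00, 0x1D11E };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t cp = samples[i];
        printf("U+%04X: UTF-8 %d, UTF-16 %d, UTF-21/24 always 3\n",
               (unsigned)cp, utf8_len(cp), utf16_len(cp));
    }
    return 0;
}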
The counter is that UTF-21/24 is fixed-width -- unlike UTF-8
and UTF-16 -- which is true. But that benefit is an
illusion, because it isn't a true processing form. Characters
in UTF-21/24 aren't integral values, but sequences of 3 bytes
that have to be unpacked for interpretation. Effectively you
have to turn them into UTF-32 for processing *anyway*, and
since modern processor architectures use 32-bit or 64-bit registers,
and *don't* use 24-bit registers, you aren't doing anybody
any favors by using a "packed" form for the characters which
have to be unpacked into integral values for decent processing
characteristics in the first place.
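As a rough sketch of what that unpacking amounts to -- assuming the
layout used in the example further down (three octets per character,
7 payload bits each, most significant group first) -- every "fixed-width"
code unit still has to be reassembled into a 32-bit scalar before
anything useful can be done with it:

#include <assert.h>
#include <stdint.h>

/* Reassemble one UTF-21A code unit (3 octets, low 7 bits each) into
   a 32-bit scalar value -- i.e., turn it into UTF-32 after all. */
static uint32_t utf21a_unpack(const uint8_t *p)
{
    return ((uint32_t)(p[0] & 0x7F) << 14)
         | ((uint32_t)(p[1] & 0x7F) << 7)
         |  (uint32_t)(p[2] & 0x7F);
}

int main(void)
{
    const uint8_t yi[3] = { 0x01, 0x1C, 0x00 };  /* U+4E00, per the example below */
    assert(utf21a_unpack(yi) == 0x4E00);
    return 0;
}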
Finally, "mak[ing] some use of the remaining 3 spare bits" I
consider to be a cardinal error in character encoding design.
Using bits in a character encoding form for anything other than
the representation of the integral values of the encoded characters
is a guarantee of additional complexity, errors, and opportunities
for data corruption in handling text data. Putting "parity bits"
into character streams is just bad design. Data integrity
for streaming data should be handled instead by
*data* protocols that handle the problem generically for *all*
data, including any embedded character data.
> I wouldn't say that cutting storage requirements by 25% is insignificant.
Except that your comparison is against the strawman of storing all
Unicode text as raw UTF-32 data, which is almost never the case.
> And consider convenience of fixed-length format for many text-processing
> tasks - you can enumerate characters by simply enumerating the octets,
> without having to perform additional checks for multibyte sequences or
> surrogate pairs each time. This could save a lot of computation on
> long texts.
This argument won't fly, because it presumes that text stored
that way is simultaneously appropriate for *other*
text-processing tasks, which is manifestly not the case. You can always
create special-purpose formats that perform well for some particular
algorithmic requirement (here: enumerating characters). But such formats
fail as general-purpose formats for text if they introduce countervailing
complexity in other tasks, as UTF-21/24 would.
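For what it's worth, the one task the quoted argument singles out --
counting characters -- is already a cheap single pass in UTF-8, since
continuation bytes are self-marking. A sketch (again mine, with the same
layout assumption for UTF-21A as above):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

static size_t count_utf21a(const uint8_t *s, size_t len)
{
    (void)s;            /* fixed width: no scan needed at all */
    return len / 3;
}

static size_t count_utf8(const uint8_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* count everything but 10xxxxxx */
            n++;
    return n;
}

int main(void)
{
    const uint8_t u8[]  = { 0xE4, 0xB8, 0x80, 0xE4, 0xBA, 0x8C };  /* U+4E00 U+4E8C */
    const uint8_t u21[] = { 0x01, 0x1C, 0x00, 0x01, 0x1D, 0x0C };  /* same two characters */
    printf("%zu %zu\n", count_utf8(u8, sizeof u8), count_utf21a(u21, sizeof u21));
    return 0;
}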
> Again, if space requirement is a big issue and fixed-length properties
> are not required, UTF-21A data can easily be converted to the
> variable-length format proposed by Frank Ellermann, and then,
> just as easily converted back when fixed-length is preferred over
> saving space.
And this is frankly no gain over doing UTF-8/UTF-32 conversions, with
the net result of better compression *and* ASCII compatibility for
the UTF-8 form and cleaner processing for the UTF-32 form.
> Well, UTF-24A should be regarded as an extension of UTF-21A that
> provides a built-in error detection mechanism if required.
It isn't (required).
> Once validated, UTF-24A data can be processed as UTF-21A by
> simply ignoring the most significant bits of each octet. After
> the text was modified, inserted characters would be easy to detect
> by simply checking the most significant bit of the most significant
> octet, so parity will have to be recalculated only for those code
> units.
And that is the kind of mixing of levels that no text processing
algorithm should get involved in.
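Just to illustrate the kind of bookkeeping that means -- and this is only
an illustration, since the exact UTF-24A check-bit rule isn't restated
here; the sketch assumes even parity over each octet's low 7 bits, stored
in its high bit -- every routine that edits the text ends up carrying
transport-level concerns around with it:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical check bit: even parity over the 7 payload bits,
   stored in the octet's high bit. */
static uint8_t with_parity(uint8_t octet)
{
    uint8_t bits = octet & 0x7F, odd = 0;
    for (int i = 0; i < 7; i++)
        odd ^= (bits >> i) & 1;        /* 1 if the payload has an odd bit count */
    return (uint8_t)((odd << 7) | bits);
}

/* After any insertion or edit, the string layer has to revisit the
   affected octets and re-derive their check bits. */
static void recalculate_parity(uint8_t *octets, size_t count)
{
    for (size_t i = 0; i < count; i++)
        octets[i] = with_parity(octets[i]);
}

int main(void)
{
    uint8_t u[3] = { 0x01, 0x1C, 0x00 };   /* U+4E00 in the UTF-21A example below */
    recalculate_parity(u, 3);              /* hypothetical check-bit form */
    return 0;
}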
> Again, if data integrity and byte order is not a concern, the
> text can be converted to UTF-21A by simply resetting all 8th bits
> to 0.
At which point, UTF-21A loses all self-synchronization properties,
making it worse than UTF-8 or UTF-16 in that regard. You get
a fixed-width but *multi*-byte encoding with two possible
incorrect registrations of character edges. So a byte error in
handling can destroy the *entire* text.
<U+4E00, U+4E8C, U+4E09, U+56DB>
Chinese for "yi1, er2, san2, si4" '1, 2, 3, 4'
UTF-21A -->
<01 1C 00 01 1D 0C 01 1C 09 01 2D 5B>
If you lost the 2nd byte by an error, then the resulting
byte sequence:
<01 00 01 1D 0C 01 1C 09 01 2D 5B>
would reconvert to:
<U+4001, U+74601, U+70481, ???>
in other words, an unrelated Chinese character, two unassigned
code points on plane 7, and a conversion error
for the last two bytes. This is on top of the fact that the
UTF-21A string would have other troubles in much char-oriented
string handling, because of the embedded null bytes and other
byte values that collide with ASCII characters they do not
represent. (You could, of course, fix that by setting the high
bit to 1 for all octets instead, giving you yet another flavor
of UTF-21A, but anyway...)
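A short C sketch (mine, using the same 7-bit-per-octet layout) reproduces
the damage: decode the intact 12-octet stream, then the same stream with
its 2nd byte dropped, and every character after the error comes out wrong:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-21A: every 3 octets, low 7 bits each, MSB group first. */
static void dump_utf21a(const uint8_t *s, size_t len)
{
    size_t i;
    for (i = 0; i + 3 <= len; i += 3) {
        uint32_t cp = ((uint32_t)(s[i]     & 0x7F) << 14)
                    | ((uint32_t)(s[i + 1] & 0x7F) << 7)
                    |  (uint32_t)(s[i + 2] & 0x7F);
        printf("U+%04X ", (unsigned)cp);
    }
    if (i < len)
        printf("<%zu trailing octet(s): conversion error>", len - i);
    printf("\n");
}

int main(void)
{
    const uint8_t good[] = { 0x01, 0x1C, 0x00, 0x01, 0x1D, 0x0C,
                             0x01, 0x1C, 0x09, 0x01, 0x2D, 0x5B };
    const uint8_t bad[]  = { 0x01,       0x00, 0x01, 0x1D, 0x0C,   /* 2nd byte lost */
                             0x01, 0x1C, 0x09, 0x01, 0x2D, 0x5B };
    dump_utf21a(good, sizeof good);  /* U+4E00 U+4E8C U+4E09 U+56DB */
    dump_utf21a(bad,  sizeof bad);   /* U+4001 U+74601 U+70481 and an error */
    return 0;
}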
Compare UTF-8 -->
<E4 B8 80 E4 BA 8C E4 B8 89 E5 9B 9B> (same number of bytes, notice)
If you lost the 2nd byte by an error, then the resulting
byte sequence:
<E4 80 E4 BA 8C E4 B8 89 E5 9B 9B>
would reconvert to:
<???, U+4E8C, U+4E09, U+56DB>
Because UTF-8 is self-synchronizing, the loss of data is
localized and doesn't propagate down the string.
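A companion sketch of why the loss stays local: a deliberately forgiving
UTF-8 reader (not the Unicode Standard's exact error-handling rules) that
reports one error for the damaged sequence and resynchronizes at the next
lead byte:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* How many continuation bytes a lead byte announces; -1 if the byte
   cannot start a sequence. */
static int u8_expected(uint8_t b)
{
    if (b < 0x80) return 0;
    if ((b & 0xE0) == 0xC0) return 1;
    if ((b & 0xF0) == 0xE0) return 2;
    if ((b & 0xF8) == 0xF0) return 3;
    return -1;
}

static void dump_utf8(const uint8_t *s, size_t len)
{
    for (size_t i = 0; i < len; ) {
        int n = u8_expected(s[i]);
        int ok = (n >= 0) && (i + (size_t)n < len);
        uint32_t cp = ok ? (n == 0 ? s[i] : (uint32_t)(s[i] & (0x3F >> n))) : 0;
        for (int k = 1; ok && k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                ok = 0;                               /* truncated sequence */
            else
                cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (ok) {
            printf("U+%04X ", (unsigned)cp);
            i += (size_t)n + 1;
        } else {
            printf("<error> ");
            i++;                                      /* resync: skip one byte, */
            while (i < len && (s[i] & 0xC0) == 0x80)  /* then any stray trail bytes */
                i++;
        }
    }
    printf("\n");
}

int main(void)
{
    const uint8_t bad[] = { 0xE4, 0x80, 0xE4, 0xBA, 0x8C,        /* 2nd byte lost */
                            0xE4, 0xB8, 0x89, 0xE5, 0x9B, 0x9B };
    dump_utf8(bad, sizeof bad);   /* <error> U+4E8C U+4E09 U+56DB */
    return 0;
}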
--Ken