From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Tue Jan 23 2007 - 06:53:54 CST
Kenneth Whistler wrote:
> "Wasting" 1 octet out of 4 is a non-issue.
So you are saying that, because we have enough storage space not to count every bit,
we should clutter it with meaningless NUL bytes at every possible opportunity,
simply because we can *afford to*? Like storing 24bpp graphics in 32bpp format? Why
not UTF-64 and 64bpp bitmaps while we are at it? (What the heck - storage space is a
non-issue!)
But if you are *really serious* about garbaging your HD with meaningless NULs, I
personally recommend appending \x00 to a garbage.bin file in an infinite loop - this
method requires at most 3 lines of code, while being ultimately superior to any
other approach! ;)
> UTF-21/24 is only a compression win when compared against UTF-32,
> but implementers *already* have better options in UTF-8 or UTF-16,
> if they are counting bytes.
UTF-8 and UTF-16 provide better compression for *some* ranges at the expense of
*others* - based on the authors' preferences for certain scripts. This is comparable
to a 24bpp graphics encoding format which would encode the blue component with 8 bits
at the expense of encoding red with 32 bits, simply because the author likes shades
of blue and dislikes shades of red.
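To make the range trade-off concrete, here is a quick C sketch of the number of
bytes each form spends per code point (these lengths follow directly from the
definitions of UTF-8 and UTF-16; the function names are just mine):

    #include <stdint.h>

    /* Bytes UTF-8 spends per code point, by range. */
    int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;  /* ASCII: the favoured range             */
        if (cp < 0x800)   return 2;  /* Latin supplements, Greek, Cyrillic... */
        if (cp < 0x10000) return 3;  /* rest of the BMP: CJK, Indic, ...      */
        return 4;                    /* supplementary planes                  */
    }

    /* Bytes UTF-16 spends per code point: the BMP is favoured over the
       other 16 planes. */
    int utf16_len(uint32_t cp)
    {
        return cp < 0x10000 ? 2 : 4;
    }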
> The counter is that UTF-21/24 is fixed-width -- unlike UTF-8
> and UTF-16 -- which is true. But that benefit is an
> illusion, because it isn't a true processing form. Characters
> in UTF-21/24 aren't integral values, but sequences of 3 bytes
> that have to be unpacked for interpretation.
Who said UTF-21/24 *can't* be handled as integral values? What exactly prevents you
from reading three consecutive octets (24 bits) into a dword variable? Or outputting
the 24 least significant bits of a dword to a file while discarding the rest?
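For illustration, a minimal C sketch of exactly that - reading one 3-octet code unit
into a 32-bit integer and writing the low 24 bits back out (the big-endian octet
order is my own assumption, not something quoted from any spec in this thread):

    #include <stdint.h>
    #include <stdio.h>

    /* Read one 24-bit code unit into an integral value; -1 on EOF/short read. */
    long read_cu24(FILE *in)
    {
        unsigned char b[3];
        if (fread(b, 1, 3, in) != 3)
            return -1;
        return ((long)b[0] << 16) | ((long)b[1] << 8) | (long)b[2];
    }

    /* Write the 24 least significant bits of cp, discarding the top 8. */
    int write_cu24(FILE *out, uint32_t cp)
    {
        unsigned char b[3] = { (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF };
        return fwrite(b, 1, 3, out) == 3 ? 0 : -1;
    }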
> Effectively you have to turn them into UTF-32 for processing *anyway*,
> and since modern processor architectures use 32-bit or 64-bit registers,
> and *don't* use 24-bit registers
Have you ever considered the possibility that Unicode might outlive the now-modern
32/64-bit processors by many decades (just as ASCII outlived the teleprinters it was
principally designed for)? The fact that a 24-bit sequence is effectively the same
as a 32-bit sequence for most currently used processors does not mean it will still
be true 10 years from now.
> you aren't doing anybody any favors by using a "packed" form for
> the characters which have to be unpacked into integral values for
> decent processing characteristics in the first place.
Again, who said anything about "packed" form?
> Finally, "mak[ing] some use of the remaining 3 spare bits" I
> consider to be a cardinal error in character encoding design.
> Using bits in a character encoding form for anything else than
> representation of the integral values of the encoded characters
> is a guarantee to additional complexity, errors, and opportunities
> for data corruption in handling text data.
Wait a minute, don't UTF-8 multibyte sequences and UTF-16 surrogate pairs do *just
that*?
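Just as a reminder of what those bits look like: the 0xD800/0xDC00 marker bits in a
UTF-16 surrogate pair carry no part of the scalar value at all, they only mark
structure (this is the standard UTF-16 construction, shown here only to make the
point):

    #include <stdint.h>

    /* Split a supplementary-plane code point (cp > 0xFFFF) into a surrogate
       pair: 6 fixed marker bits + 10 payload bits per code unit. */
    void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        uint32_t v = cp - 0x10000;            /* 20 significant bits remain */
        *hi = 0xD800 | (uint16_t)(v >> 10);   /* high 10 bits of payload    */
        *lo = 0xDC00 | (uint16_t)(v & 0x3FF); /* low 10 bits of payload     */
    }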
>> I wouldn't say that cutting storage requirements by 25% is insignificant.
>
> Except that your comparison is against the strawman of storing all
> Unicode text as raw UTF-32 data, which is almost never the case.
As pointed out in previous posts, the main reason Unicode text is almost never
stored as raw UTF-32 is that UTF-32 is a bad design for storage.
>> And consider convenience of fixed-length format for many text-processing
>> tasks - you can enumerate characters by simply enumerating the octets,
>> without having to perform additional checks for multibyte sequences or
>> surrogate pairs each time. This could save a lot of computation on
>> long texts.
>
> This argument won't fly, because it presumes that such text
> stored that way can be appropriate at the same time for *other*
> text-processing tasks, which is manifestly not the case. You can always
> create special-purpose formats that perform well for some particular
> algorithmic requirement (here: enumerating characters). But such formats
> fail as general-purpose formats for text if they introduce countervailing
> complexity in other tasks, as UTF-21/24 would.
What complexity? Give me one reason why UTF-21 code units can't be treated as scalar
values.
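They can - and then the "enumerating characters" case really is just pointer
arithmetic. A sketch (again with my assumed big-endian octet order):

    #include <stddef.h>
    #include <stdint.h>

    /* With a fixed 3-octet code unit, the i-th character is simply the
       scalar value at byte offset 3*i - no surrogate or lead-byte checks. */
    uint32_t char_at(const unsigned char *buf, size_t i)
    {
        const unsigned char *p = buf + 3 * i;
        return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | (uint32_t)p[2];
    }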
> And this is frankly no gain over doing UTF-8/UTF-32 conversions, with
> the net result of better compression *and* ASCII compatibility for
> the UTF-8 form and cleaner processing for the UTF-32 form.
Can't really see any point here except for the "ASCII compatibility" argument in
favor of UTF-8. But, IMHO, ASCII compatibility is not always an advantage, as it
often leads to confusion of data formats and creation of invalidly-encoded data
(ever seen web pages where UTF-8 is intermixed with a legacy charset?). We may need
ASCII transparency for the sake of legacy protocols and parsers, but I don't think
UTF-8 should be used as the native format for new applications *exactly* because of
its ASCII compatibility.
>> Well, UTF-24A should be regarded as an extension of UTF-21A that
>> provides a built-in error detection mechanism if required.
>
> It isn't (required).
>
> . . .
>
> At which point, UTF-21A loses all self-synchronization properties,
> making it worse than UTF-8 or UTF-16 in that regard. You get
> a fixed-width but *multi*-byte encoding with two possible
> incorrect registrations of character edges. So a byte error in
> handling can destroy the *entire* text.
>
> . . .
>
> Because UTF-8 is self-synchronizing, the loss of data is
> localized and doesn't propagate down the string.
Is my English so bad, or is there a contradiction between these three statements
from different parts of the post? First you say error detection is not required, and
claim that UTF-24 is therefore useless. Then you describe the problem of byte errors
(which is exactly what UTF-24 was designed to address) and how UTF-21 is vulnerable
to it (which is obvious by definition). Finally you praise UTF-8 for its built-in
error-detection mechanism (which is exactly the opposite of what you claimed in your
first statement).
BTW, UTF-16 is no better and no worse than UTF-21 as far as byte errors are
concerned. The problem is *exactly the same* for UTF-16, UTF-32 and UTF-21.
Therefore, such formats should only be used where data integrity is assured by other
means.
>> Once validated, UTF-24A data can be processed as UTF-21A by
>> simply ignoring the most significant bits of each octet. After
>> the text was modified, inserted characters would be easy to detect
>> by simply checking the most significant bit of the most significant
>> octet, so parity will have to be recalculated only for those code
>> units.
>
> And that is the kind of mixing of levels that no text processing
> algorithm should get involved in.
Right! Transcoding several MB worth of text from UTF-8 to UTF-32 and back is
definitely *less* computationally expensive than recalculating parity bits for a
handful of newly added characters!
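The exact UTF-24A check-bit layout isn't restated in this message, so the sketch
below just assumes one hypothetical rule to show how cheap the recalculation is:
bit 7 of the first octet is a fixed marker, bits 7 of the second and third octets
carry even parity over the payload, and raw UTF-21A insertions have all three high
bits clear:

    #include <stddef.h>
    #include <stdint.h>

    /* Even parity of the set bits in v. */
    static int parity(unsigned v)
    {
        int p = 0;
        while (v) { p ^= v & 1; v >>= 1; }
        return p;
    }

    /* Recompute check bits only for code units inserted as raw UTF-21A
       (high bit of the first octet still clear); valid units are skipped. */
    void recalc_check_bits(unsigned char *buf, size_t nchars)
    {
        for (size_t i = 0; i < nchars; i++) {
            unsigned char *u = buf + 3 * i;
            if (u[0] & 0x80)
                continue;                         /* already valid UTF-24A */
            uint32_t payload = ((uint32_t)(u[0] & 0x7F) << 14)
                             | ((uint32_t)(u[1] & 0x7F) << 7)
                             |  (uint32_t)(u[2] & 0x7F);
            u[0] |= 0x80;                         /* hypothetical marker   */
            u[1] = (u[1] & 0x7F) | (parity(payload >> 11) << 7);
            u[2] = (u[2] & 0x7F) | (parity(payload & 0x7FF) << 7);
        }
    }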
> This on top of the fact that the
> UTF-21A string would have other troubles in much char-oriented
> string handling, because of the embedded null bytes and other
> byte values unrelated to their ASCII values. (You could, of
> course fix that by setting the high bit to 1 for all octets,
> instead, giving you yet another flavor of UTF-21A, but anyway...)
UTF-7 and UTF-8 are fine where backward compatibility with 7-bit ASCII or 8-bit
charsets, respectively, is an issue, just as UTF-16 is fine as a solution for
backward compatibility with UCS-2. But should we really promote legacy-friendly
formats where we don't have to?
Ruszlan