From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Sep 19 2006 - 00:46:00 CDT
Some additional thoughts:
There are certain types of applications (web log processing, for one)
where handling vast amounts of "text" data is important. In these
situations, a reasonably dense representation of the data enables more
processing with far fewer cache misses. The size of the mass-storage
device is irrelevant; it's the size of the 'peephole' represented by
your cache that's the limiting factor.
In such situations, you cannot afford to compress and decompress the
data, because most of it is seen only once.
Finally, if most (much) of your data is ASCII due to the ASCII bias of
protocols, then any format that's close to ASCII is beneficial. UTF-8
fits that bill. SCSU and BOCU take too much processing time compared to
UTF-8, and UTF-16/32 take too much space given the assumptions.
Add to that the fact that data streams are often already in UTF-8, and
that format becomes the format of choice for applications with the
constraints mentioned. (As has been pointed out, the 'bloat' for CJK is
not a factor as long as the data contains a high proportion of ASCII.)
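To put rough numbers on the size argument, here is a quick Python sketch
(the log line below is an invented example, not data from this thread)
comparing the encoded size of an ASCII-dominated log record that happens
to carry a short CJK query string:

    # Sketch only: a hypothetical web-log record, mostly ASCII protocol text
    # with two CJK characters embedded in the query string.
    line = ('203.0.113.7 - - [19/Sep/2006:00:46:00 -0500] '
            '"GET /search?q=\u6771\u4eac HTTP/1.1" 200 5120')

    for form in ('utf-8', 'utf-16-le', 'utf-32-le'):
        print(form, len(line.encode(form)), 'bytes')

    # Typical outcome: the two CJK characters cost 6 bytes in UTF-8 instead
    # of 2, but the ASCII bulk stays at one byte per character, so the UTF-8
    # record remains far smaller than UTF-16 (2 bytes/char here) or UTF-32
    # (4 bytes/char).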
SCSU was developed for a system that had bandwidth limitations in
straight transmission. Except for communications to remote areas
(wilderness, marine, space, etc.), such severe limitations on
transmission bandwidth are a thing of the past, and even then, block
compression algorithms can often be used to good advantage.
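For illustration, a small Python sketch of that last point (zlib stands
in for whatever block compressor a given link would actually use; the
sample text is invented):

    import zlib

    # A hypothetical chunk of repetitive, ASCII-heavy protocol text in UTF-8.
    chunk = ('GET /index.html HTTP/1.1\r\nHost: example.org\r\n'
             'Accept: text/html\r\n\r\n') * 50
    raw = chunk.encode('utf-8')
    packed = zlib.compress(raw)

    # Block compression over the whole stream recovers most of the redundancy
    # without any Unicode-specific scheme in the data path.
    print(len(raw), 'bytes raw ->', len(packed), 'bytes compressed')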
As for mass-storage limitations (or the lack thereof), the rest of the
thread contains sufficient discussion.
A./