From: Hans Aberg (haberg@math.su.se)
Date: Tue Sep 19 2006 - 06:47:11 CDT
On 19 Sep 2006, at 07:46, Asmus Freytag wrote:
> In such situations, you cannot afford to compress/uncompress, as
> most data is seen only once.
Sure you can; you merely cannot base the compression on the whole of
the data. So either divide it into subpackets, or make an assumption
about what the statistical proportions might be. This is not as
efficient as compressing the whole data at once, but modems, streaming
video and the like use compression techniques, so it is surely
possible to do it on the fly on a stream.
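For illustration, a rough sketch in Python of packet-by-packet
compression of a stream; zlib here is only a stand-in for whatever
codec a modem or video link would actually use, and the chunk
boundaries are an assumption:

    import zlib

    def compress_stream(chunks):
        """Compress a stream packet by packet, flushing after each one so
        the receiver can decode as data arrives rather than at the end."""
        compressor = zlib.compressobj()
        for chunk in chunks:
            # A sync flush emits everything compressed so far at a byte
            # boundary; the ratio suffers a little compared to compressing
            # the whole of the data in one go.
            yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        yield compressor.flush(zlib.Z_FINISH)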
> Finally, if most (much) of your data is ASCII due to the ASCII bias
> of protocols, then any format that's close to ASCII is beneficial.
> UTF-8 fits that bill. SCSU and BOCU take too much processing time
> compared to UTF-8, and UTF-16/32 take too much space given the
> assumptions.
>
> Add to that the fact that often data streams are already in UTF-8,
> and that format becomes the format of choice for applications that
> have the constraints mentioned. (As has been pointed out, the
> 'bloat' for CJK is not a factor as long as the data always contains
a high proportion of ASCII.)
It is probably more efficient to translate the stream into code
points and then apply a compression technique to that, because the
full character structure is then taken into account. It then does not
matter which character encoding is used.
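A rough sketch of that idea, again with zlib merely standing in for a
real entropy coder: decode whatever encoding arrives into code points,
serialise those as fixed-width integers, and compress that.

    import zlib

    def compress_code_points(text):
        # Each code point becomes a fixed-width 32-bit integer, so the
        # compressor sees character structure rather than the byte-level
        # quirks of UTF-8, UTF-16 or UTF-32.
        raw = b"".join(cp.to_bytes(4, "little") for cp in map(ord, text))
        return zlib.compress(raw)

    def decompress_code_points(blob):
        raw = zlib.decompress(blob)
        return "".join(chr(int.from_bytes(raw[i:i + 4], "little"))
                       for i in range(0, len(raw), 4))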
Hans Aberg