RE: A UTF-8 based News Service

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Fri Jul 13 2001 - 12:48:30 EDT


> From: DougEwell2@cs.com [mailto:DougEwell2@cs.com]

> Raw UTF-8 4,382,592
> Zipped UTF-8 2,264,152 (52% of raw UTF-8)
> Raw SCSU 1,179,688 (27% of raw UTF-8)
> Zipped SCSU 104,316 (9% of raw SCSU, < 5% of zipped UTF-8)

        The data set is truly pathological. Since it is in code point
order, there are patterns in it which are probably being exploited. Why not
download some of the articles from a certain UTF-8 based news website and
run them through the tests?
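
        For reference, a minimal sketch of that kind of size test on
ordinary article text (the "articles" directory and file names are
placeholders, and SCSU is left out because no SCSU codec ships with Python's
standard library):

import zlib
from pathlib import Path

def report(path: Path) -> None:
    raw = path.read_bytes()                 # raw UTF-8 bytes
    zipped = zlib.compress(raw, level=9)    # general-purpose Deflate
    print(f"{path.name}: raw {len(raw):,} bytes, "
          f"zipped {len(zipped):,} bytes "
          f"({100 * len(zipped) / len(raw):.0f}% of raw)")

for article in Path("articles").glob("*.txt"):  # hypothetical download dir
    report(article)

Running this over real articles rather than a code-point-ordered data set
would show whether the quoted ratios hold up on natural text.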

        Side note on compression: Specialized compression methods tend to
carry a bit of maintenance overhead. Generalized compression methods tend,
in practice, to be better because they squeeze extra compression out of data
that would otherwise not be worth compressing (need more disk space? don't
get a specialized compression routine for your biggest file - just compress
the whole drive!). The general solution for the internet is IPComp (IP
Payload Compression), which compresses the payload of IP packets. I am not
sure what state of development it is in, but if it is implementable now, I
would highly recommend testing its performance on Unicode web data. I expect
the results to be comparable to specialized techniques, and you'd get a
transparent, i.e. low maintenance, solution (IPComp is designed so that
hosts that don't support it are unaffected by it).
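
        To make that idea concrete, here is a small user-space sketch (this
is not the real IPComp protocol, which operates at the IP layer, typically
alongside IPsec; it just shows a generic compressor applied per payload,
falling back to the original bytes whenever compression does not help, so
receivers that expect uncompressed data are unaffected):

import zlib

def compress_payload(payload: bytes) -> tuple[bytes, bool]:
    """Return (data, compressed_flag); keep the original if Deflate loses."""
    candidate = zlib.compress(payload)
    if len(candidate) < len(payload):
        return candidate, True
    return payload, False

def decompress_payload(data: bytes, compressed: bool) -> bytes:
    return zlib.decompress(data) if compressed else data

# Example: a short UTF-8 payload round-trips unchanged.
text = "Unicode news article text, repeated a few times. " * 20
data, flag = compress_payload(text.encode("utf-8"))
assert decompress_payload(data, flag).decode("utf-8") == text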

/|/|ike


