From: Hans Aberg (haberg@math.su.se)
Date: Fri Jun 02 2006 - 15:34:56 CDT
For uses such as those below, it is probably better to find more
efficient compression techniques rather than hoping that UTF-8 will
do the job. The original idea behind UTF-8, coming from UNIX
developers, was compatibility with ASCII, not text compression. In
view of Moore's law <http://en.wikipedia.org/wiki/Moore%27s_Law>,
space will fairly quickly become sufficient for UTF-32 in any given
application. Conversely, the argument made for UTF-8's space
efficiency can also be made in favor of UTF-32's time efficiency in
typesetting programs like TeX, where some people may want to plug in
a whole encyclopedia and have it compiled interactively in a fraction
of a second. So use UTF-8 or UTF-32, whichever tradeoff is most
practical and efficient for the needs at hand.
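As a rough illustration (a quick Python sketch; the sample string and
the choice of zlib as the compressor are arbitrary), one can print the
byte counts of the same text in UTF-8 and UTF-32, with and without a
general-purpose compressor on top, to see where the actual space
savings come from:

    import zlib

    # Arbitrary sample text: mostly ASCII with a few non-ASCII characters.
    sample = ("Unicode is just raw text. " * 1000) + "naïve café 漢字"

    utf8 = sample.encode("utf-8")
    utf32 = sample.encode("utf-32-le")  # fixed 4 bytes per code point, no BOM

    print("UTF-8 bytes:       ", len(utf8))
    print("UTF-32 bytes:      ", len(utf32))
    print("UTF-8 + zlib bytes:", len(zlib.compress(utf8)))
    print("UTF-32 + zlib byte:", len(zlib.compress(utf32)))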
On 2 Jun 2006, at 19:38, John D. Burger wrote:
> Stephane Bortzmeyer wrote:
>
>> Show me someone who can fill a modern hard disk with only raw text
>> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256
>> would not do it.
>
> Huh? There's a lot of text out there. I'm pretty sure that
> Google's cache fills far more than one hard disk, for instance.
>
> For a personal example, I do research with this text collection:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
>
> In UTF-32, this would take up close to 50 gigabytes, one-tenth of
> the disk on my machine. And LDC has dozens of such collections,
> although Gigaword is probably one of the biggest, and I'm typically
> only working with a handful at a time.
>
> I'm also about to begin some work on Wikipedia. The complete
> English dump, with all page histories, which is what I'm interested
> in, takes up about a terabyte. In UTF-8.
>
> - John D. Burger
> MITRE
>
>