From: Hans Aberg (haberg@math.su.se)
Date: Fri Jun 02 2006 - 15:34:56 CDT
For uses such as those below, it is probably better to find more
efficient compression techniques rather than hoping that UTF-8 will
do the job. The original idea behind UTF-8, coming from UNIX
developers, was compatibility with ASCII, not text compression. In
view of Moore's law <http://en.wikipedia.org/wiki/Moore%27s_Law>,
space will fairly quickly become sufficient for UTF-32 in any given
application. Conversely, the argument made for UTF-8's space
efficiency can also be made in favor of UTF-32's time efficiency in
typesetting programs like TeX, where some people may want to plug in
a whole encyclopedia and have it compiled interactively in a fraction
of a second. So use UTF-8 or UTF-32, whichever tradeoff is most
practical and efficient for the needs at hand.
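As a rough illustration (a quick Python sketch; the sample string and
the choice of zlib as the compressor are arbitrary), one can print the
byte counts of the same text in UTF-8 and UTF-32, with and without a
general-purpose compressor on top, to see where the actual space
savings come from:

    import zlib

    # Arbitrary sample text: mostly ASCII with a few non-ASCII characters.
    sample = ("Unicode is just raw text. " * 1000) + "naïve café 漢字"

    utf8 = sample.encode("utf-8")
    utf32 = sample.encode("utf-32-le")  # fixed 4 bytes per code point, no BOM

    print("UTF-8 bytes:       ", len(utf8))
    print("UTF-32 bytes:      ", len(utf32))
    print("UTF-8 + zlib bytes:", len(zlib.compress(utf8)))
    print("UTF-32 + zlib byte:", len(zlib.compress(utf32)))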
On 2 Jun 2006, at 19:38, John D. Burger wrote:
> Stephane Bortzmeyer wrote:
>
>> Show me someone who can fill a modern hard disk with only raw text
>> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256
>> would not do it.
>
> Huh? There's a lot of text out there. I'm pretty sure that
> Google's cache fills far more than one hard disk, for instance.
>
> For a personal example, I do research with this text collection:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
>
> In UTF-32, this would take up close to 50 gigabytes, one-tenth of
> the disk on my machine. And LDC has dozens of such collections,
> although Gigaword is probably one of the biggest, and I'm typically
> only working with a handful at a time.
>
> I'm also about to begin some work on Wikipedia. The complete
> English dump, with all page histories, which is what I'm interested
> in, takes up about a terabyte. In UTF-8.
>
> - John D. Burger
> MITRE
>
>