From: John D. Burger (john@mitre.org)
Date: Fri Jun 02 2006 - 12:38:48 CDT
Stephane Bortzmeyer wrote:
> Show me someone who can fill a modern hard disk with only raw text
> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256 would
> not do it.
Huh? There's a lot of text out there. I'm pretty sure that Google's
cache fills far more than one hard disk, for instance.
For a personal example, I do research with this text collection:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
In UTF-32, which spends four bytes on every code point, this would
take up close to 50 gigabytes, one-tenth of the disk on my machine.
And the LDC has dozens of such collections, although Gigaword is
probably one of the biggest, and I typically work with only a handful
at a time.
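For concreteness, here is a minimal back-of-the-envelope sketch of
that arithmetic in Python (the 12 GB UTF-8 corpus size below is my
own rough assumption, not an official LDC figure):

    # UTF-32 spends a fixed 4 bytes per code point, so mostly-ASCII
    # English text roughly quadruples relative to UTF-8.
    sample = "Show me someone who can fill a modern hard disk."
    utf8_size = len(sample.encode("utf-8"))
    utf32_size = len(sample.encode("utf-32-be"))  # -be avoids the BOM
    print(utf32_size / utf8_size)  # exactly 4.0 for pure-ASCII text

    # Scaling that ratio up: an assumed ~12 GB of ASCII text lands
    # near 48 GB in UTF-32, hence the "close to 50 gigabytes" above.
    print(12 * 4, "GB")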
I'm also about to begin some work on Wikipedia. The complete English
dump, with all page histories, which is what I'm interested in, takes
up about a terabyte. In UTF-8.
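(If you're wondering why the UTF-8 figure is the meaningful one for a
mostly-English dump, here's a hedged sketch of per-character costs,
again in Python: ASCII is 1 byte per code point in UTF-8, while
UTF-32 is always 4.)

    # UTF-8 costs 1-4 bytes per code point depending on the script,
    # so a mostly-English dump stays close to 1 byte per character.
    for ch in ("a", "é", "中", "𝕌"):
        print(repr(ch),
              len(ch.encode("utf-8")), "byte(s) in UTF-8 vs",
              len(ch.encode("utf-32-be")), "bytes in UTF-32")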
- John D. Burger
MITRE