From: John D. Burger (john@mitre.org)
Date: Fri Jun 02 2006 - 12:38:48 CDT
Stephane Bortzmeyer wrote:
> Show me someone who can fill a modern hard disk with only raw text
> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256 would
> not do it.
Huh? There's a lot of text out there. I'm pretty sure that Google's
cache fills far more than one hard disk, for instance.
For a personal example, I do research with this text collection:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
In UTF-32, which spends four bytes on every code point, this would
take up close to 50 gigabytes, one-tenth of the disk on my machine.
And the LDC has dozens of such collections, although Gigaword is
probably one of the biggest, and I typically work with only a handful
at a time.
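For concreteness, here is a minimal back-of-the-envelope sketch of
that arithmetic in Python (the 12 GB UTF-8 corpus size below is my
own rough assumption, not an official LDC figure):

    # UTF-32 spends a fixed 4 bytes per code point, so mostly-ASCII
    # English text roughly quadruples relative to UTF-8.
    sample = "Show me someone who can fill a modern hard disk."
    utf8_size = len(sample.encode("utf-8"))
    utf32_size = len(sample.encode("utf-32-be"))  # -be avoids the BOM
    print(utf32_size / utf8_size)  # exactly 4.0 for pure-ASCII text

    # Scaling that ratio up: an assumed ~12 GB of ASCII text lands
    # near 48 GB in UTF-32, hence the "close to 50 gigabytes" above.
    print(12 * 4, "GB")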
I'm also about to begin some work on Wikipedia. The complete English
dump, with all page histories, which is what I'm interested in, takes
up about a terabyte. In UTF-8.
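(If you're wondering why the UTF-8 figure is the meaningful one for a
mostly-English dump, here's a hedged sketch of per-character costs,
again in Python: ASCII is 1 byte per code point in UTF-8, while
UTF-32 is always 4.)

    # UTF-8 costs 1-4 bytes per code point depending on the script,
    # so a mostly-English dump stays close to 1 byte per character.
    for ch in ("a", "é", "中", "𝕌"):
        print(repr(ch),
              len(ch.encode("utf-8")), "byte(s) in UTF-8 vs",
              len(ch.encode("utf-32-be")), "bytes in UTF-32")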
- John D. Burger
MITRE