Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: John D. Burger (john@mitre.org)
Date: Fri Jun 02 2006 - 12:38:48 CDT


    Stephane Bortzmeyer wrote:

    > Show me someone who can fill a modern hard disk with only raw text
    > (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256 would
    > not do it.

    Huh? There's a lot of text out there. I'm pretty sure that Google's
    cache fills far more than one hard disk, for instance.

    For a personal example, I do research with this text collection:

       http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

    In UTF-32, this would take up close to 50 gigabytes, one-tenth of the
    disk on my machine. And LDC has dozens of such collections, although
    Gigaword is probably one of the biggest, and I'm typically only working
    with a handful at a time.
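    The arithmetic behind that estimate is easy to check: UTF-32 spends
    exactly four bytes on every code point, so mostly-ASCII English text
    roughly quadruples compared to UTF-8. A quick Python sketch (the
    sample string is just a stand-in for a large corpus):

    ```python
    # Compare storage cost of the same text in UTF-8 vs. UTF-32.
    sample = "The quick brown fox jumps over the lazy dog."

    utf8_bytes = len(sample.encode("utf-8"))
    utf32_bytes = len(sample.encode("utf-32-be"))  # BE variant avoids the 4-byte BOM

    print(utf8_bytes, utf32_bytes)      # 44 176
    print(utf32_bytes / utf8_bytes)     # 4.0 for pure-ASCII text
    ```

    Non-ASCII text narrows the gap somewhat, since those code points
    cost two to four bytes in UTF-8, but for English corpora the 4x
    figure is a good rule of thumb.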

    I'm also about to begin some work on Wikipedia. The complete English
    dump with all page histories, which is what I'm interested in, takes
    up about a terabyte in UTF-8.

    - John D. Burger
       MITRE



    This archive was generated by hypermail 2.1.5 : Fri Jun 02 2006 - 13:21:35 CDT