Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Theodore H. Smith (delete@elfdata.com)
Date: Fri Jun 02 2006 - 16:33:46 CDT

    Moore's law doesn't mean we should be more wasteful.

    It could just mean that we should waste less and have less
    ecological impact. Or it could mean that we just make more money.

    If you get a computer 4x as fast, then instead of switching from
    UTF-8 to UTF-32, you could maybe make 4x the money by having 4x
    the throughput.

    I think by the time we are citing Moore's law, this isn't really a
    Unicode discussion but a computing-in-general discussion...

    My original point was that UTF-8 can be used for more than it is
    given credit for. You can do lowercasing, uppercasing,
    normalisation, and just about anything else directly on UTF-8,
    without corruption or mistakes, and do it CPU-efficiently and far
    more space-efficiently.
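
    A minimal sketch of the byte-oriented idea, assuming only
    ASCII-range case mapping is needed (full Unicode casing needs
    lookup tables, but it works on the same principle; the function
    name here is made up, not taken from any particular library):

        #include <stddef.h>

        /* Lowercase the ASCII letters in a UTF-8 buffer, in place.
         * This is safe without decoding: in UTF-8, every byte of a
         * multi-byte sequence has its high bit set, so a byte in the
         * range 'A'..'Z' can only be a complete one-byte character. */
        void utf8_ascii_tolower(char *s, size_t len)
        {
            for (size_t i = 0; i < len; i++) {
                unsigned char c = (unsigned char)s[i];
                if (c >= 'A' && c <= 'Z')
                    s[i] = (char)(c + ('a' - 'A'));
                /* bytes >= 0x80 are left untouched */
            }
        }

    Because no byte of a multi-byte sequence can collide with an ASCII
    value, this never corrupts the non-ASCII text in the buffer.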

    Whether or not enough tools exist out there for doing all that on
    UTF-8 is another matter. I've built some tools for UTF-8, and I'd
    like to build more.

    And the other point is that a character (i.e. a grapheme cluster,
    what a user perceives as one character) is a string. So whatever
    you do, you'll need to be string processing, treating each
    character as a variable-length unit, so it might as well be a
    variable-length run of 8-bit units rather than of 32-bit ones...
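
    For instance, stepping from one code point to the next is a cheap,
    local decision on the lead byte alone (a rough sketch; the name
    utf8_seq_len is made up, and real code should also validate the
    continuation bytes):

        #include <stddef.h>

        /* Byte length of the UTF-8 sequence starting at s[0],
         * derived from the lead byte alone. */
        size_t utf8_seq_len(const unsigned char *s)
        {
            if (s[0] < 0x80)           return 1; /* ASCII           */
            if ((s[0] & 0xE0) == 0xC0) return 2; /* 2-byte sequence */
            if ((s[0] & 0xF0) == 0xE0) return 3; /* 3-byte sequence */
            if ((s[0] & 0xF8) == 0xF0) return 4; /* 4-byte sequence */
            return 1;                  /* invalid lead byte: skip it */
        }

    A character iterator then keeps consuming code points while the
    next one is, roughly speaking, a combining mark, which is exactly
    the kind of loop you would also need over UTF-32.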

    Therefore, I win the discussion. Thank you :)

    On 2 Jun 2006, at 21:34, Hans Aberg wrote:

    > For such uses, such as those below, it is probably better to find
    > more efficient compression techniques, rather than hoping that
    > UTF-8 should do the job. The original idea with UTF-8, coming from
    > UNIX developers, is compatibility with ASCII, not text compression.
    > In view of Moore's law
    > <http://en.wikipedia.org/wiki/Moore%27s_Law>, space will fairly
    > quickly be sufficient for UTF-32 in any given application.
    > Otherwise, the argument for UTF-8's space efficiency can also be
    > made in favor of UTF-32's time efficiency in typesetting programs
    > like TeX, where some people may want to plug in a whole
    > encyclopedia and get it compiled interactively in a fraction of a
    > second. So use UTF-8 or alternatively UTF-32, whichever tradeoff
    > is most practical and efficient for your needs at hand.
    >
    >
    > On 2 Jun 2006, at 19:38, John D. Burger wrote:
    >
    >> Stephane Bortzmeyer wrote:
    >>
    >>> Show me someone who can fill a modern hard disk with only raw text
    >>> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256
    >>> would not do it.
    >>
    >> Huh? There's a lot of text out there. I'm pretty sure that
    >> Google's cache fills far more than one hard disk, for instance.
    >>
    >> For a personal example, I do research with this text collection:
    >>
    >> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
    >>
    >> In UTF-32, this would take up close to 50 gigabytes, one-tenth of
    >> the disk on my machine. And LDC has dozens of such collections,
    >> although Gigaword is probably one of the biggest, and I'm
    >> typically only working with a handful at a time.
    >>
    >> I'm also about to begin some work on Wikipedia. The complete
    >> English dump, with all page histories, which is what I'm
    >> interested in, takes up about a terabyte. In UTF-8.
    >>
    >> - John D. Burger
    >> MITRE

    --
    http://elfdata.com/plugin/
    

