Unicode for words?

From: Tim Finney (tfinney@reltech.org)
Date: Sun Dec 05 2004 - 02:27:45 CST

    Dear All

    This is off topic, so feel free to ignore it.

    The other day I was telling a co-worker about Unicode and how the UTF-8
    encoding system works. During the far-ranging discussions that followed
    (we are public servants), my co-worker suggested encoding entire words
    in Unicode.

    This sounds like heresy to all of us who know that Unicode is meant only
    for characters. But wait a minute... Aren't there a whole lot of
    codepoints that will never be used? 2^31 is a big number. I imagine that
    it could contain all of the words of all of the languages as well as all
    of their characters. According to Markus Kuhn's Unicode FAQ
    (http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that
    there will never be characters assigned outside the 21-bit code space
    from 0x000000 to 0x10FFFF, which covers a bit over one million potential
    future characters".
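
    To make that concrete, here is a rough Python sketch of the original
    six-byte UTF-8 scheme from RFC 2279, which reached the whole 31-bit
    range before RFC 3629 restricted UTF-8 to U+10FFFF. This is only my
    own illustration of the mechanics, not anything from the standard:

    def utf8_encode(cp: int) -> bytes:
        # Original (RFC 2279) UTF-8: one to six bytes, up to 2^31 - 1.
        # Modern UTF-8 (RFC 3629) stops at U+10FFFF / four bytes.
        if cp < 0x80:
            return bytes([cp])                       # plain ASCII, one byte
        # (sequence length, lead-byte marker, exclusive upper limit)
        for nbytes, mark, limit in [(2, 0xC0, 0x800),
                                    (3, 0xE0, 0x10000),
                                    (4, 0xF0, 0x200000),
                                    (5, 0xF8, 0x4000000),
                                    (6, 0xFC, 0x80000000)]:
            if cp < limit:
                tail = []
                for _ in range(nbytes - 1):
                    tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx continuation
                    cp >>= 6
                return bytes([mark | cp] + tail[::-1])
        raise ValueError("code point out of 31-bit range")

    print(utf8_encode(0x10FFFF).hex())    # f48fbfbf     -- four bytes
    print(utf8_encode(0x7FFFFFFF).hex())  # fdbfbfbfbfbf -- six bytes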

    So here is the idea: why not use the unused part (2^31 - 2^21 =
    2,145,386,496 code points) to encode all the words of all the
    languages as well? You could then send any word in a few bytes,
    reducing the bandwidth needed to send text. (You need at most six
    bytes to address all 2^31 code points, and with a knowledge of word
    frequencies you could assign the most frequently used words to code
    points that require fewer bytes.) Whether text represents a
    significant proportion of bandwidth use is an open question, but
    because bandwidth = money, this idea could save quite a lot even if
    text is only a small fraction of the total. Phone companies could use
    encoded words for transmitting SMS messages, thereby saving money on
    new mobile tower installations, although they are going to put in 3G
    (video-capable) anyway.
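
    To put a toy version of the frequency idea in code (again my own
    sketch, with invented names like WORD_BASE, not any real registry):
    the unassigned code points from 0x110000 up to 0x1FFFFF still fit in
    four UTF-8 bytes, so the 983,040 most frequent words would cost four
    bytes each, and everything after that five or six.

    from collections import Counter

    WORD_BASE = 0x110000   # hypothetical: first code point above U+10FFFF

    def utf8_length(cp: int) -> int:
        # Byte cost of a code point under the 31-bit scheme sketched above.
        for nbytes, limit in [(1, 0x80), (2, 0x800), (3, 0x10000),
                              (4, 0x200000), (5, 0x4000000), (6, 0x80000000)]:
            if cp < limit:
                return nbytes
        raise ValueError("code point out of 31-bit range")

    def assign_word_codes(words: list[str]) -> dict[str, int]:
        # Most frequent words get the lowest, hence cheapest, code points.
        ranked = Counter(words).most_common()
        return {word: WORD_BASE + rank for rank, (word, _) in enumerate(ranked)}

    table = assign_word_codes("the interoperability of the system".split())
    cp = table["interoperability"]
    print(hex(cp), utf8_length(cp))   # e.g. 0x110001 4 -- vs 16 bytes spelled out

    The catch, of course, is that every code point above 0xFFFF costs at
    least four bytes, so a short common word like "the" (three bytes of
    plain ASCII) actually gets longer; the saving only shows up for
    longer words, or with some variable-length scheme other than UTF-8.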

    All of the machinery (Unicode, UTF-8, web crawlers that can work out
    what words are used most often) is already there.

    Surely someone has already thought of this? If not, my co-worker, Zack
    Alach, deserves the kudos.

    Best

    Tim Finney


