Re: Tamil

From: Mark E. Shoulson (mark@kli.org)
Date: Sun Feb 13 2011 - 10:46:43 CST

    On 02/13/2011 09:59 AM, anbu@peoplestring.com wrote:
    > Tamil letters ஙா(0B99+0BBE), ஙி(0B99+0BBF), ஙீ(0B99+0BC0), ஙு(0B99+0BC1),
    > ஙூ(0B99+0BC2), ஙெ(0B99+0BC6), ஙே(0B99+0BC7), ஙை(0B99+0BC8), ஙொ(0B99+0BCA),
    > ஙோ(0B99+0BCB), ஙௌ(0B99+0BCC), ஞி(0B9E+0BBF), ஞீ(0B9E+0BC0), ஞு(0B9E+0BC1),
    > ஞூ(0B9E+0BC2), ஞெ(0B9E+0BC6), ஞே(0B9E+0BC7), ஞை(0B9E+0BC8), ஞொ(0B9E+0BCA),
    > ஞோ(0B9E+0BCB), ஞௌ(0B9E+0BCC) are almost unused and most Tamil symbols less
    > used. We can assign them to more bits instead of the 16 bits they are
    > assigned to, as they are occupying space with almost no use.
    >
    Indeed. This is the basis for Huffman coding (see
    http://en.wikipedia.org/wiki/Huffman_coding ), and it should be
    considered when compressing text. But if you are suggesting that the
    code assignments in Unicode be changed, that really won't work, for
    several reasons.

    For one thing, Unicode has stability policies: nothing that has already
    been assigned is going to be changed (even if it's actually wrong!).
    Too much depends on what is already done to allow that.

    Also, Unicode is generally about assigning codes to characters, and the
    simplest way to do that is to assign codes of the same length to
    everything. This is not the most efficient way in terms of bit-length,
    as you point out, but that isn't the point of Unicode. For efficiency
    in those terms, there are compression algorithms, like Huffman coding
    and others. And that makes sense, too. Doing a general Huffman coding
    over ALL of the Unicode characters, based on their usage across the
    whole corpus as it stands now, would be very inefficient when applied
    to individual documents. Under such a global code, a document written
    in (say) Phags-Pa would probably take a lot more bits per character
    than one written in ASCII, because Phags-Pa has much less usage
    altogether. But if we do the Huffman coding *afterwards*, based only on
    the character frequencies of that document, then the rarity of Phags-Pa
    with respect to Latin letters no longer matters, and we wind up with
    much shorter codes for the letters we are actually using.
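
    To make the comparison concrete, here is a minimal sketch in Python.
    The "corpus" and the "document" are made-up strings chosen only for
    illustration (a Latin-heavy corpus and a short Tamil document), and
    the helper names huffman_code_lengths and encoded_bits are my own.
    It builds one Huffman code from the whole corpus and another from the
    document alone, then counts the bits each needs for the document.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text, lengths):
    """Total bits needed to encode text with the given code lengths."""
    return sum(lengths[ch] for ch in text)


# Hypothetical data: a short Tamil document inside a Latin-heavy corpus.
document = "தமிழ் தமிழ் தமிழ்"
corpus = "the quick brown fox jumps over the lazy dog " * 200 + document

global_lengths = huffman_code_lengths(Counter(corpus))
local_lengths = huffman_code_lengths(Counter(document))

global_bits = encoded_bits(document, global_lengths)
per_document_bits = encoded_bits(document, local_lengths)

print("fixed 16 bits per code unit:", 16 * len(document))
print("Huffman code from whole corpus:", global_bits)
print("Huffman code from this document:", per_document_bits)
```

    The per-document code can never need more bits than the corpus-wide
    one for the same document, since a Huffman code is optimal for the
    frequencies it was built from. A real compressor would also have to
    store or transmit the code table, which this sketch ignores.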

    Those characters aren't "occupying space". They only occupy space when
    you use them, which as you said is not very often.
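
    A small illustration of that point, assuming Python and UTF-16 as
    the 16-bit encoding under discussion: the size of a text depends only
    on the characters it actually contains. The rare syllable below is
    0B99+0BBE from the list quoted above; the variable names are mine.

```python
plain = "hello"
rare_syllable = "\u0b99\u0bbe"  # U+0B99 + U+0BBE, from the quoted list
with_rare = plain + rare_syllable

# The assigned-but-unused code points cost nothing in `plain`.
plain_size = len(plain.encode("utf-16-le"))
# Each of the two code points adds one 16-bit code unit (2 bytes).
with_rare_size = len(with_rare.encode("utf-16-le"))

print(plain_size, with_rare_size)
```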

    ~mark


