From: Mark E. Shoulson (mark@kli.org)
Date: Sun Feb 13 2011 - 10:46:43 CST
On 02/13/2011 09:59 AM, anbu@peoplestring.com wrote:
> Tamil letters ஙா(0B99+0BBE), ஙி(0B99+0BBF), ஙீ(0B99+0BC0), ஙு(0B99+0BC1),
> ஙூ(0B99+0BC2), ஙெ(0B99+0BC6), ஙே(0B99+0BC7), ஙை(0B99+0BC8), ஙொ(0B99+0BCA),
> ஙோ(0B99+0BCB), ஙௌ(0B99+0BCC), ஞி(0B9E+0BBF), ஞீ(0B9E+0BC0), ஞு(0B9E+0BC1),
> ஞூ(0B9E+0BC2), ஞெ(0B9E+0BC6), ஞே(0B9E+0BC7), ஞை(0B9E+0BC8), ஞொ(0B9E+0BCA),
> ஞோ(0B9E+0BCB), ஞௌ(0B9E+0BCC) are almost unused and most Tamil symbols less
> used. We can assign them to more bits instead of the 16 bits they are
> assigned to, as they are occupying space with almost no use.
>
Indeed. This is the basis for Huffman coding (see
http://en.wikipedia.org/wiki/Huffman_coding ). And it should be
considered when compressing text. But if you are suggesting that the
codings in Unicode be changed, that really won't work, for several reasons.
For one thing, Unicode has stability policies: code points that have
already been assigned are not going to change (even if the assignment is
actually wrong!). Too much depends on what is already done to allow that.
Also, Unicode is generally about assigning codes to characters, and the
simplest way to do that is to assign codes of the same length to
everything. This is not the most efficient way in terms of bit-length,
as you point out, but that isn't the point of Unicode. For efficiency
in those terms, there are compression algorithms, like Huffman coding
and others. And applying them per document makes sense, too. A single
Huffman code built over ALL of the Unicode characters, weighted by their
usage across the whole corpus as it stands now, would be very inefficient
when applied to individual documents. Under such a global code, a
document written in (say) Phags-Pa would probably take a lot more bits
per character than one written in ASCII, because Phags-Pa has much less
usage altogether. But if we do the Huffman coding *afterwards*, based
only on the character frequencies of that document, then the rarity of
Phags-Pa with respect to Latin letters no longer matters, and we wind up
with much shorter codes for the letters we are actually using.
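
To make the per-document idea concrete, here is a rough Python sketch
(not any standard tool, just an illustration). It builds a Huffman code
from the character frequencies of one text and reports the code length
of each character and the average bits per character. The function
names huffman_code_lengths and average_bits are made up for this
example, and it assumes the text is non-empty.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {character: code length in bits} for a Huffman code
    built only from the character frequencies of `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs at least one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tiebreak, {character: depth so far}).
    # The unique tiebreak keeps Python from ever comparing the dicts.
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every character inside
        # them ends up one level deeper in the code tree.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_bits(text):
    """Average Huffman code length per character of a non-empty text."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[ch] * lengths[ch] for ch in freq) / len(text)
```

Calling average_bits() on a text that uses only a handful of distinct
characters gives a small number of bits per character, however rare
those characters are in the world at large, because only the
frequencies inside that text enter the calculation.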
Those characters aren't "occupying space". They only occupy space when
you use them, which as you said is not very often.
~mark