Re: Devanagari

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Jan 20 2002 - 18:34:29 EST


At 12:48 AM 1/20/02 -0800, James Kass wrote:
>The arguments about relative size are true, but in this day and age are
>considered unimportant. Graphics files are extremely large in comparison
>with text files of any script and so are sound files. Devanagari UTF-8 is
>three bytes. The four byte UTF-8 sequences so far are only used for
>Plane One Unicode and up.

If the argument refers to 4-byte sequences for Devanagari, it is not
factually 'true', as James points out.
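
The byte counts are easy to check. The following is a minimal sketch; the sample characters and the helper name utf8_length are chosen purely for illustration, with a Gothic letter standing in for Plane One.

```python
# Count the UTF-8 bytes needed for single code points.
# The sample characters are illustrative choices, not from the thread.
samples = {
    "LATIN CAPITAL LETTER A": "\u0041",      # ASCII
    "DEVANAGARI LETTER KA": "\u0915",        # Devanagari block, U+0900..U+097F
    "GOTHIC LETTER AHSA": "\U00010330",      # Plane One
}

def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

Every code point in the Devanagari block falls in the range U+0800..U+FFFF, which is the range UTF-8 encodes in three bytes.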

More to the point is the following observation: HTML or similar mark-up
languages account for an ever growing percentage of transmission of
"text" - even in e-mail.

The fact that UTF-8 economizes on storage for ASCII characters is a
benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
claims a significant fraction of the data.

A UTF-8 encoded HTML file will therefore have (percentage-wise) less overhead
for Devanagari than claimed. Add to that James' observation on graphics files,
many of which accompany even the simplest HTML documents, and you get a
percentage difference between the sizes of an English and a Devanagari website
(i.e. in its entirety) that's well within the fluctuation of the typical
length, in characters, needed to express the same concept in different languages.
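
To make the dilution effect concrete, here is a rough sketch. The page, the sample text, and the "8-bit" figure (an idealized one byte per character, standing in for an ISCII-style encoding) are all assumptions for illustration, not measurements of real sites.

```python
# Hypothetical page: ASCII markup wrapped around Devanagari body text.
MARKUP = (
    "<html><head><title>Sample</title></head>"
    '<body><p class="greeting"></p></body></html>'
)
TEXT = "\u0928\u092E\u0938\u094D\u0924\u0947 " * 20  # illustrative sample text

def byte_sizes(s: str) -> dict:
    """Byte counts for s under UTF-8, UTF-16, and an idealized 8-bit encoding."""
    return {
        "utf-8": len(s.encode("utf-8")),
        "utf-16": len(s.encode("utf-16-le")),  # no byte order mark
        "8-bit": len(s),  # assumption: exactly one byte per character
    }

def utf8_overhead(s: str) -> float:
    """Ratio of the UTF-8 size to the idealized 8-bit size."""
    sizes = byte_sizes(s)
    return sizes["utf-8"] / sizes["8-bit"]

text_only = utf8_overhead(TEXT)
whole_page = utf8_overhead(MARKUP + TEXT)
print(f"text only:  {text_only:.2f}x")
print(f"whole page: {whole_page:.2f}x")
```

The more markup a page carries, the closer the whole-page ratio gets to 1, and that is before any graphics files are counted.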

In other words, contrary to the claims made by the argument, there is little
reason to expect that this structure of UTF-8 will have an observable impact
on exchanging data - other than a psychological one, perhaps.

In many size-constrained application areas it may pay off to do compression.
http://www.unicode.org/unicode/reports/tr6 shows how one can compress
Unicode data in Devanagari to a size comparable to that of 8-bit ISCII.
However, interchange of this format (SCSU) requires consenting parties.
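
As a rough illustration of why SCSU gets close to one byte per character, here is a deliberately restricted sketch based on my reading of the report. It handles only ASCII and the Devanagari block, uses a single dynamic window, and is not a general SCSU encoder; the function name is hypothetical.

```python
# Restricted SCSU sketch: single-byte mode, one dynamic window on Devanagari.
SD0 = 0x18                               # tag: define dynamic window 0 and select it
DEVANAGARI_BASE = 0x0900                 # start of the Devanagari block
WINDOW_INDEX = DEVANAGARI_BASE // 0x80   # 0x12; window offset is index * 0x80 here
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # control codes passed through unchanged

def scsu_encode_devanagari(text: str) -> bytes:
    """Encode ASCII plus Devanagari text; raise ValueError for anything else."""
    out = bytearray([SD0, WINDOW_INDEX])  # two bytes of setup, paid once
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH:
            out.append(cp)                            # ASCII stays one byte
        elif DEVANAGARI_BASE <= cp < DEVANAGARI_BASE + 0x80:
            out.append(0x80 + (cp - DEVANAGARI_BASE)) # window-relative byte
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's repertoire")
    return bytes(out)

sample = "\u0928\u092E\u0938\u094D\u0924\u0947"  # illustrative Devanagari word
print(len(scsu_encode_devanagari(sample)), "bytes as SCSU (this sketch)")
print(len(sample.encode("utf-8")), "bytes as UTF-8")
```

After the two-byte window definition, each Devanagari character costs one byte, which is where the comparison with ISCII comes from. A receiver still has to know it is getting SCSU, which is the "consenting parties" caveat.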

A./


