I agree: the right approach is to look at some real data. Better still is to
look not at raw byte proportions per character, but at real UTF-8 text with
equivalent translated content. A number of translated pages are linked from
the page below; their sizes in bytes are:
http://www.unicode.org/unicode/standard/WhatIsUnicode.html
09,618 s-chinese.html
09,682 t-chinese.html
10,110 esperanto.html
10,279 maltese.html
10,475 icelandic.html
10,632 czech.html
10,660 welsh.html
10,808 danish.html
10,856 swedish.html
10,863 polish.html
10,864 spanish.html
10,955 interlingua.html
11,000 italian.html
11,038 lithuanian.html
11,044 portuguese.html
11,096 romanian.html
11,106 german.html
11,134 korean.html
11,281 french.html
11,462 japanese.html
13,892 persian.html
14,808 WhatIsUnicode.html*
14,028 greek.html
14,632 russian.html
15,218 hindi.html
15,853 deseret.html
16,069 georgian.html
18,185 arabic.html*
Hindi is about the same as Greek or Russian, and about 37% more than German.
But notice that, going by these figures, it would appear that the
Unicode Consortium is favoring Chinese over all European languages!
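The percentages above can be rechecked directly from the byte counts. Here is
a minimal sketch in Python; it uses only a handful of sizes copied from the
list, and the names SIZES, percent_more and compare_to are my own, chosen for
illustration.

```python
# Byte counts copied from the list of translated pages above.
# Only a few entries are included; add more as needed.
SIZES = {
    "s-chinese": 9618,
    "german": 11106,
    "greek": 14028,
    "russian": 14632,
    "hindi": 15218,
}


def percent_more(size, baseline):
    """Return how many percent larger `size` is than `baseline`."""
    return (size / baseline - 1.0) * 100.0


def compare_to(baseline_name, sizes=SIZES):
    """Map each page name to its size difference, in percent, from the baseline."""
    baseline = sizes[baseline_name]
    return {name: percent_more(size, baseline) for name, size in sizes.items()}


if __name__ == "__main__":
    # Print the pages from smallest to largest relative to German.
    for name, pct in sorted(compare_to("german").items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {pct:+6.1f}%")
```

Run against German, this puts Hindi at roughly 37% larger and Simplified
Chinese below zero, that is, smaller than the German page.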
Yet, as has been pointed out, even the comparisons here are not really
representative of total web content, since these pages have so few
graphics. With a higher proportion of images to text, the differences in
text size are completely swamped.
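To put rough numbers on the swamping effect, here is a small sketch. The page
sizes are the German and Hindi counts from the list; the image payloads
(50,000 and 200,000 bytes) are hypothetical figures picked for illustration,
not measurements of any real site, and overall_ratio is my own name.

```python
GERMAN_PAGE = 11106  # bytes, german.html from the list
HINDI_PAGE = 15218   # bytes, hindi.html from the list


def overall_ratio(page_a, page_b, shared_bytes):
    """Ratio of total download sizes when both pages carry the same extra bytes.

    `shared_bytes` stands for content such as images, whose size does not
    depend on the text encoding.
    """
    return (page_b + shared_bytes) / (page_a + shared_bytes)


if __name__ == "__main__":
    # Hypothetical image payloads, in bytes.
    for image_bytes in (0, 50_000, 200_000):
        ratio = overall_ratio(GERMAN_PAGE, HINDI_PAGE, image_bytes)
        print(f"images={image_bytes:>7,d} bytes  Hindi/German = {ratio:.3f}")
```

With no images the ratio is the 1.37 seen above; under these assumed payloads
it falls to about 1.07 and then to about 1.02.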
Mark
* The Arabic page has a lot of crufty HTML carried over from MS Word;
otherwise I would expect it to take about the same room as Persian.
* The English page (WhatIsUnicode.html) has an overstated byte count, since
it has the index on it.
—————
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
----- Original Message -----
From: "David Starner" <starner@okstate.edu>
To: "Aman Chawla" <creativezeal@hotmail.com>
Cc: "Unicode" <unicode@unicode.org>
Sent: Sunday, January 20, 2002 22:27
Subject: Re: Devanagari
> On Mon, Jan 21, 2002 at 12:39:58AM -0500, Aman Chawla wrote:
> > > What's your point in continuing this? Most of the people on this list
> > > already know how UTF-8 can expand the size of non-English text.
> >
> > The issue was originally brought up to gather opinion from members of this
> > list as to whether UTF-8 or ISCII should be used for creating Devanagari web
> > pages. The point is not to criticise Unicode but to gather opinions of
> > informed persons (list members) and determine what is the best encoding
> > for information interchange in South-Asian scripts...
>
> That's sort of like going into an Islamic shrine and asking who the one
> true god is. The answer they will give is predictable, and arguing
> about the answer will start to annoy people, especially if you don't
> seem to be listening.
>
> And you don't seem to be listening. The factor is not a factor of 3.
> UTF-16, which IE supports (I believe) and Netscape 6 supports, will give
> you a constant factor of 2. If you use UTF-8, HTML markup will
> make the factor considerably smaller, and if you have many graphics,
> their size will easily dwarf that of the text.
>
> For a comparison, yahoo.com sans graphics is 20k, 6k of text and 14k of
> HTML. A Devanagari page, therefore, should be about 32k (3 x 6k of text
> plus 14k of HTML), a factor of 1.6, not 3.
>
> --
> David Starner - starner@okstate.edu, dvdeug/jabber.com (Jabber)
> Pointless website: http://dvdeug.dhis.org
> When the aliens come, when the deathrays hum, when the bombers bomb,
> we'll still be freakin' friends. - "Freakin' Friends"
>
>
This archive was generated by hypermail 2.1.2 : Mon Jan 21 2002 - 11:35:50 EST