From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 16 2007 - 14:35:03 CDT
Daniel Johnson asked:
> I have a question about how much space Unicode takes up.
On web pages, text typically takes up only a small fraction of
the overall storage space for the HTML and other content for
those pages.
> I am working on a HTML project in multiple languages. Each
> of these web pages have to be stored on a chip with limited space.
Take the Unicode home page (www.unicode.org) as an example.
That page has a little over 2000 characters of text content
displayed on it currently. But the page size of the HTML
(today) is 24,706 bytes. And when it is displayed, it also
loads the Unicode logo (1111 bytes), the two jpgs for the
book and conference (20,711 bytes and 69,906 bytes, respectively)
and loads two member logos that vary from 1K to about 6K in
size.
So for roughly 2000 characters of text content, you have
roughly 120,000 bytes worth of HTML structure and graphics,
even on a page which is relatively devoid of fancy graphic
devices (no Flash or anything of that sort).
Now consider translating that page into Chinese. Assuming the
original text content were expressed in UTF-8 -- since most
of it is ASCII, it takes a little over 2000 bytes in the HTML.
Chinese would, on average, take about a third as many characters
to express the same content, but each character would require
3 bytes in UTF-8. So in Chinese, you might end up requiring
about 700 characters x 3 bytes each, or roughly 2100 bytes
in the HTML. Essentially no effective difference in text size
overall.
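As a rough check on that arithmetic, here is a minimal Python sketch
that compares character counts with UTF-8 byte counts. The helper name
utf8_size and the two sample strings are my own illustrative choices,
not text taken from the Unicode home page.

```python
def utf8_size(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# ASCII text: one byte per character in UTF-8.
print(utf8_size("hello world"))

# Common Chinese characters (BMP ideographs): three bytes each in UTF-8.
print(utf8_size("你好世界"))

# The estimate from the text: about 700 characters at 3 bytes each.
estimated_chinese_bytes = 700 * 3
print(estimated_chinese_bytes)
```

Fewer characters at more bytes per character comes out close to the
size of the ASCII original, which is the point of the estimate above.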
> Is there any way to "compact" the HTML scripts in order to save
> space on the chip?
I'm sure there are. Someone else familiar with embedded
applications might be able to speak to that.
> Or is there a different call number for a character which
> will take up less space in hex?
I presume (but am not certain) that you are referring to the
numerical character references for Unicode characters.
If you just use UTF-8 directly, for example, then the
character U+2022 would be expressed in 3 bytes in the HTML.
If you use a numeric reference instead, that would be
"&#x2022;", which is 8 bytes of ASCII.
You could save one byte by using a decimal numeric character
reference, instead of a hexadecimal one: "&#8226;" or 7
bytes of ASCII. But you'd be back to 8 bytes for Chinese
characters, for example, because the decimal values get
larger than 9999 for those.
In general, you would be much better off just keeping your
web pages in UTF-8 and avoiding numeric character references,
if you are counting bytes for the text on the page.
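To make the byte counting concrete, here is a small Python sketch that
measures one character three ways: as raw UTF-8, as a hexadecimal
numeric character reference, and as a decimal one. The function name
encoded_sizes is hypothetical, and U+4E2D is just one Chinese character
picked for illustration.

```python
def encoded_sizes(ch):
    """Byte counts for one character as UTF-8, hex NCR, and decimal NCR."""
    cp = ord(ch)
    hex_ref = "&#x{:X};".format(cp)   # hexadecimal numeric character reference
    dec_ref = "&#{};".format(cp)      # decimal numeric character reference
    return {
        "utf8": len(ch.encode("utf-8")),
        "hex_ref": len(hex_ref.encode("ascii")),
        "dec_ref": len(dec_ref.encode("ascii")),
    }


# U+2022 BULLET and U+4E2D, a common Chinese character.
for ch in ("\u2022", "\u4e2d"):
    print("U+{:04X}".format(ord(ch)), encoded_sizes(ch))
```

For both characters the raw UTF-8 form is 3 bytes, well under either
form of numeric reference.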
But do a realistic assessment of how much of your HTML is
not basic plain text content before you start worrying too
much about how much of a storage penalty your pages will
have for translating them into multiple languages and
using Unicode as your character set.
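One way to do that assessment is to measure how many bytes of a page
are visible text versus everything else. The following is only a
sketch using Python's standard html.parser module; the names
TextCounter and text_share are my own, and it ignores images and other
external resources, which would have to be added separately.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def text_share(html):
    """Return (text_bytes, total_bytes) for an HTML string, both as UTF-8."""
    parser = TextCounter()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation is not counted as text.
    text = " ".join("".join(parser.chunks).split())
    return len(text.encode("utf-8")), len(html.encode("utf-8"))


sample = "<html><head><style>p{color:red}</style></head><body><p>Hi there</p></body></html>"
print(text_share(sample))
```

Running text_share over your own pages should show fairly quickly
whether the text is a large enough share of the total to be worth
optimizing.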
Actually, the Unicode website has a *very* good example for
you of the approximate impact on page size of translation
of web pages that are focussed mostly on text.
http://www.unicode.org/standard/WhatIsUnicode.html
is translated into 52 languages now, all using Unicode (in UTF-8)
on the web pages.
The WhatIsUnicode.html page itself is 18,966 bytes in size.
The translated pages range from a low of 8795 bytes (for
Simplified Chinese) to a high of 18,336 bytes (for Uyghur).
A European language without too many accents typically comes
in around 10K, while South Asian languages run about 13K - 15K
or so.
The English original text isn't listed separately as a
translated page, but if it were, it would come in around 9K,
by way of comparison. The WhatIsUnicode.html page itself is
18,966 bytes in size because it also contains the long, long
index for all 52 translated pages.
--Ken