From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Jul 16 2007 - 13:37:11 CDT
On Mon, 16 Jul 2007, Magda Danish (Unicode) wrote
(quoting Daniel Johnson):
> I have a question about how much space Unicode takes up. I am working
> on an HTML project in multiple languages. Each of these web pages has to
> be stored on a chip with limited space. Is there any way to "compact"
> the HTML scripts in order to save space on the chip? Or is there a
> different call number for a character which will take up less space in
> hex?
If you use UTF-8, which is almost always the right encoding for a
Unicode-encoded HTML document, then all ASCII characters occupy one byte
(octet) each, just as in ASCII encoding and in ISO 8859 encodings. This
means in particular that HTML markup, as well as any embedded CSS or
JavaScript code, takes the same number of bytes as it would in ASCII.
For textual content, the situation is different and depends on the
character repertoire used, which in turn depends on the language. In
UTF-8, one Unicode character may take up to four bytes. Thus, there is a
potential loss of space efficiency as compared with other encodings.
Using UTF-8 for all pages is, however, a simple approach and saves some
headaches.
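To get a feeling for the numbers, the following minimal Python sketch
counts the bytes that a few sample characters take in UTF-8. The helper
name utf8_size and the sample characters are illustrative assumptions,
not part of the original question; the ISO 8859-1 comparison only works
for characters that exist in that encoding.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes (octets) that text occupies in UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "ASCII markup '<p>'": "<p>",                     # 3 characters, 1 byte each
    "Latin letter a with diaeresis": "\u00e4",       # 2 bytes in UTF-8
    "Cyrillic letter ya": "\u044f",                  # 2 bytes in UTF-8
    "CJK ideograph U+4E2D": "\u4e2d",                # 3 bytes in UTF-8
    "Character outside the BMP (U+1F600)": "\U0001F600",  # 4 bytes in UTF-8
}

for label, text in samples.items():
    print(f"{label}: {utf8_size(text)} byte(s) in UTF-8")

# For comparison: the same Latin letter takes one byte in ISO 8859-1.
print(len("\u00e4".encode("iso-8859-1")), "byte in ISO 8859-1")
```

So a page that is mostly markup and English text costs nothing extra in
UTF-8, whereas text in, say, Cyrillic or CJK scripts takes two or three
bytes per character.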
When space is at a premium, you might consider using some general
compression method such as gzip. It is widely used for web documents, and
it can be used for HTML documents as well as for other data. Web browsers
can decode it automatically, provided that the compression is properly
indicated in the HTTP headers (Content-Encoding: gzip). Things might be
more difficult if you plan to make the files usable directly and not via
an HTTP server.
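As a rough illustration of the compression approach, here is a small
Python sketch that gzips a UTF-8 encoded page using the standard gzip
module. The function name compress_page and the sample page are
hypothetical; the actual saving depends heavily on the content, and very
short files may not shrink at all.

```python
import gzip


def compress_page(html: str) -> bytes:
    """Encode an HTML page as UTF-8 and compress the bytes with gzip."""
    return gzip.compress(html.encode("utf-8"))


# Hypothetical sample page: repetitive markup, which compresses well.
page = "<p>Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4</p>\n" * 200

raw = page.encode("utf-8")
packed = compress_page(page)
print(f"UTF-8: {len(raw)} bytes, gzipped: {len(packed)} bytes")

# Decompression restores the original bytes exactly.
restored = gzip.decompress(packed).decode("utf-8")
print("round trip OK:", restored == page)
```

When such a file is served over HTTP, the server would still need to
send the Content-Encoding: gzip header so that the browser knows to
decompress it.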
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/