Re: FW: Subj: Amount of Space Unicode Takes

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 16 2007 - 14:35:03 CDT


    Daniel Johnson asked:

    > I have a question about how much space Unicode takes up.

    On web pages, the text typically takes up only a small fraction of
    the overall storage space for the HTML and other content of those pages.

    > I am working on a HTML project in multiple languages. Each
    > of these web pages have to be stored on a chip with limited space.

    Take the Unicode home page (www.unicode.org) as an example.
    That page has a little over 2000 characters of text content
    displayed on it currently. But the page size of the HTML
    (today) is 24,706 bytes. And when it is displayed, it also
    loads the Unicode logo (1111 bytes), the two jpgs for the
    book and conference (20,711 bytes and 69,906 bytes, respectively)
    and loads two member logos that vary from 1K to about 6K in
    size.

    So for roughly 2000 characters of text content, you have
    roughly 120,000 bytes worth of HTML structure and graphics,
    even on a page which is relatively devoid of fancy graphic
    devices (no Flash or anything of that sort).
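
    The arithmetic behind that estimate can be sketched as follows. The
    byte counts are the ones quoted above; the 3,500 bytes per member logo
    is only an assumed midpoint of the 1K to 6K range, and the variable
    names are just for illustration.

```python
# Sizes in bytes quoted above for www.unicode.org.
html_bytes = 24_706
logo_bytes = 1_111
jpg_bytes = [20_711, 69_906]

# Assumption: the member logos vary from about 1K to 6K each,
# so 3,500 bytes apiece is only a midpoint guess.
member_logo_bytes = [3_500, 3_500]

# "A little over 2000 characters" of mostly ASCII text, rounded down.
text_chars = 2_000

total_bytes = html_bytes + logo_bytes + sum(jpg_bytes) + sum(member_logo_bytes)
text_share = text_chars / total_bytes

print(f"total page weight: {total_bytes} bytes")
print(f"text share of page weight: {text_share:.1%}")
```

    With those assumptions the text is under 2% of what the page loads.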

    Now consider translating that page into Chinese. Assuming the
    original text content were expressed in UTF-8 -- since most
    of it is ASCII, it takes a little over 2000 bytes in the HTML.
    Chinese would, on average, take about a third as many characters
    to express the same content, but each character would require
    3 bytes in UTF-8. So in Chinese, you might end up requiring
    about 700 characters x 3 bytes each, or roughly 2100 bytes
    in the HTML. Essentially no effective difference in text size
    overall.
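
    A minimal Python sketch of that estimate, using the rough character
    counts from the paragraph above (they are estimates, not measurements
    of a real translation; utf8_len is just a helper name for illustration):

```python
def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

# Per-character cost: ASCII letters take 1 byte, common Chinese
# characters (the CJK Unified Ideographs block) take 3 bytes.
assert utf8_len("A") == 1
assert utf8_len("\u4e2d") == 3   # U+4E2D

# Rough counts from the estimate above.
english_chars = 2000
chinese_chars = 700              # about a third as many characters

english_bytes = english_chars * utf8_len("A")
chinese_bytes = chinese_chars * utf8_len("\u4e2d")

print(english_bytes, chinese_bytes)
```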

    > Is there any way to "compact" the HTML scripts in order to save
    > space on the chip?

    I'm sure there are. Someone else familiar with embedded
    applications might be able to speak to that.

    > Or is there a different call number for a character which
    > will take up less space in hex?

    I presume (but am not certain) that you are referring to the
    numerical character references for Unicode characters.

    If you just use UTF-8 directly, for example, then the
    character U+2022 would be expressed in 3 bytes in the HTML.

    If you use a numeric reference instead, that would be
    "&#x2022;", which is 8 bytes of ASCII.

    You could save one byte by using a decimal numeric character
    reference, instead of a hexadecimal one: "&#8226;", or 7
    bytes of ASCII. But you'd be back to 8 bytes for Chinese
    characters, for example, because the decimal values get
    larger than 9999 for those.
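
    To check those counts for any character, something like the following
    Python sketch would do. The function name ref_sizes is made up for
    this illustration; U+4E2D is just one example of a Chinese character.

```python
def ref_sizes(ch: str) -> dict:
    """Byte cost of one character as raw UTF-8 versus as a hexadecimal
    or decimal numeric character reference (which are pure ASCII)."""
    cp = ord(ch)
    return {
        "utf8": len(ch.encode("utf-8")),
        "hex_ref": len(f"&#x{cp:X};"),
        "dec_ref": len(f"&#{cp};"),
    }

bullet = ref_sizes("\u2022")   # U+2022 BULLET
cjk = ref_sizes("\u4e2d")      # U+4E2D, decimal 20013

print(bullet)
print(cjk)
```

    In both cases the raw UTF-8 form is 3 bytes, well under either kind
    of reference.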

    In general, you would be much better off just keeping your
    web pages in UTF-8 and avoiding numeric character references,
    if you are counting bytes for the text on the page.

    But do a realistic assessment of how much of your HTML is
    not basic plain text content before you start worrying too
    much about how much of a storage penalty your pages will
    have for translating them into multiple languages and
    using Unicode as your character set.
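
    One way to make that assessment is to measure how many bytes of a
    page are visible text and how many are everything else. The sketch
    below uses Python's standard html.parser module; the class and
    function names are hypothetical, the sample page is invented, and
    real pages may need more careful handling than this.

```python
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)

def text_and_total_bytes(html: str):
    """Return (UTF-8 bytes of visible text, UTF-8 bytes of whole page)."""
    parser = TextCounter()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation is not counted as text.
    text = " ".join("".join(parser.text_parts).split())
    return len(text.encode("utf-8")), len(html.encode("utf-8"))

# Invented sample page, for illustration only.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    '<body><p class="x">Hello \u4e2d</p></body></html>'
)
text_bytes, total_bytes = text_and_total_bytes(sample)
print(f"{text_bytes} of {total_bytes} bytes are text")
```

    Run over your own pages (plus the sizes of any images they load),
    this gives a quick idea of how much the text encoding can matter.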

    Actually, the Unicode website has a *very* good example for
    you of the approximate impact on page size of translation
    of web pages that are focussed mostly on text.

    http://www.unicode.org/standard/WhatIsUnicode.html

    is translated into 52 languages now, all using Unicode (in UTF-8)
    on the web pages.

    The WhatIsUnicode.html page itself is 18,966 bytes in size.

    The translated pages range from a low of 8795 bytes (for
    Simplified Chinese) to a high of 18,336 bytes (for Uyghur).
    A European language without too many accents typically comes
    in around 10K, while South Asian languages run about 13K - 15K
    or so.

    The English original text isn't listed separately as a
    translated page, but if it were, it would come in around 9K,
    by way of comparison. The WhatIsUnicode.html page itself is
    18,966 bytes because it also contains the long, long index
    for all 52 translated pages.

    --Ken


