Re: FW: Subj: Amount of Space Unicode Takes

From: Addison Phillips (addison@yahoo-inc.com)
Date: Mon Jul 16 2007 - 14:08:56 CDT

    The question of how much space Unicode takes depends on a number of
    factors. For HTML, the character encoding usually used for Unicode is
    UTF-8. UTF-8 is a variable-width encoding: each character takes between
    one and four bytes to encode. The four-byte characters are exceedingly
    rare in practice.

    ASCII characters, which include most markup (tags, etc.), take one byte
    per character. Non-ASCII characters for a number of scripts (such as those
    used to write most European languages) take two bytes per character. The
    characters for other scripts (and thus languages) take three bytes each.
    Please note that this does not mean that no three-byte characters will
    appear in, say, an English document: punctuation such as curly quotes
    or the euro sign takes three bytes.
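
    To make the byte counts concrete, here is a minimal sketch in Python
    (the language and the sample characters are my own choices for
    illustration, not anything from your project). It encodes a few
    characters as UTF-8 and reports how many bytes each one takes.

```python
# Sample characters chosen for illustration only.
samples = {
    "A": "ASCII letter",
    "<": "ASCII markup character",
    "\u00e9": "Latin small e with acute",
    "\u044f": "Cyrillic small ya",
    "\u20ac": "euro sign",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "character outside the Basic Multilingual Plane",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {utf8_len(ch)} byte(s)")
```

    Running the same count over your real pages, rather than over single
    characters, would tell you how much of the total is actually non-ASCII
    text.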

    > Or is there a different call number for a character which
    > will take up less space in hex?

    If you are storing characters as numeric entities (&#x89AB; or
    &#12345;), you should note that this takes more space than using UTF-8
    to encode the characters. There are various Unicode-specific compression
    schemes (SCSU, BOCU-1, etc.), but these are probably more trouble than
    they are worth for your application.

    Ultimately, if you want to save space, compressing the files using
    normal compression schemes probably saves you the most storage (at the
    cost of some performance, since the files must be uncompressed at
    runtime), because the majority of your "text" is going to be markup
    (HTML tags and the like, which are ASCII).
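
    As a rough illustration of general-purpose compression, the sketch
    below uses Python's standard `zlib` module on a made-up, markup-heavy
    page. The sample page is invented, and whether a DEFLATE decompressor
    fits on your chip is an assumption you would need to verify.

```python
import zlib

# An invented, markup-heavy page: repeated table rows with a little
# non-ASCII text in each cell.
rows = "".join(
    f'<tr><td class="label">\u9805\u76ee {i}</td>'
    f'<td class="value">\u5024 {i}</td></tr>\n'
    for i in range(50)
)
html = f"<html><body><table>\n{rows}</table></body></html>"

raw = html.encode("utf-8")       # what would be stored uncompressed
packed = zlib.compress(raw, 9)   # level 9 = smallest output, slowest

# Decompression must give back exactly the original bytes.
assert zlib.decompress(packed) == raw

print(f"uncompressed: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

    Repetitive markup like this compresses very well; the saving on your
    real pages depends on how repetitive they are.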

    Hope that helps.

    Best Regards,

    Addison

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Chair -- W3C Internationalization Core WG
    Internationalization is an architecture.
    It is not a feature.

    Magda Danish (Unicode) wrote:
    >  Daniel,
    > I am forwarding your question to the Unicode mailing list http://www.unicode.org/consortium/distlist.html for possible help from list subscribers.
    > Regards,
    > 
    > ---------------------------
    > Magda Danish
    > Sr. Administrative Director
    > The Unicode Consortium
    > 650-693-3921
    > magda@unicode.org
    > 
    > 
    > 
    > -----Original Message-----
    > Date/Time:    Fri Jul 13 12:58:18 CDT 2007
    > Contact:      dbjohnson88@hotmail.com
    > Name:         Daniel Johnson
    > Report Type:  Other Question, Problem, or Feedback
    > Opt Subject:  Amount of Space Unicode Takes
    > 
    > I have a question about how much space Unicode takes up. I am working
    > on a HTML project in multiple languages. Each of these web pages have
    > to be stored on a chip with limited space. Is there any way to
    > "compact" the HTML scripts in order to save space on the chip? Or is
    > there a different call number for a character which will take up less
    > space in hex? It would be greatly appreciated if the email was answered.
    > 
    > Thank you
    > 
    > Daniel Johnson
    > 
    > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- (End of Report)
    > 
    > 
    > 
    


    This archive was generated by hypermail 2.1.5 : Mon Jul 16 2007 - 14:10:29 CDT