From: Addison Phillips (addison@yahoo-inc.com)
Date: Mon Jul 16 2007 - 14:08:56 CDT
The question of how much space Unicode takes depends on a number of
factors. For HTML, the character encoding usually used for Unicode is
UTF-8. UTF-8 is a variable width encoding. Each character takes between
one and four bytes to encode. The four-byte characters are exceedingly
rare in practice.
ASCII characters, which include most markup (tags, etc.), take one byte
per character. Non-ASCII characters for a number of scripts (such as those
used to write most European languages) take two bytes per character. The
characters for other scripts (and thus languages) take three bytes each.
Please note that three-byte characters can still appear in, say, an
English document (curly quotation marks and the euro sign are examples).
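As a rough illustration of the byte counts above, here is a minimal Python
sketch that prints how many bytes UTF-8 uses for a few characters. The sample
characters and the helper name utf8_len are picked for illustration only; they
are not taken from the original question.

```python
# Sample characters chosen for illustration only (written as escapes so the
# code points are unambiguous).
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin small e with acute",
    "\u044f": "Cyrillic small ya",
    "\u2019": "right single quotation mark",
    "\u20ac": "euro sign",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "emoji (outside the Basic Multilingual Plane)",
}


def utf8_len(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


for ch, desc in samples.items():
    print(f"U+{ord(ch):04X} {desc}: {utf8_len(ch)} byte(s)")
```

The quotation mark and the euro sign are the kind of three-byte characters
that turn up in otherwise plain English text.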
> Or is there a different call number for a character which
> will take up less space in hex?
If you are storing characters as numeric entities (&#x89AB; or
&#12345;), you should note that this takes more space than using UTF-8
to encode the characters. There are various Unicode-specific compression
schemes (SCSU, BOCU-1, etc.), but these are probably more trouble than
they are worth for your application.
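To make the size difference concrete, the following Python sketch compares the
UTF-8 length of a character with the length of its decimal and hexadecimal
numeric entity. The function name entity_sizes is hypothetical, and the sample
code points are only examples.

```python
def entity_sizes(ch):
    """Return (utf8_bytes, decimal_entity_bytes, hex_entity_bytes) for one character."""
    cp = ord(ch)
    utf8 = len(ch.encode("utf-8"))
    decimal = len(f"&#{cp};")    # e.g. "&#12345;"; entities are pure ASCII
    hexa = len(f"&#x{cp:X};")    # e.g. "&#x3039;"
    return utf8, decimal, hexa


# Example code points only: U+00E9, U+3039, U+89AB.
for ch in ("\u00e9", "\u3039", "\u89ab"):
    utf8, decimal, hexa = entity_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8}, decimal entity {decimal}, hex entity {hexa}")
```

Because an entity is written in ASCII, each of its characters costs one byte,
so the entity length is also its size in a UTF-8 or ASCII file.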
Ultimately, if you want to save space, compressing the files using
general-purpose compression schemes probably saves you the most storage
(at some cost in performance, since the files must be decompressed at
runtime), because the majority of your "text" is going to be markup (HTML
tags and the like, which are ASCII).
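A small Python sketch of that trade-off, using the standard zlib module. The
page content is made up for illustration (repetitive markup around a little
non-ASCII text), and the actual savings depend entirely on the real files.

```python
import zlib

# Hypothetical page: one table row repeated many times.
row = '<tr><td class="label">\u540d\u524d</td><td class="value">\u5024</td></tr>\n'
page = "<html><body><table>\n" + row * 50 + "</table></body></html>\n"

raw = page.encode("utf-8")
packed = zlib.compress(raw, 9)  # level 9: smallest output, slowest to compress

print(len(raw), "bytes as UTF-8")
print(len(packed), "bytes after zlib compression")

# Reading the page back costs a decompression step at runtime.
restored = zlib.decompress(packed).decode("utf-8")
```

Whether the device can afford the decompression step is a separate question
that this sketch does not address.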
Hope that helps.
Best Regards,
Addison
--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

Magda Danish (Unicode) wrote:
> Daniel,
> I am forwarding your question to the Unicode mailing list
> http://www.unicode.org/consortium/distlist.html for possible help from
> list subscribers.
> Regards,
>
> ---------------------------
> Magda Danish
> Sr. Administrative Director
> The Unicode Consortium
> 650-693-3921
> magda@unicode.org
>
> -----Original Message-----
> Date/Time: Fri Jul 13 12:58:18 CDT 2007
> Contact: dbjohnson88@hotmail.com
> Name: Daniel Johnson
> Report Type: Other Question, Problem, or Feedback
> Opt Subject: Amount of Space Unicode Takes
>
> I have a question about how much space Unicode takes up. I am working
> on a HTML project in multiple languages. Each of these web pages have
> to be stored on a chip with limited space. Is there any way to
> "compact" the HTML scripts in order to save space on the chip? Or is
> there a different call number for a character which will take up less
> space in hex? It would be greatly appreciated if the email was
> answered.
>
> Thank you
>
> Daniel Johnson
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)