From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 16:19:15 CDT
If you're using Unicode, there's no way to avoid using more than one
byte for at least some characters, since there are more than 256 of them.
If you use UTF-32, every character is four bytes. If you use UTF-8, a
character takes from one to four bytes depending on the value of its
code point. If you use UTF-16, every character in the BMP (Basic
Multilingual Plane, U+0000 to U+FFFF) is two bytes, and any character
outside the BMP takes four bytes.
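
If you want to see those sizes for yourself, here is a minimal sketch in Python. The function name encoded_lengths and the sample characters are just my choices for illustration. It uses the little-endian codec names so that no byte order mark gets counted.

```python
# Bytes needed to encode a single character in each Unicode encoding form.
# The "-le" codecs are used so that no byte order mark is added to the count.
def encoded_lengths(ch):
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# One sample from each UTF-8 length class; the last one lies outside the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

The first three samples (U+0041, U+00E9, U+20AC) are in the BMP and take one, two and three bytes in UTF-8 but always two in UTF-16. The last (U+1D11E) takes four bytes in all three encodings.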
The downside of UTF-16 and UTF-8 is that characters are not all the same
length, which makes processing more complicated. With UTF-16, however,
if you know that there are no characters outside the BMP, every
character is a constant two bytes wide.
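
As a rough sketch of what that buys you, again in Python: once you have checked that the text is BMP-only, the character at position i sits at byte offset 2*i in the UTF-16 data, so you can index it directly. The names is_bmp_only and utf16_char_at are hypothetical helpers of mine, not anything standard, and the sketch assumes little-endian UTF-16 without a byte order mark.

```python
def is_bmp_only(text):
    # True if every code point lies in the BMP (U+0000 to U+FFFF).
    return all(ord(ch) <= 0xFFFF for ch in text)

def utf16_char_at(data, index):
    # Fetch the character at position `index` from UTF-16LE bytes.
    # Only valid when the encoded text is known to be BMP-only, since
    # it assumes every character occupies exactly two bytes.
    return data[2 * index : 2 * index + 2].decode("utf-16-le")

text = "na\u00efve \u20ac"          # seven characters, all in the BMP
data = text.encode("utf-16-le")

if is_bmp_only(text):
    print(len(data) == 2 * len(text))   # fixed width: two bytes each
    print(utf16_char_at(data, 6) == "\u20ac")
```

If is_bmp_only returns False, the fixed-width shortcut no longer holds, and you have to walk the data and deal with the four-byte characters.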
Bill