From: John H. Jenkins (jenkins@apple.com)
Date: Mon Jul 07 2008 - 18:18:17 CDT
On Jul 7, 2008, at 3:19 PM, William J Poser wrote:
> There's no way to avoid using more than one byte per character if
> you're using Unicode since there are more than 256 characters. If
> you use UTF-32, every char is four bytes. If you use UTF-8, characters
> take from one to four bytes depending on where the corresponding
> codepoint is. If you use UTF-16, every character in the BMP is two
> bytes, any character outside of the BMP takes four bytes.
>
The fixed width of UTF-32 isn't as much of an advantage as it sounds,
since in most Unicode processes you need to be prepared to deal with
multiple characters at once anyway.
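
The byte counts quoted above can be checked with a short Python sketch.
This is only an illustration: the helper name encoded_sizes and the
sample characters are my own choices, not anything from the thread. The
last lines show one common case of "multiple characters at once": a
base letter followed by a combining mark, which takes two code points
even in UTF-32.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in each encoding form.

    Hypothetical helper for illustration. The "-le" codec variants are
    used so that no byte order mark is added to the count.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# Sample code points chosen for illustration:
# U+0041 (ASCII), U+00E9 (Latin-1 range), U+4E2D (BMP, CJK),
# U+1D11E (MUSICAL SYMBOL G CLEF, outside the BMP).
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as a single
# accented letter but is two code points, so UTF-32 needs 8 bytes for it.
decomposed = "e\u0301"
print(len(decomposed), len(decomposed.encode("utf-32-le")))
```

So UTF-32 gives a fixed width per code point, not per character as a
user sees it.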
> The downside of UTF-16 and UTF-8 is that characters are not the same
> length, which makes processing more complicated. With UTF-16, however,
> if you know that there are no characters outside the BMP, every
> character is a constant two bytes wide.
>
That's the problem. You really can't make the assumption that you're
dealing with BMP-only text.
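
As a rough sketch of why the assumption fails, the Python below counts
UTF-16 code units for a string containing one supplementary character.
The function names is_bmp_only and utf16_code_units and the sample
string are hypothetical, made up for this example.

```python
def is_bmp_only(text: str) -> bool:
    """True if every code point in text lies in the BMP (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units text occupies when encoded as UTF-16."""
    # utf-16-le adds no byte order mark; each code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2


# Sample text ending in U+1D11E MUSICAL SYMBOL G CLEF, which is outside the BMP.
text = "G clef: \U0001D11E"

print(is_bmp_only(text))       # False
print(len(text))               # 9 code points
print(utf16_code_units(text))  # 10 code units: the clef is a surrogate pair
```

Code that indexes such text as if every character were one 16-bit unit
will split the surrogate pair, so a check like is_bmp_only would have to
be run on all input before the two-byte shortcut could be trusted.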
=====
John H. Jenkins
jenkins@apple.com