From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 14:29:32 CDT
There seem to be religious views on this question, but my own practice is
to use UTF-32 internally in almost all cases. Yes, it takes more memory
than UTF-8, but the modest additional memory usage rarely matters in
practice. On the other hand, dealing with UTF-32 is much easier and less
error-prone than dealing with UTF-8: every four bytes is one character,
so you can do simple array arithmetic, simple calculations of how much
memory you need to allocate, and so on.
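To make that concrete, here is a rough sketch of the sort of arithmetic
UTF-32 permits; the function names (utf32_len, utf32_dup) are just mine,
for illustration, not from any library:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length in characters is just the number of 32-bit units. */
size_t utf32_len(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* The allocation size is a plain multiplication: (length + 1) units. */
uint32_t *utf32_dup(const uint32_t *s)
{
    size_t n = utf32_len(s);
    uint32_t *copy = malloc((n + 1) * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, s, (n + 1) * sizeof *copy);
    return copy;
}

/* And the i-th character is simply s[i]; no decoding needed. */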
Of course, if you are willing and able to do everything
using library functions and have a suitable UTF-8 library, then
you need not worry about the complications of operating on UTF-8.
So the choice depends in part on what kind of processing you are doing.
If you don't want to use ICU, I don't know of a single cross-platform
library that covers everything, but there are some that cover a lot.
For example, for regular expressions I recommend the TRE library:
http://www.laurikari.net/tre/. It is lightweight, robust, provides POSIX
regular expressions as well as extensions, including the best
approximate-matching facilities I am aware of, and has both multibyte
and wide-character APIs.
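For what it's worth, the POSIX interface that TRE implements is driven
along these lines (this is the plain byte-oriented API; the wide-character
and approximate-matching entry points are TRE extensions, so check its
headers for the exact names in your version):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    regmatch_t match[1];
    const char *pattern = "b[aeiou]+t";
    const char *text = "a boat and a bit";

    /* Compile an extended regular expression. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;

    /* Run it over the text; on success, match[0] holds byte offsets. */
    if (regexec(&re, text, 1, match, 0) == 0)
        printf("match at bytes %d-%d\n",
               (int)match[0].rm_so, (int)match[0].rm_eo);

    regfree(&re);
    return 0;
}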
I don't write for MS Windows so I don't worry too much about the fact
that wchar_t is only two bytes (is this true on Vista, by the way?), but
the fact that you can't assume that a wchar_t is large enough to hold
any Unicode character is indeed a real problem. I'd like to see all of
the wcs functions redone for defined sizes, e.g. uint32_t.
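Something along these lines, with the names (u32_cmp, u32_chr) being
purely hypothetical:

#include <stdint.h>
#include <stddef.h>

/* Analogue of wcscmp over 32-bit code points. */
int u32_cmp(const uint32_t *a, const uint32_t *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Analogue of wcschr: first occurrence of c, or NULL if absent.
   Like wcschr, searching for 0 finds the terminator. */
const uint32_t *u32_chr(const uint32_t *s, uint32_t c)
{
    for (; *s; s++)
        if (*s == c)
            return s;
    return c == 0 ? s : NULL;
}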
Bill