From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 20 2006 - 18:33:22 CDT
William Poser wrote:
> I'm confused as to the sense in which C and C++
> "don't support the Unicode character model".
Before Philippe weighs in with the inevitable opus ;-),
I think the main point is that neither C nor C++ has
a native CHARACTER datatype that is based on Unicode.
And for many years partisans of C and C++ have claimed
that this was a *good* thing, because it meant that
programs could be written "portably", without caring
what charset they were running under.
Personally I always considered that a misconstrual of
what it meant to write portable code, but that is
perhaps for another thread...
> It is
> very easy to manipulate objects of type wchar_t,
> arrays thereof, linked lists thereof, and so forth.
> The main theoretical difficulty that I see with Unicode
> processing in C is that you can't be sure that a wchar_t
> is at least 21 bits wide. This is of course a general
> defect of the C standard, which does not specify
> object sizes. In practice, however, I haven't myself
> encountered problems with this or heard of them.
Recently, in part at the urging of the Unicode Consortium,
the C and C++ standards have finally added data types with
guaranteed 16-bit and 32-bit widths, nominally tied to
10646/Unicode character semantics -- although there is no
built-in support for anything beyond the fixed width of the
data types themselves, and you would need a very recent
compiler to recognize them at all.
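For what it's worth, here is a minimal sketch of what those types look like in use, assuming they are the char16_t/char32_t types and u/U literal prefixes defined in ISO/IEC TR 19769 for C (and planned for the next C++ revision); nothing here is more than fixed-width storage:

    #include <uchar.h>   /* char16_t, char32_t (per TR 19769) */
    #include <stdio.h>

    int main(void)
    {
        /* UTF-16 and UTF-32 literals; the language guarantees the
           widths, but gives you no case mapping, normalization,
           or property lookups on top of them. */
        char16_t bmp[]  = u"caf\u00E9";      /* 16-bit code units */
        char32_t astral = U'\U0001D11E';     /* MUSICAL SYMBOL G CLEF */

        printf("UTF-16 code units: %zu\n",
               sizeof bmp / sizeof bmp[0] - 1);
        printf("sizeof(char32_t) = %zu, code point U+%04lX\n",
               sizeof(char32_t), (unsigned long)astral);
        return 0;
    }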
Personally, I consider manipulating Unicode characters
as wchar_t to be a mistake, because of the portability
issues. I've been writing and maintaining C libraries for
Unicode support for years, but never use wchar_t *anywhere*
in that code. Instead, I declare my own fundamental datatypes,
whose widths I can guarantee via the compiler-specific
makefiles.
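To give the flavor of it, a hypothetical sketch (the type and macro names here are invented for illustration, not the ones in my code):

    /* unitypes.h -- sketch: code unit widths are pinned down per
       platform by the makefile, rather than left to wchar_t. */
    #if defined(HAVE_STDINT_H)
    #include <stdint.h>
    typedef uint8_t   utf8_t;    /* one UTF-8 code unit  */
    typedef uint16_t  utf16_t;   /* one UTF-16 code unit */
    typedef uint32_t  utf32_t;   /* one UTF-32 code unit (a full code point) */
    #else
    /* Older compilers: the makefile defines UNI_UINT16 and UNI_UINT32
       to whatever unsigned types are 16 and 32 bits on that platform,
       e.g. -DUNI_UINT32="unsigned int" or -DUNI_UINT32="unsigned long". */
    typedef unsigned char  utf8_t;
    typedef UNI_UINT16     utf16_t;
    typedef UNI_UINT32     utf32_t;
    #endif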
With this approach I have complex libraries that support
UTF-8, UTF-16, *and* UTF-32 flawlessly across all Unix
platforms, Windows, and on occasion a variety of oddball
platforms, on both 32-bit and 64-bit processors, with
internal change of form (e.g. UTF-16 <--> UTF-32) on an
as-needed basis, depending on what kind of text processing
is needed. And that code has an absolute minimum of
platform-specific conditional compilation -- all of it
related to concerns like file paths and such that have
nothing to do with Unicode processing per se.
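The core of the UTF-16 --> UTF-32 direction, for example, is just surrogate-pair arithmetic. A sketch (not the actual library code, with validation of unpaired surrogates elided):

    #include <stdint.h>

    /* Decode one code point from a UTF-16 buffer, advancing *src
       past the one or two code units consumed. */
    static uint32_t utf16_next(const uint16_t **src)
    {
        uint32_t c = *(*src)++;
        if (c >= 0xD800 && c <= 0xDBFF) {          /* high surrogate */
            uint32_t lo = **src;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low surrogate */
                (*src)++;
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            }
            /* else: unpaired surrogate -- error handling elided */
        }
        return c;
    }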
> For the present, at least, there is also a good reason
> to use C IN PREFERENCE to high level languages for
> processing Unicode, for some applications. The
> high-level languages that I know of all limit
> Unicode support to the BMP. That is true of Python
> and Tcl, for example. In contrast, in C there
> is no such limitation.
I concur with this, if you are rolling your own support.
C also has very good performance characteristics, both
for what you can do in minimizing memory usage and in
maximizing speed.
On the other hand, for most people looking for Unicode
support, the most practical approach is to make use of
a big, full-featured Unicode library -- most notably
the Open Source ICU library, available in both C/C++ and
Java versions. The developers of such libraries have
already done an outstanding job of optimizing their
behavior, and they keep them up to date and compliant
with the current version of the standard.
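For example, converting a UTF-8 string into ICU's internal UTF-16 form is essentially a one-call affair with the C API -- a sketch along the following lines (check the ICU documentation for the exact signatures and error handling):

    #include <stdio.h>
    #include <unicode/ustring.h>   /* ICU C API: UChar, u_strFromUTF8, ... */

    int main(void)
    {
        const char *utf8 = "caf\xC3\xA9";   /* "café" in UTF-8 */
        UChar buf[32];                      /* UChar is ICU's UTF-16 code unit */
        int32_t len = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(buf, 32, &len, utf8, -1, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
            return 1;
        }
        printf("converted to %d UTF-16 code units\n", (int)len);
        return 0;
    }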
--Ken