From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jan 16 2004 - 12:53:34 EST
From: "Rick Cameron" <Rick.Cameron@businessobjects.com>
> Unfortunately, you cannot use UTF-8 as the default MBCS code page in
> Windows. In other words, Windows does not support the equivalent to
> setting the locale to xxx.UTF-8 in unix.
Exactly. However, the conversion between UTF-8 and UTF-16 (the Windows
"WideChar" encoding used in the Win32 Unicode API) is supported natively by
MultiByteToWideChar() and WideCharToMultiByte(), as if UTF-8 were an
SBCS/DBCS character set, even on Windows 95.
> But the good news is that in Windows (unlike unix), wchar_t always means
> UTF-16. And UTF-16 is a whole lot more convenient to work with than UTF-8!
I fully agree. UTF-16 is really convenient as the main encoding for Windows
programming and for interaction with the Win32 API, especially on Windows 95:
there you must first convert UTF-8 to UTF-16 wide chars before you can map
the text to the ACP (or OEMCP) code page, which is often the only encoding
that is really supported and working in many areas. With UTF-16 you need
only one conversion. And if you are working on an application that needs
internationalization, it is simply easier to port it from the 8-bit
ACP/OEMCP legacy code pages to the "WideChar" UTF-16 encoding.
UTF-16 will not cost you many more resources than UTF-8 for non-English
users. After all, if you are internationalizing your application, it is
legitimate to assume that ASCII will not be the only characters it uses, and
for other European languages the price of UTF-16 compared with UTF-8 is not
excessive. Conversely, the price of UTF-8 for non-European users is
prohibitive, while UTF-16 competes well with the legacy Asian MBCS charsets.
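
To put rough numbers on that cost comparison, here is a small sketch in Python (not part of the original discussion, which concerns Win32 code). The sample strings and the helper name encoded_sizes are my own illustrative choices; sizes are counted without a byte order mark.

```python
# Compare the storage cost, in bytes, of UTF-8 and UTF-16 for sample strings.
# The samples below are arbitrary illustrations, not measurements of real data.
SAMPLES = {
    "English": "Hello, world",
    "French": "Déjà vu à Noël",
    "Greek": "Καλημέρα κόσμε",
    "Japanese": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, text in SAMPLES.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{language:9} UTF-8: {utf8_size:3} bytes   UTF-16: {utf16_size:3} bytes")
```

Pure ASCII text doubles in size under UTF-16 (12 bytes against 24 for the English sample), while the Japanese sample takes 21 bytes in UTF-8 against 14 in UTF-16, since each of its characters needs 3 bytes in UTF-8 and 2 in UTF-16.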
In conclusion, on Windows, UTF-16 is the encoding that requires the fewest
conversions in your own code, so you gain performance by avoiding
conversions, allocations of working buffers, and multiple copies of the same
string in different encodings (notably on Windows 95).
So you'll need to worry about legacy charsets only if you need to use some
legacy DOS APIs for console apps on Windows 95 (and here again, only 1
conversion is needed within your console support layer for standard input &
standard output/error).
If you need to serialize UTF-16 for file storage, the conversion to UTF-8
can be performed on the fly with very basic code in your file or stream
layer, without complex buffer management or complex handling of encoding
issues. The only thing you must take care of is the possible bogus presence
of unpaired surrogates. This won't affect you immediately if you have never
used UTF-16 data before and you design your string handling routines to
preserve UTF-16 surrogate pairs, or if your first step is to support
languages that don't yet need surrogates for characters outside the BMP
(today, that mostly means extended Chinese). You can make sure your program
will not break in the future by refusing surrogates on input until you have
verified that your string handling routines preserve surrogate pairs, even
when you need to truncate strings to bounded lengths.
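
As a minimal sketch of the two precautions described above, the Python code below encodes a sequence of 16-bit code units to UTF-8 while rejecting unpaired surrogates, and truncates such a sequence without splitting a pair. The function names utf16_units_to_utf8 and truncate_utf16 are hypothetical, and representing the string as a list of integers is an assumption made for illustration; a real stream layer would work on its own buffer type.

```python
def utf16_units_to_utf8(units):
    """Encode a sequence of 16-bit code units as UTF-8 bytes.

    Raises ValueError on an unpaired surrogate instead of writing it out.
    """
    out = bytearray()
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            # Combine the pair into one code point above the BMP.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            cp = u
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes((0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)))
        elif cp < 0x10000:
            out += bytes((0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
        else:
            out += bytes((0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
    return bytes(out)


def truncate_utf16(units, max_units):
    """Keep at most max_units code units without splitting a surrogate pair."""
    if len(units) <= max_units:
        return list(units)
    end = max_units
    # If the last kept unit is a high surrogate, its low half would be lost,
    # so drop the high half as well.
    if end > 0 and 0xD800 <= units[end - 1] <= 0xDBFF:
        end -= 1
    return list(units[:end])


# "a", U+20000 (encoded as the pair D840 DC00), "b"
sample = [0x61, 0xD840, 0xDC00, 0x62]
print(utf16_units_to_utf8(sample).hex())
print([hex(u) for u in truncate_utf16(sample, 2)])
```

Truncating the sample to 2 code units keeps only the "a": the cut would otherwise fall between the two halves of the pair, which is exactly the case that produces an unpaired surrogate later on.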