RE: UTF-8 vs UTF-16 as processing code

From: Michael Kaplan (Trigeminal Inc.) (v-michka@microsoft.com)
Date: Fri Jun 16 2000 - 16:04:49 EDT


To Windows 2000 (and Windows NT as of SP4), UTF-8 is just another
multibyte encoding, reached via "code page 65001": MultiByteToWideChar
converts from it and WideCharToMultiByte converts to it. So the only
difference between it and any other code page, be it iso-8859-1 or
windows-1252, is that it happens to cover all languages. :-)

(That is a major understatement: obviously this makes it quite an important
code page, especially since products like IIS 5.0 and FrontPage 2000 only
support Unicode via UTF-8!)

There is no way to natively make all of the "W" functions in Windows accept
UTF-8; you would have to convert everything yourself (and if you do not, the
kernel will have to convert whatever you pass to the "A" functions at some
point anyway). The "A" functions do not accept UTF-8 in most cases; they
usually assume the default system code page, which can never be UTF-8 (it
is determined by the default system locale).

Michael

> ----------
> From: Jones, Bob[SMTP:Bob_Jones@jdedwards.com]
> Sent: Friday, June 16, 2000 12:09 PM
> To: Unicode List
> Subject: RE: UTF-8 vs UTF-16 as processing code
>
> I have the same question. And, if you do go UTF-8 for processing, how
> does that work with Windows NT/2000? Is it even possible to have input
> come in as UTF-8? If you compile with Unicode turned on, it seems to
> automatically be UCS-2.
>
> Bob
>
> -----Original Message-----
> From: erik@netscape.com [mailto:erik@netscape.com]
> Sent: Friday, June 16, 2000 11:26 AM
> To: Unicode List
> Subject: UTF-8 vs UTF-16 as processing code
>
>
> Hi everybody,
>
> I'm wondering if there are any analyses comparing UTF-8 with UTF-16 for
> use as a processing code. UCS-2 has often been considered a good
> representation to use internally inside a program because of its "fixed
> width" properties (assuming that you can somehow deal with combining
> marks, etc), but UTF-16 clearly isn't fixed width, especially now that
> Unicode and 10646 are about to actually assign characters beyond U+FFFF.
>
> The kind of analysis I have in mind is one that lists various pros and
> cons for each representation. I had a quick look at the Unicode 3.0
> book, but I haven't read all of it yet. Does anybody have any pointers
> to such analyses, e.g. URLs, books, etc?
>
> Thanks,
>
> Erik
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT