RE: UTF-8 on NT

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Sep 04 2001 - 18:47:07 EDT

Previous message: Michael (michka) Kaplan: "Re: TOP/BOTTOM HORIZONTAL BOX LINE: new characters?"
In reply to: Changjian_Sun@i2.com: "RE: UTF-8 on NT"
Next in thread: Yves Arrouye: "RE: UTF-8 on NT"

Changjian Sun,

If your code is currently setlocale() based, there is an easy conversion
to Unicode. xIUA http://www.xnetinc.com/xiua/ gives you a straight
migration path to the power of ICU.

You start by replacing your current i18n functions: setlocale() becomes
xiua_OpenLocale, strcoll() becomes xiua_strcoll, and so on. xiua_OpenLocale
takes POSIX style locales, for example
xiua_OpenLocale("fr_CA.cp1252",XDFUTF8); (French Canada with an associated
character set of windows-1252, with the data represented in UTF-8, so that
a xiua_ChartoNative function will convert 1252 to UTF-8). Unlike
setlocale() it is thread safe. The one logic change is that before you
terminate a thread you must close its locales and then terminate the xIUA
thread support to release resources. You can start out in code page mode
so that you can test the code as you migrate. Then you can switch to UTF-8
with no other code changes by opening the locale in UTF-8 (change
XDFCODEPAGE to XDFUTF8).

For example:

/* Open one locale handle per data format. */
LocJP_u2 = xiua_OpenLocale(LocJPWin, XDFUTF16);
LocJP_list[0] = &LocJP_u2;
LocJP_u4 = xiua_OpenLocale(LocJPWin, XDFUTF32);
LocJP_list[1] = &LocJP_u4;
LocJP_u8 = xiua_OpenLocale(LocJPWin, XDFUTF8);
LocJP_list[2] = &LocJP_u8;
LocJP_win = xiua_OpenLocale(LocJPWin, XDFCPWIN);
LocJP_list[3] = &LocJP_win;
LocJP_unix = xiua_OpenLocale(LocJPUnix, XDFCPUNIX);
LocJP_list[4] = &LocJP_unix;
LocJP_cp = xiua_OpenLocale(LocJPWin, XDFCODEPAGE);

/* Run the same string test against list entries 0..3
   (UTF-16, UTF-32, UTF-8 and the Windows code page). */
for (j = 0; j < 4; j++)
{
  xiua_SetLocaleHdl(*LocJP_list[j]);
  strcpy(test_buff, "String Test ");
  strcat(test_buff, Loc_DataFormat[j].dfmt);
  runTestb(test_buff, &line);
}

This test uses the same string handling routines to process UTF-32, UTF-16,
UTF-8 and code page data.

It works by converting arguments and results for routines like xiua_strcoll.
The underlying code will invoke the ICU collation code. If you want more
flexibility you can invoke xiua_strcollEx to provide different strength,
case, and normalization values. For the full power of ICU you can invoke
any ICU API directly.
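
To make the conversion step concrete: with a locale such as fr_CA.cp1252
opened as XDFUTF8, windows-1252 data has to end up as UTF-8. The following
is only a standalone sketch of that kind of conversion. It is not the xIUA
or ICU code, and the function name cp1252_to_utf8 and its return
convention are my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode values for windows-1252 bytes 0x80..0x9F.
   0 marks the five bytes that have no assignment. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: convert a 0-terminated windows-1252 string to
   UTF-8. Returns the number of bytes written (not counting the
   terminator), or (size_t)-1 if dst is too small. Unassigned bytes
   are replaced with U+FFFD. */
size_t cp1252_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char b = (unsigned char)*src;
        unsigned int cp = b;   /* 0x00..0x7F and 0xA0..0xFF map 1:1 */

        if (b >= 0x80 && b <= 0x9F) {
            cp = cp1252_high[b - 0x80];
            if (cp == 0)
                cp = 0xFFFD;
        }
        if (cp < 0x80) {                      /* 1-byte sequence */
            if (n + 1 >= dst_size) return (size_t)-1;
            dst[n++] = (char)cp;
        } else if (cp < 0x800) {              /* 2-byte sequence */
            if (n + 2 >= dst_size) return (size_t)-1;
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {                              /* 3-byte sequence */
            if (n + 3 >= dst_size) return (size_t)-1;
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (n >= dst_size) return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Every windows-1252 character is in the BMP, so three UTF-8 bytes per input
byte is the worst case when sizing the output buffer.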

For functions like strtok you can invoke xiua_strtok. This is implemented
differently in that there are separate UTF-32, UTF-16, UTF-8 and code page
implementations. Also, unlike strtok, it is thread safe. There is also a
xiua_strtok_r implementation, which gives you this capability even on
Windows platforms, which do not support strtok_r.
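
For readers who have not used the _r form: the thread safety comes from
keeping the scan position in a variable the caller owns instead of in
hidden static state. Below is a minimal byte-oriented sketch of that idea
in portable C. It is not the xIUA implementation, and the name tok_r is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reentrant tokenizer in the style of POSIX strtok_r. The position of
   the scan is stored in *save, which the caller supplies, so several
   threads (or nested loops) can tokenize at the same time. */
char *tok_r(char *str, const char *delims, char **save)
{
    char *start;

    assert(delims != NULL && save != NULL);
    if (str == NULL)
        str = *save;                 /* continue the previous scan */
    if (str == NULL)
        return NULL;

    str += strspn(str, delims);      /* skip leading delimiters */
    if (*str == '\0') {
        *save = str;
        return NULL;                 /* no more tokens */
    }
    start = str;
    str += strcspn(str, delims);     /* find the end of the token */
    if (*str != '\0')
        *str++ = '\0';               /* terminate it and step past */
    *save = str;
    return start;
}
```

Because this works on bytes, it is safe for UTF-8 data only when the
delimiters are ASCII, and it cannot be used on UTF-16 or UTF-32 buffers at
all. That is one reason to have a separate implementation per format.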

This code is designed to be modified. It implements xiua_strcmp using
Unicode code point order for UTF-32, UTF-16 and UTF-8 so that they all
compare the same way. If you don't like that, just change the code.
However, if you are using a database which uses Unicode code point order
compares then you might want all forms to compare the same.
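
The reason this needs special handling at all: a plain code unit compare
of UTF-16 puts supplementary characters (surrogate pairs, 0xD800..0xDFFF)
before BMP characters in U+E000..U+FFFF, while UTF-8 and UTF-32 byte/value
order already matches code point order. Here is a small sketch showing the
difference. The function names are hypothetical and this is not the xIUA
source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a 0-terminated UTF-16 string and advance
   *s past it. An unpaired surrogate is returned as its own value. */
static uint32_t next_cp_utf16(const uint16_t **s)
{
    uint32_t c = *(*s)++;

    if (c >= 0xD800 && c <= 0xDBFF) {
        uint32_t lo = **s;
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*s)++;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return c;
}

/* Compare in code point order: same result as comparing the UTF-32
   (or UTF-8) form of the two strings. */
int utf16_cmp_codepoint(const uint16_t *a, const uint16_t *b)
{
    uint32_t ca, cb;

    assert(a != NULL && b != NULL);
    for (;;) {
        if (*a == 0 || *b == 0)
            return (*a != 0) - (*b != 0);
        ca = next_cp_utf16(&a);
        cb = next_cp_utf16(&b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
}

/* Naive compare in code unit order, shown only for contrast. */
int utf16_cmp_units(const uint16_t *a, const uint16_t *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```

For U+FF5E against U+10000 the two functions disagree: in code point order
U+FF5E sorts first, but in code unit order the lead surrogate 0xD800 is
smaller than 0xFF5E, so U+10000 sorts first.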

It even supports multiple open locales per thread, so that you can have
HTML/XML files in EUC-JP, a database using UTF-8 SQL to retrieve UTF-16
data, and communications with a browser using Shift_JIS. Some calls like
xiua_LocaletoLocale use two open locales and will convert the data from the
format of one locale to the format of the other, or just copy it if the
formats are the same.

This is all open source code so you are not locked in. xIUA code is really
a starter application that is designed to be integrated into your own
application code and changed to suit your needs.

Carl

  -----Original Message-----
  From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Changjian_Sun@i2.com
  Sent: Tuesday, September 04, 2001 10:54 AM
  To: Vaintroub, Wladislav
  Cc: unicode@unicode.org
  Subject: RE: UTF-8 on NT

Do you think that UTF-8 is the wrong way to internationalize
cross-platform software?

  Our application supports Solaris, HP-UX, AIX and NT. To internationalize
  it, we are thinking of UTF-8 (setlocale(), strcoll() for sorting,
  mbstowcs() for length...) so that we don't have to add the wide character
  (wchar_t) data type everywhere in the source code. I did some tests on
  Unix: setlocale(), strcoll() and mbstowcs() look OK for UTF-8, but I am
  stuck with NT.

Do you mean I have to use a different approach for NT internationalization?

I'm also thinking of 3rd party UTF-8 support such as libutf8 or IBM ICU.
They seem to have no good support on NT. What do you think?

Thanks.
-Changjian Sun

"Vaintroub, Wladislav" <Wladislav.Vaintroub@softwareag.com>
09/04/01 01:40 PM

                To: "'Changjian_Sun@i2.com'" <Changjian_Sun@i2.com>, unicode@unicode.org
                cc:
                Subject: RE: UTF-8 on NT

I'm afraid that there is no way to set a UTF-8 locale on Windows via
setlocale. Even if you try to do this with setlocale("French_Canada.65001")
it won't work correctly.
It's a pity, because porting Unix programs that rely on a UTF-8
locale becomes a very challenging task on Windows.

Wladislav Vaintroub.

  -----Original Message-----
  From: Changjian_Sun@i2.com [mailto:Changjian_Sun@i2.com]
  Sent: Tuesday, September 04, 2001 6:36 PM
  To: unicode@unicode.org
  Subject: UTF-8 on NT

  On Unix, we can set a French UTF-8 locale by calling
      setlocale(LC_ALL, "fr_CA.UTF-8").
  On NT, I don't know how to set a French UTF-8 locale;
  setlocale(LC_ALL, "French_Canada.1252") does not seem to be for UTF-8.

  My questions:
  1. Is UTF-8 supported on NT?
  2. If yes, how do I use setlocale() to set it up?
  Thanks.

-Changjian Sun


This archive was generated by hypermail 2.1.2 : Tue Sep 04 2001 - 19:35:26 EDT