Re: wide chars and methods

From: Bob Eaton (pete_dembrowski@hotmail.com)
Date: Sat Oct 29 2005 - 01:21:18 CST


    Vidya,

    'char' works because you're probably not building the app with the _UNICODE compiler define, so yours is really an "Ansi" app. Ansi apps work on NT-based OSs for those ranges of Unicode that have code page support (cf. the other thread on the Unicode list about "ANSI and Unicode for x00 - xFF").

    So Japanese will work because NT provides code page 932 (I think) to turn wide characters into narrow (char) characters and vice versa automatically. As Murray and Michael pointed out in that other thread, however, this won't work for Devanagari because Devanagari doesn't have "Ansi" code page support.

    If you build your app with _UNICODE, then 'char' won't work because each character will be (nominally) two bytes.

    You have to call setlocale to tell the system which code page to use when converting those single-byte (or, with Japanese, double-byte) characters to wide/Unicode and vice versa. (Aside: this is probably only true if the code page you set in setlocale differs from the default system code page, i.e. if the default system code page is already 932, you probably don't have to call setlocale.)
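To make the role of the locale concrete, here is a minimal sketch of a narrow-to-wide conversion that depends on whatever locale is currently active. It uses only the standard C library (std::mbstowcs); the helper name narrow_to_wide is hypothetical and not something from the thread, and the Windows console may well take a different path internally.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a narrow (multibyte) string to a wide string
// using the encoding of the current C locale (LC_CTYPE).
// Returns an empty string if the bytes are not valid in that locale.
std::wstring narrow_to_wide(const std::string& narrow) {
    if (narrow.empty()) return std::wstring();

    // Each wide character consumes at least one byte of input, so the
    // result can never have more elements than the input has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);

    std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();  // invalid sequence

    assert(n <= narrow.size());
    return std::wstring(buf.data(), n);
}
```

Calling std::setlocale(LC_ALL, "") before narrow_to_wide selects the user's default locale, which is the call Vidya found necessary. Without it the program starts in the "C" locale, where only plain ASCII is guaranteed to convert.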

    Keyboard input and file i/o are completely different things. The keyboard will probably work if you use setlocale, so the system knows which code page to use when it sends you narrow Ansi characters (having converted them from wide/Unicode). The file i/o depends on whether you're using the _UNICODE switch or not. If it's not set, you'll read/write single (or double) "Ansi" byte(s) for each character. If it is set, you'll read/write UTF-16 words (nominally, 2 bytes) for each character.
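As a rough illustration of the two on-disk layouts, the sketch below builds the bytes for the three characters from Vidya's example (U+307E, U+305B, U+3093) both ways. It assumes the wide form is written as raw little-endian UTF-16 and that the narrow form is code page 932 (Shift-JIS). The function names are hypothetical, and char16_t stands in for the 16-bit wchar_t of Windows so the sketch also compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes: low byte first,
// then high byte, two bytes per code unit.
std::vector<unsigned char> utf16le_bytes(const std::vector<char16_t>& units) {
    std::vector<unsigned char> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        out.push_back(static_cast<unsigned char>(u & 0xFF));
        out.push_back(static_cast<unsigned char>((u >> 8) & 0xFF));
    }
    return out;
}

// The same three characters as code page 932 (Shift-JIS) bytes.
// Each hiragana takes a lead byte 0x82 followed by one trail byte.
std::vector<unsigned char> cp932_masen() {
    return {0x82, 0xDC, 0x82, 0xB9, 0x82, 0xF1};
}
```

For this particular text both forms happen to occupy six bytes, but the byte values differ completely, so a file written one way cannot be read the other way without conversion. For plain ASCII such as "abc" the narrow form is three bytes while the UTF-16 form is six.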

    If you use TCHAR, then it works in both Ansi (a.k.a. _MBCS) and _UNICODE mode. If you use wchar_t, it only makes sense if you then have _UNICODE set.
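The idea behind TCHAR can be sketched in a few lines. The names my_tchar, MY_T and my_tstring below are hypothetical stand-ins, written so the sketch compiles without <tchar.h>; the real header does the same kind of switch on _UNICODE for TCHAR and _T(), along with many function-name mappings that are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for TCHAR and _T() from <tchar.h>.
// One compile-time switch decides the character type for the whole program.
#ifdef _UNICODE
typedef wchar_t my_tchar;      // wide build: each element is a wchar_t
#define MY_T(s) L##s           // turns "abc" into L"abc"
#else
typedef char my_tchar;         // Ansi/_MBCS build: each element is a char
#define MY_T(s) s              // leaves "abc" unchanged
#endif

// A string type that follows the same switch.
typedef std::basic_string<my_tchar> my_tstring;

// Storage used by the string's elements, in bytes (terminator excluded).
std::size_t bytes_used(const my_tstring& s) {
    return s.size() * sizeof(my_tchar);
}
```

Code written against my_tstring and MY_T("...") compiles unchanged in either mode; only the element size, and therefore bytes_used, changes when _UNICODE is defined.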

    Suggestion: I avoid wchar_t myself, because if you end up linking with other libraries, they have to be using wchar_t also or the link fails.

    Bob

    P.S. If you define the compiler switch _UNICODE, be sure to remove the _MBCS switch, with which it is mutually exclusive.

    P.P.S. Not that I'm an expert, and I'm sure that others could do a better job, but I've agreed to do a "webchat" on the topic of encoding conversion and porting VC++ 6.0 programs to support Unicode. If you're interested, here's the link: http://bhashaindia.com/events/chat/.
      ----- Original Message -----
      From: Vidya Maheshwar Nabar
      To: Shawn Steele ; Dominikus Scherkl
      Cc: unicode@unicode.org
      Sent: Friday, October 28, 2005 2:41 PM
      Subject: RE: wide chars and methods

       

      Hi Shawn/Dominikus,

       

      Thanks for the response.

       

      I don't want to use TCHAR around the code for some reason. I thought I should be using 'wchar_t' for holding Japanese text, but 'char' seemed to work fine. As for wchar_t, I could see Japanese strings only after I made a setlocale( LC_ALL, "" ) call (though I still cannot 'type in' Japanese; when the focus is in the console, the Japanese IME, which is otherwise displayed, disappears), hence the question about wide char/method usability.

      - Why do we need setlocale when using wchar_t to display Japanese strings? Even with this workaround, I'm still unable to enter Japanese strings via console.

      - This distinction doesn't seem to extend to file streams. So, does console i/o differ from disk i/o? I tried reading from/writing to files with Japanese strings using both char and wchar_t (sans setlocale) without any issues.

       

      Regards,

      Vidya.

       

       

       

      -----Original Message-----
      From: Shawn Steele [mailto:Shawn.Steele@microsoft.com]
      Sent: Tuesday, October 25, 2005 11:37 PM
      To: Vidya Maheshwar Nabar; unicode@unicode.org
      Subject: RE: wide chars and methods

       

      Windows 2000 Server is natively Unicode. You cannot "hold all the characters" of that OS in an 8 bit char.

       

      The Windows console is restricted to "ansi" code pages, which is probably why you're seeing the behavior you're seeing. It's strongly recommended that you avoid using ANSI applications and use Unicode instead.

       

      - Shawn

       

      SDE, Microsoft

       

    ------------------------------------------------------------------------------

      From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Vidya Maheshwar Nabar
      Sent: Tuesday, October 25, 2005 2:55 AM
      To: unicode@unicode.org
      Subject: wide chars and methods

       

       

      Hi,

       

      I wanted to know why 'wchar_t' data type is required if a 'char' can very well hold all the characters on a given OS. To elaborate, I run the program below on a Japanese Win 2K Server and pass Japanese strings:

       

      Code Snippet(VC++ 6.0):

      char str[MAX];

      cin >> str;

      cout << str << endl;

       

      Input:

      ません

       

      Output:

      ません

       

      Note: here, input is U+307E,U+305B and U+3093.

       

      The above program runs fine with chars and cin/cout/scanf/printf. In fact, things go weird when I use wchars and wcin/wcout/wscanf/wprintf: it just doesn't output anything!

       

      How/why are cin/cout/scanf/printf able to process Japanese strings on a Japanese machine with a char, while wcin/wcout/wscanf/wprintf with wchar are not? Aren't wchars/wide methods needed for chars beyond the 8-bit range, since char can't handle them? Am I missing something here?

       

      Thanks in advance.

       

      Regards,

      Vidya.

       



    This archive was generated by hypermail 2.1.5 : Sat Oct 29 2005 - 05:11:40 CST