RE: How to print the byte representation of a wchar_t string with non -ASCII ...

From: Tay, William (William.Tay@usa.xerox.com)
Date: Fri Nov 02 2001 - 12:38:19 EST


Dear Unicoders & C gurus,

Thank you for your comments on my previous posting; they helped. While
trying them out on my machine I ran into a question, and I would
appreciate your help.

At the Solaris 2.6 shell prompt, run the program below as follows:
> setenv LC_ALL en_US.UTF-8
> a.out fôó

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char* argv[])
{
   size_t i;
   wchar_t wstr[20];
   char mstr[20];
   char mtemp[20];

   if (argc < 2)
       return 1; // need one string argument

   setlocale(LC_ALL, ""); // char encoding is that of shell, i.e. UTF-8

   // MB: MultiByte; WC: WideChar
   printf("stdin in MB: %s, strlen: %d\n", argv[1], (int)strlen(argv[1]));
   printf("Byte rep: ");
   for (i = 0; i < strlen(argv[1]); i++)
       printf("%02X ", (unsigned char)argv[1][i]); // cast avoids sign extension
   printf("\n\n");

   mbstowcs(wstr, argv[1], 19);
   wstr[19] = L'\0'; // mbstowcs does not terminate when the limit is hit
   printf("stdin in WC: %ls, wcslen: %d\n", wstr, (int)wcslen(wstr));
   // Guess this is the only way to see the byte rep of wstr string
   wcstombs(mstr, wstr, 19);
   mstr[19] = '\0';
   printf("Byte rep: ");
   for (i = 0; i < strlen(mstr); i++)
       printf("%02X ", (unsigned char)mstr[i]);
   printf("\n\n");

   // arrays cannot be assigned to, so copy the literals instead
   wcscpy(wstr, L"fôó");
   strcpy(mstr, "fôó");

   printf("App string in MB: %s, strlen: %d\n", mstr, (int)strlen(mstr));
   printf("Byte rep: ");
   for (i = 0; i < strlen(mstr); i++)
       printf("%02X ", (unsigned char)mstr[i]);
   printf("\n\n");

   printf("App string in WC: %ls, wcslen: %d\n", wstr, (int)wcslen(wstr));
   // Guess this is the only way to see the byte rep of wstr string
   wcstombs(mtemp, wstr, 19);
   mtemp[19] = '\0';
   printf("Byte rep: ");
   for (i = 0; i < strlen(mtemp); i++)
       printf("%02X ", (unsigned char)mtemp[i]);
   printf("\n");

   return 0;
}

Output:

stdin in MB: fôó, strlen: 5
Byte rep: 66 C3 B4 C3 B3

stdin in WC: fôó, wcslen: 3
Byte rep: 66 C3 B4 C3 B3

App string in MB: fôó, strlen: 3
Byte rep: 66 F4 F3

App string in WC: fôó, wcslen: 3
Byte rep: 66 C3 B4 C3 B3
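
On the "only way to see the byte rep of wstr" comment in the program: the
wcstombs() round trip shows the multibyte form, not what is stored in wstr
itself. A minimal sketch of looking at the wchar_t elements directly might
be the following. The helper name wide_units_hex is hypothetical, and what
the values mean (e.g. whether they are Unicode code points) depends on the
implementation.

```c
#include <stdio.h>
#include <wchar.h>

/* Format each wchar_t element of ws as a hex number, separated by spaces.
   Returns 0 on success, -1 if out is too small.
   (Hypothetical helper, not part of any standard library.) */
int wide_units_hex(const wchar_t *ws, char *out, size_t outsz)
{
    size_t used = 0;
    size_t i;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; ws[i] != L'\0'; i++) {
        /* print the element's numeric value, not its in-memory byte order */
        int n = snprintf(out + used, outsz - used, "%s%04lX",
                         i ? " " : "", (unsigned long)ws[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

For a wide string written with explicit escapes, L"f\xF4\xF3", this yields
"0066 00F4 00F3". The actual bytes in memory additionally depend on
sizeof(wchar_t) and on the machine's byte order.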

---------------------

I believe setlocale(LC_ALL, ""); instructs the program to inherit the
encoding of the shell, i.e. UTF-8 in this example. In the 3rd case above,
shouldn't the result be the same as in the 1st, since the string from stdin
and the string defined in the program use the same encoding scheme?
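
One possible explanation, offered as an assumption rather than something
confirmed in this thread: the bytes of a narrow string literal are fixed
when the program is compiled, from the encoding the source file was saved
in, and setlocale() at run time does not re-encode them. The 66 F4 F3 in
the 3rd case would then fit a source file saved as ISO 8859-1. The sketch
below spells the bytes out with escapes so that it does not depend on how
it is saved; the names latin1_bytes, utf8_bytes and bytes_hex are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* The same three characters, with the bytes written out explicitly. */
const char latin1_bytes[] = "f\xF4\xF3";         /* ISO 8859-1 form */
const char utf8_bytes[]   = "f\xC3\xB4\xC3\xB3"; /* UTF-8 form */

/* Format the bytes of s as two-digit hex values separated by spaces.
   Returns 0 on success, -1 if out is too small. */
int bytes_hex(const char *s, char *out, size_t outsz)
{
    size_t used = 0;
    size_t i;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; s[i] != '\0'; i++) {
        /* unsigned char avoids sign extension of bytes >= 0x80 */
        int n = snprintf(out + used, outsz - used, "%s%02X",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

Here strlen(latin1_bytes) is 3 and strlen(utf8_bytes) is 5, whatever
locale is selected at run time, which matches the lengths seen in the 3rd
and 1st cases.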

Will

-----Original Message-----
From: Jungshik Shin [mailto:jshin@mailaps.org]
Sent: Thursday, November 01, 2001 3:11 PM
To: Unicode Mailing List
Subject: Re: How to print the byte representation of a wchar_t string
with non -ASCII ...

DougEwell2@cs.com wrote:

> In a message dated 2001-10-31 10:07:44 Pacific Standard Time,
> drepper@redhat.com writes:

>> This is wrong. wchar_t strings can of course be printed. Reading the
>> ISO C standard would tell you to use
>>
>> printf ("%ls", wstr);
>>
>> can be used to print wchar_t strings which are converted to a byte
>> stream according to the currently selected locale. Eventually it has

> But won't this approach fail as soon as we hit a 0x00 byte (i.e. the
> high 8 bits of any Latin-1 character)?

   I'm not sure what you're alluding to here. As long as all characters
in wstr belong to the repertoire of the encoding/character set of the
current locale (that is, unless one passes wstr containing Chinese
characters to printf() in, say, de_DE.ISO8859-1 locale), there should not
be any problem with using '%ls' to print out wstr with printf(). Of
course, 'printf ("%ls", wstr) ' doesn't achieve what the original question
asked for, but that question has already been answered, hasn't it?
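
  As a small illustration of the '%ls' conversion described above, here is
a sketch that formats a wide string into a char buffer instead of writing
it to stdout. The wrapper name format_wide is hypothetical; snprintf() and
'%ls' are standard C99.

```c
#include <stdio.h>
#include <wchar.h>

/* Convert ws to multibyte text in out via "%ls", using the encoding of
   the current locale. Returns the snprintf() result: the number of bytes
   the full output needs, or a negative value on a conversion or output
   error. (Hypothetical wrapper for illustration.) */
int format_wide(char *out, size_t outsz, const wchar_t *ws)
{
    return snprintf(out, outsz, "%ls", ws);
}
```

With plain ASCII content such as L"foo" this gives "foo" and returns 3 even
in the default "C" locale. For anything else, setlocale() has to select a
locale whose character set covers the characters in ws first.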

  The fprintf() man page in the Single Unix Spec v2 (perhaps I should
look at the actual C standard) doesn't seem to say anything about what to
expect when wstr contains characters outside the repertoire of the
character set of the current locale. wcrtomb() is called for each wide
char in wstr when '%ls' is used to print out wstr. According to the
wcrtomb() man page, errno is set to EILSEQ if an invalid wide char is
given to it, but it's not clear whether 'invalid wide char' in that man
page includes valid wide chars which are NOT convertible to the encoding
of the current locale.
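
  One way to find out what a given implementation does is simply to try
it. The sketch below (try_convert is a hypothetical name) feeds a single
wide char to wcrtomb() under the current locale and reports the errno
value, if any. It shows only how to run the experiment, not what the
standard requires.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Try to convert one wide char under the current locale.
   Returns 0 on success, or the errno value set by wcrtomb() on failure.
   (Hypothetical helper for experimenting.) */
int try_convert(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);   /* start from the initial conversion state */
    errno = 0;
    if (wcrtomb(buf, wc, &st) == (size_t)-1)
        return errno;
    return 0;
}
```

On the implementations I have seen, try_convert((wchar_t)0x4E2D) in the
default "C" locale returns EILSEQ, which suggests that unconvertible but
otherwise valid wide chars are treated as invalid there. Other systems may
behave differently.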

  Jungshik Shin


