Unicode Mail List Archive: RE: What's in a wchar

RE: What's in a wchar_t string on unix?

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Mar 02 2004 - 11:18:55 EST

Next message: Frank Yung-Fong Tang: "Re: What's in a wchar_t string on unix?"

Previous message: 100272 (Harish Ramachandra Reddy): "Help needed ............."
In reply to: Rick Cameron: "RE: What's in a wchar_t string on unix?"
Next in thread: Antoine Leca: "Re: What's in a wchar_t string on unix?"

Rick Cameron wrote on 3/1/2004, 4:59 PM:

OK, I guess I need to be more precise in my question.

For each of the popular unices (Solaris, HP-UX, AIX, and - if possible - linux), can anyone answer the following question:

Assuming that the locale is set to Unicode, what is in a wchar_t string? Is it UTF-32 or pseudo-UTF-16 (i.e. UTF-16 code units, zero-extended to 32 bits)?

Basically, the answer is very simple: the value is something you "should not know". Why?

One important principle of object-oriented design is encapsulation. wchar_t is basically an encapsulated data type: the caller should interact with it only through the defined public functions, without assuming or knowing what it holds. The publicly defined functions include the following:

size_t    mbstowcs(wchar_t *, const char *, size_t);
int       mbtowc(wchar_t *, const char *, size_t);
size_t    wcstombs(char *, const wchar_t *, size_t);
int       wctomb(char *, wchar_t);

and also those functions listed in
http://www.opengroup.org/onlinepubs/007908799/xsh/wchar.h.html
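
As a rough illustration of treating wchar_t as opaque, the sketch below converts a multibyte string to wide characters and back using only mbstowcs and wcstombs, never looking at the wide values themselves. The function name roundtrip_ok and the buffer sizes are assumptions made for this example, and it runs in whatever locale is current.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Round-trips a multibyte string through wchar_t using only the
 * standard conversion functions; the wide values are never inspected.
 * Returns 0 if the text survives unchanged, -1 on any failure.
 * (Hypothetical helper; the buffer sizes are arbitrary.) */
int roundtrip_ok(const char *text)
{
    wchar_t wide[64];
    char back[256];

    assert(text != NULL);

    /* Multibyte -> wide. A result equal to the capacity means the
     * output was not null-terminated, so treat it as a failure. */
    size_t n = mbstowcs(wide, text, 64);
    if (n == (size_t)-1 || n >= 64)
        return -1;

    /* Wide -> multibyte, with the same termination check. */
    size_t m = wcstombs(back, wide, sizeof back);
    if (m == (size_t)-1 || m >= sizeof back)
        return -1;

    return strcmp(text, back) == 0 ? 0 : -1;
}
```

Calling roundtrip_ok("hello") should return 0 in any locale whose multibyte encoding includes ASCII.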

Asking "what is in a wchar_t string" is like asking "What does priv_var mean in

class myclass {
public:
    // ...
private:
    int priv_var;
};
" on behalf of a caller who wants to use myclass.

I'm not expecting that there's single answer for all the unices of interest.

There is one single answer: "Developers, except those who write the compiler and the C library, should NOT know what is in it".

And I'm well aware that our application can store in a wchar_t [] whatever it wants.

NO, that is not true. An "application" cannot store whatever it wants in a wchar_t[]. The ANSI C standard basically says that the "compiler vendor", or the "OS vendor who also ships the compiler" (which converts L"" literals into wchar_t and implements the library functions above), can store whatever it wants in a wchar_t[]. That does not mean the "application developer" can do so, because the application developer has no control over how L"String" is converted into wchar_t and no control over how those wchar_t functions are implemented.

I'm trying to find out what the O/S expects to be in a wchar_t string.

The OS expects a wchar_t to store the values generated by mbstowcs or mbtowc.
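
For the per-character function, a minimal sketch might look like the following. It decodes a multibyte string one character at a time with mbtowc in the current locale and only counts the results. The name count_chars is an assumption for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts the characters in a multibyte string by decoding it one
 * character at a time with mbtowc() in the current locale.
 * Returns -1 if an invalid sequence is found.
 * (Hypothetical helper for illustration.) */
long count_chars(const char *s)
{
    assert(s != NULL);

    size_t left = strlen(s);
    long count = 0;

    mbtowc(NULL, NULL, 0);  /* reset any shift state */
    while (left > 0) {
        wchar_t wc;
        int len = mbtowc(&wc, s, left);
        if (len < 0)
            return -1;      /* invalid multibyte sequence */
        if (len == 0)
            break;          /* decoded a null character */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```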

The reason we want to know this is that we want to be able to write a function that converts from UTF-8 (stored in a char []) to wchar_t [] properly. Obviously the function may need to behave differently on different flavours of unix.

1. save your current locale
2. setlocale to a UTF-8 locale
3. call mbstowcs to convert the data into wchar_t*
4. restore the locale back to your saved locale
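
The four steps above could be sketched roughly as follows. The function name utf8_to_wcs and the locale names "en_US.UTF-8" and "C.UTF-8" are assumptions; the UTF-8 locales actually installed differ from system to system, so the names would need adjusting.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Converts UTF-8 text in src to wide characters in dst, whose capacity
 * dst_len includes the terminator. Returns the number of wide characters
 * written, or (size_t)-1 if no UTF-8 locale is available, the input is
 * invalid, or the buffer is too small. (Hypothetical helper.) */
size_t utf8_to_wcs(wchar_t *dst, const char *src, size_t dst_len)
{
    /* Assumed locale names; adjust for the target system. */
    static const char *const candidates[] = { "en_US.UTF-8", "C.UTF-8" };
    size_t n = (size_t)-1;
    int have_utf8 = 0;

    assert(dst != NULL && src != NULL);

    /* 1. Save the current LC_CTYPE locale. setlocale() returns storage
     *    that later calls may overwrite, so copy the name. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return (size_t)-1;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return (size_t)-1;
    strcpy(saved, cur);

    /* 2. Switch to the first UTF-8 locale that is available. */
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            have_utf8 = 1;
            break;
        }
    }

    /* 3. Convert, rejecting a result that is not null-terminated. */
    if (have_utf8) {
        n = mbstowcs(dst, src, dst_len);
        if (n != (size_t)-1 && n >= dst_len)
            n = (size_t)-1;
    }

    /* 4. Restore the saved locale. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return n;
}
```

Note that setlocale changes the locale for the whole process, so this sketch is not safe to call while other threads depend on the locale.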

I am aware of the utility functions offered by TUC to perform conversions between UTF-8, UTF-16 and UTF-32. These functions do not handle the case of pseudo-UTF-16, which doesn't surprise me, since AFAIK it's not a conformant encoding form. Nonetheless, I have a strong suspicion that some unices may use it.

Cheers

- rick cameron

From: Frank Yung-Fong Tang [mailto:ytang0648@aol.com]
Sent: March 1, 2004 12:48
To: Rick Cameron
Cc: unicode@unicode.org
Subject: Re: What's in a wchar_t string on unix?


Rick Cameron wrote on 3/1/2004, 2:13 PM:

Hi, all

This may be an FAQ, but I couldn't find the answer on unicode.org.

The reason is that there is "NO answer" to the question you ask.

It seems that most flavours of unix define wchar_t to be 4 bytes.

It depends on which UNIX and which version, and on how you define "most flavours".

If the locale is set to be Unicode, what's in a wchar_t string?

There is no answer to that, because
1) The ANSI C standard does not define it (neither its size nor its content).
2) Several organizations have tried to establish standards for Unix. One of them is The Open Group's "Base Specifications", IEEE Std 1003.1, 2003. But that does not define what wchar_t should hold either.
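
One way to see what a particular implementation does disclose is sketched below. It prints sizeof(wchar_t) and checks for __STDC_ISO_10646__, a macro that C99 lets an implementation define when its wchar_t values are ISO 10646 code points. The function names are assumptions for this example, and an implementation that does not define the macro promises nothing about the contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the implementation promises, via the C99 macro
 * __STDC_ISO_10646__, that wchar_t values are ISO 10646 code points,
 * and 0 if it makes no such promise. (Hypothetical helper.) */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Writes what the implementation discloses about wchar_t to out. */
void report_wchar(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    fprintf(out, "__STDC_ISO_10646__ is %s\n",
            wchar_is_iso10646() ? "defined" : "not defined");
}
```

Calling report_wchar(stdout) on each target platform shows how much the answer varies.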

Is it UTF-32, or UTF-16 with the code units zero-extended to 4 bytes?

Cheers

- rick cameron

The more interesting question is why you need to know the answer to your question. The ANSI C wchar_t model basically suggests that if you ask that question, you are moving in the wrong direction....


This archive was generated by hypermail 2.1.5 : Tue Mar 02 2004 - 11:59:04 EST