RE: What's in a wchar_t string on unix?

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Mar 02 2004 - 11:18:55 EST



    Rick Cameron wrote on 3/1/2004, 4:59 PM:

    OK, I guess I need to be more precise in my question.
     
    For each of the popular unices (Solaris, HP-UX, AIX, and - if possible - linux), can anyone answer the following question:
     
    Assuming that the locale is set to Unicode, what is in a wchar_t string? Is it UTF-32 or pseudo-UTF-16 (i.e. UTF-16 code units, zero-extended to 32 bits)?
    Basically, the answer is very simple: the value is something you "should not know". Why?

    One important principle of object-oriented design is encapsulation. wchar_t is basically an encapsulated data type: the caller should interact with it only through the defined public functions, without assuming or knowing what it contains. The public functions include the following:
    size_t    mbstowcs(wchar_t *, const char *, size_t);
    int mbtowc(wchar_t *, const char *, size_t);
    size_t wcstombs(char *, const wchar_t *, size_t);
    int wctomb(char *, wchar_t);

    and also those functions listed in
    http://www.opengroup.org/onlinepubs/007908799/xsh/wchar.h.html
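
    A minimal sketch of what working only through these functions could look like: the hypothetical helper roundtrip_ok (not part of any library) converts a multibyte string to wchar_t and back with mbstowcs and wcstombs, and never looks at the wchar_t values themselves. The buffer sizes are arbitrary assumptions.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: round-trip a multibyte string through wchar_t
 * using only the standard conversion functions, in the current locale.
 * Returns 0 if the round trip reproduces the input, -1 otherwise. */
int roundtrip_ok(const char *mb)
{
    wchar_t wide[64];   /* arbitrary size, an assumption of this sketch */
    char back[256];     /* arbitrary size, an assumption of this sketch */

    /* Multibyte -> wide. The caller never inspects wide[] directly. */
    size_t n = mbstowcs(wide, mb, 64);
    if (n == (size_t)-1 || n >= 64)
        return -1;      /* invalid sequence, or no room for the terminator */

    /* Wide -> multibyte. */
    size_t m = wcstombs(back, wide, sizeof back);
    if (m == (size_t)-1 || m >= sizeof back)
        return -1;      /* unconvertible character, or output truncated */

    return strcmp(mb, back) == 0 ? 0 : -1;
}
```

    In the default "C" locale, roundtrip_ok("hello") returns 0. What the intermediate wchar_t values were never matters to the caller.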

    Asking "what is in a wchar_t string" is like asking "What does priv_var mean in

    class myclass {
    ....
       private:
          int priv_var;
    };
    " on behalf of a caller who only wants to use myclass.
     
    I'm not expecting that there's single answer for all the unices of interest.
    There is one single answer: "Developers, except those who write the compiler and the C library, should NOT know what it is".
    And I'm well aware that our application can store in a wchar_t [] whatever it wants.
    No, that is not true. An application cannot store whatever it wants in a wchar_t[]. The ANSI C standard basically says that the compiler vendor, or the OS vendor who also ships the compiler (which converts L"" literals into wchar_t and implements the library functions above), can store whatever it wants in a wchar_t[]. That does not mean the application developer can do the same, because the application developer has no control over how L"String" is converted into wchar_t and no control over how those wchar_t functions are implemented.
    I'm trying to find out what the O/S expects to be in a wchar_t string.
    The OS expects a wchar_t to hold the values generated by mbstowcs or mbtowc.
     
    The reason we want to know this is that we want to be able to write a function that converts from UTF-8 (stored in a char []) to wchar_t [] properly. Obviously the function may need to behave differently on different flavours of unix.
    1. save your current locale
    2. setlocale to a UTF-8 locale
    3. call mbstowcs to convert the data into wchar_t*
    4. restore the locale back to your saved locale
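
    The four steps above can be sketched roughly as follows. utf8_to_wcs is a hypothetical helper, and the UTF-8 locale name passed to it (for example "en_US.UTF-8") is an assumption, since locale names differ between systems. The sketch switches only LC_CTYPE, the category mbstowcs depends on, and it is not thread-safe because setlocale changes process-wide state.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert UTF-8 in src to wchar_t in dst by
 * temporarily switching LC_CTYPE to a UTF-8 locale.
 * utf8_locale is an assumed, system-specific name such as "en_US.UTF-8".
 * Returns the number of wide characters written (excluding the
 * terminator), or (size_t)-1 on any failure. */
size_t utf8_to_wcs(wchar_t *dst, const char *src, size_t dstlen,
                   const char *utf8_locale)
{
    size_t result = (size_t)-1;

    /* 1. Save the current locale. The string returned by setlocale may
     *    be overwritten by later calls, so keep a private copy. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return result;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return result;
    strcpy(saved, cur);

    /* 2. Switch to the UTF-8 locale; fails if it is not installed. */
    if (setlocale(LC_CTYPE, utf8_locale) != NULL) {
        /* 3. Let the C library do the conversion. */
        result = mbstowcs(dst, src, dstlen);

        /* 4. Restore the saved locale. */
        setlocale(LC_CTYPE, saved);
    }

    free(saved);
    return result;
}
```

    A caller should treat (size_t)-1 as "either the locale is unavailable or the input is not valid UTF-8", and should note that a return value equal to dstlen means the output was not null-terminated.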
     
    I am aware of the utility functions offered by TUC to perform conversions between UTF-8, UTF-16 and UTF-32. These functions do not handle the case of pseudo-UTF-16; which doesn't surprise me, since AFAIK it's not a conformant encoding form. Nonetheless, I have a strong suspicion that some unices may use it.
     
    Cheers
     
    - rick cameron


    From: Frank Yung-Fong Tang [mailto:ytang0648@aol.com]
    Sent: March 1, 2004 12:48
    To: Rick Cameron
    Cc: unicode@unicode.org
    Subject: Re: What's in a wchar_t string on unix?

    I

    Rick Cameron wrote on 3/1/2004, 2:13 PM:

    Hi, all

    This may be an FAQ, but I couldn't find the answer on unicode.org.

    The reason is that there is no answer to the question you are asking.

    It seems that most flavours of unix define wchar_t to be 4 bytes.

    That depends on which UNIX and which version, and on how you define "most flavours".

    If the locale is set to be Unicode, what's in a wchar_t string?

    There is no answer to that, because
    1) The ANSI C standard does not define it (neither its size nor its content).
    2) Several organizations have tried to establish standards for Unix. One of them is The Open Group's "Base Specifications", IEEE Std 1003.1, 2003, but that does not define what wchar_t should hold either.
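
    Because the size is left to the implementation, a program can at most ask the compiler what it chose. A small sketch, with wchar_bits as a hypothetical helper name; it reports the width only and says nothing about what the values mean.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this implementation.
 * The result can differ between compilers and platforms. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

    Printing wchar_bits() on two different systems is a quick way to see that the answer is not fixed by the standard.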

    Is it UTF-32, or UTF-16 with the code units zero-extended to 4 bytes?

    Cheers

    - rick cameron

    The more interesting question is why you need to know the answer to your question. The ANSI C wchar_t model basically suggests that if you are asking that question, you are moving in the wrong direction....






