Re: UNICODE version of _T(x) macro

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 22 2010 - 13:34:29 CST


    Somya asked:

    > I have a Unicode C application. I am using the following macro
    > to define my strings
    > as 2-byte-wide characters.
    >
    > #ifdef UNICODE
    > #define _T(x) L##x
    >
    > But I see that the GCC compiler maps 'L' to wchar_t, which is 4 bytes on Linux. I
    > have used the -fshort-wchar option
    > on Linux, but I want my application to be portable to AIX as
    > well, which does not
    > have this option. I am not able
    > to find the best way to define a UNICODE version of _T(x) that always uses
    > 2-byte-wide characters.

    > Given this, what is the best way to define the UNICODE version of the _T(x) macro so
    > that my strings will always use
    > 2-byte-wide characters?

    Well, some may disagree with me, but my first advice would be
    to avoid macros like that altogether. And second, to absolutely
    avoid any use of wchar_t in the context of processing Unicode
    characters and strings.

    If you are working with C compilers that support the C99 standard,
    you can instead make use of the stdint.h exact-width integer
    types. And then you should *typedef* Unicode code unit types
    to those exact-width integer types.

    uint8_t <-- typedef your UTF-8 code unit type to this

    uint16_t <-- typedef your UTF-16 code unit type to this

    uint32_t <-- typedef your UTF-32 code unit type to this

    See:

    http://en.wikipedia.org/wiki/Stdint.h
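
    As a minimal sketch of what I mean, assuming a C99 compiler: the
    type names utf8_t, utf16_t, and utf32_t below are hypothetical,
    so pick whatever names fit your code base.

```c
#include <stdint.h>

/* Hypothetical type names, chosen only for illustration. */
typedef uint8_t  utf8_t;   /* one UTF-8 code unit  (exactly 8 bits)  */
typedef uint16_t utf16_t;  /* one UTF-16 code unit (exactly 16 bits) */
typedef uint32_t utf32_t;  /* one UTF-32 code unit (exactly 32 bits) */
```

    Unlike wchar_t, these have the same width on Linux, AIX, and
    Windows, with no compiler option required.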

    If you need to cross-compile on platforms that don't support
    the C99 types, then you can probably get away with:

    unsigned char

    unsigned short

    unsigned int

    which resolve to 8-bit, 16-bit, and 32-bit
    types, respectively, on most mainstream platforms. The C standard
    guarantees only minimum widths for these types, so verify the
    sizes on each target.
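
    One way to sketch that fallback so that a wrong guess fails at
    compile time rather than at run time. The type names and the
    negative-array-size trick are my own illustration, not something
    any particular compiler requires:

```c
#include <limits.h>

/* Hypothetical type names for a pre-C99 fallback. */
typedef unsigned char  utf8_t;
typedef unsigned short utf16_t;
typedef unsigned int   utf32_t;

/* Compile-time checks: if a type does not have the expected range,
   the array size becomes -1 and compilation stops with an error. */
typedef char utf8_is_8_bits  [(UCHAR_MAX == 0xFF)         ? 1 : -1];
typedef char utf16_is_16_bits[(USHRT_MAX == 0xFFFF)       ? 1 : -1];
typedef char utf32_is_32_bits[(UINT_MAX  == 0xFFFFFFFFUL) ? 1 : -1];
```

    If one of these checks fails on some platform, substitute a
    different basic type (for example unsigned long) in that
    platform's typedef.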

    Once you have your 3 fixed-width code unit typedefs in hand,
    do all of your Unicode character and string processing using
    those types.
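
    For string constants, which is what the _T(x) macro was being
    used for, one portable option is to spell out the code units in
    an array initializer. This is only a sketch: the names utf16_t,
    greeting, and utf16_len are hypothetical, and you may prefer to
    load such strings from resource files instead.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* hypothetical UTF-16 code unit type */

/* "Hello" with U+00E9 (e with acute) in place of the 'e',
   written out as NUL-terminated UTF-16 code units. */
static const utf16_t greeting[] = {
    0x0048, 0x00E9, 0x006C, 0x006C, 0x006F, 0x0000
};

/* Count code units up to, but not including, the terminating 0. */
static size_t utf16_len(const utf16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

    The array is 16 bits per element no matter how wide wchar_t is
    on the platform.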

    When you are making use of other Unicode libraries, the libraries
    often have these typedefs already defined for you. Thus, for
    example, ICU has typedefs for UChar (an unsigned 16-bit integer)
    and UChar32 (as a signed 32-bit integer). [The choice between
    a signed or unsigned 32-bit integer has to do with library
    design choices, but in all cases the valid 32-bit values
    for Unicode characters are in the positive range 0..0x10FFFF.]

    See:

    http://userguide.icu-project.org/strings

    Once you have your code set up to use typedefs like this for
    your Unicode characters and strings, read, understand, and
    follow the rules for the UTF-8, UTF-16, and UTF-32 encoding
    forms, as documented in Section 3.9, Unicode Encoding Forms,
    of the Unicode Standard:

    http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

    and your Unicode string handling should then be correct
    and conformant.
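
    To give a flavor of what following those rules looks like, here
    is a sketch of converting one code point to UTF-16 as the encoding
    form is defined in Section 3.9. The function name utf16_encode
    and the type names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* hypothetical UTF-16 code unit type */
typedef uint32_t utf32_t;  /* hypothetical UTF-32 code unit type */

/* Encode one code point as UTF-16.
   Returns the number of code units written to out (1 or 2), or 0 if
   cp is not a Unicode scalar value (a surrogate code point, or a
   value above 0x10FFFF). */
static size_t utf16_encode(utf32_t cp, utf16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (utf16_t)cp;              /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                         /* leaves a 20-bit value */
    out[0] = (utf16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (utf16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

    For example, U+10437 comes out as the surrogate pair
    0xD801 0xDC37, while U+0041 stays a single code unit.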

    --Ken
