From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 22 2010 - 13:34:29 CST
Somya asked:
> I have a Unicode C application. I am using the following macro
> to define my strings with 2-byte-wide characters:
>
> #ifdef UNICODE
> #define _T(x) L##x
>
> But I see that GCC maps L"..." literals to wchar_t, which is 4 bytes on
> Linux. I have used the -fshort-wchar option on Linux, but I want my
> application to be portable to AIX as well, which does not have this
> option. Given that, what is the best way to define the UNICODE version
> of _T(x), so that my strings always use 2-byte-wide characters?
Well, some may disagree with me, but my first advice would be
to avoid macros like that altogether. And second, to absolutely
avoid any use of wchar_t in the context of processing Unicode
characters and strings.
If you are working with C compilers that support the C99 standard,
you can instead make use of the stdint.h exact-width integer
types. And then you should *typedef* Unicode code unit types
to those exact-width integer types.
uint8_t <-- typedef your UTF-8 code unit type to this
uint16_t <-- typedef your UTF-16 code unit type to this
uint32_t <-- typedef your UTF-32 code unit type to this
See:
http://en.wikipedia.org/wiki/Stdint.h
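For instance, a minimal sketch of such typedefs (the type names here
are just illustrative, not from any standard):

  #include <stdint.h>

  typedef uint8_t  UTF8;   /* UTF-8 code unit  */
  typedef uint16_t UTF16;  /* UTF-16 code unit */
  typedef uint32_t UTF32;  /* UTF-32 code unit */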
If you need to cross-compile on platforms that don't support
the C99 types, then you can probably get away with:
unsigned char
unsigned short
unsigned int
which normally resolve to 8-bit, 16-bit, and 32-bit types,
respectively, on the platforms you are likely to encounter (though
it is worth verifying that at compile time, as sketched below).
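If you go that route, a sketch of such a check, using a common
pre-C11 static-assert idiom (this also assumes 8-bit chars, i.e.
CHAR_BIT == 8):

  /* Each typedef fails to compile (negative array size) if the
     corresponding type does not have the width we are counting on. */
  typedef char assert_char_is_8  [(sizeof(unsigned char)  == 1) ? 1 : -1];
  typedef char assert_short_is_16[(sizeof(unsigned short) == 2) ? 1 : -1];
  typedef char assert_int_is_32  [(sizeof(unsigned int)   == 4) ? 1 : -1];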
Once you have your 3 fixed-width code unit typedefs in hand,
do all of your Unicode character and string processing using
those types.
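In particular, a 16-bit string literal then no longer needs the L
prefix at all. A sketch using the UTF16 typedef above (some newer
compilers also offer u"..." literals, from the C1x/C++0x drafts,
for exactly this purpose):

  /* "Hello", spelled out as an array of UTF-16 code units */
  static const UTF16 hello[] =
      { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x0000 };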
When you are making use of other Unicode libraries, the libraries
often have these typedefs already defined for you. Thus, for
example, ICU has typedefs for UChar (an unsigned 16-bit integer)
and UChar32 (a signed 32-bit integer). [The choice between
a signed or unsigned 32-bit integer has to do with library
design choices, but in all cases the valid 32-bit values
for Unicode characters are in the positive range 0..0x10FFFF.]
See:
http://userguide.icu-project.org/strings
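To give the flavor, here is a sketch of walking a UChar string by
code point with ICU's U16_NEXT macro (the function name and the
printf handling are mine, not ICU's):

  #include <stdio.h>
  #include <unicode/utypes.h>   /* UChar, UChar32 */
  #include <unicode/ustring.h>  /* u_strlen */
  #include <unicode/utf16.h>    /* U16_NEXT */

  void dump_code_points(const UChar *s) {
      int32_t i = 0;
      int32_t length = u_strlen(s);  /* s must be NUL-terminated */
      while (i < length) {
          UChar32 c;
          U16_NEXT(s, i, length, c); /* decodes surrogate pairs */
          printf("U+%04X\n", (unsigned)c);
      }
  }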
Once you have your code set up to use typedefs like this for
your Unicode characters and strings, read, understand, and
follow the rules for the UTF-8, UTF-16, and UTF-32 encoding
forms, as documented in Section 3.9, Unicode Encoding Forms,
of the Unicode Standard:
http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
and your Unicode string handling should then be correct
and conformant.
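As an illustration of what those rules amount to in code, here is a
sketch of one step of UTF-16 decoding, using the UTF16/UTF32
typedefs from above (error handling is reduced to returning U+FFFD
for an unpaired surrogate; a production decoder would follow the
ill-formed-sequence rules of Section 3.9 in full):

  UTF32 utf16_next(const UTF16 **pp, const UTF16 *end) {
      const UTF16 *p = *pp;
      UTF32 c = *p++;
      if (c >= 0xD800 && c <= 0xDBFF &&
          p < end && *p >= 0xDC00 && *p <= 0xDFFF) {
          /* high surrogate + low surrogate -> supplementary code point */
          c = 0x10000 + ((c - 0xD800) << 10) + (*p++ - 0xDC00);
      } else if (c >= 0xD800 && c <= 0xDFFF) {
          c = 0xFFFD; /* unpaired surrogate: not a valid code point */
      }
      *pp = p;
      return c;
  }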
--Ken