Re: UNICODE version of _T(x) macro

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Nov 22 2010 - 13:08:37 CST


    On 11/22/2010 10:18 AM, Phillips, Addison wrote:
    >> sowmya satyanarayana<sowmya underscore satyanarayana at yahoo dot
    >> com>
    >> wrote:
    >>
    >>> Taking this, what is the best way to define _T(x) macro of
    >> UNICODE version, so
    >>> that my strings will always be
    >>> 2 byte wide character?
    >> Unicode characters aren't always 2 bytes wide. Characters with
    >> values
    >> of U+10000 and greater take two UTF-16 code units, and are thus 4
    >> bytes
    >> wide in UTF-16.
    >>
    > Not exactly. The code units for UTF-16 are always 16-bits wide. Supplementary characters (those with code points >= U+10000) use a surrogate pair, which are two 16-bit code units. Most processing and string traversal is in terms of the 16-bit code units, with a special case for the surrogate pairs.
    >
    > It is very useful when discussing Unicode character encoding forms to distinguish between characters ("code points") and their in memory representation ("code units"), rather than using non-specific terminology such as "character".
    >
    > If you want to use UTF-32, which uses 32-bit code units, one per code point, you can use a 32-bit data type instead. Those are always 4 bytes wide.
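
    As a concrete illustration of the code point / code unit distinction,
    here is a small C sketch. It assumes a compiler, such as Microsoft's,
    whose wide literals use 16-bit UTF-16 code units; the particular
    character chosen is arbitrary.

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* U+1D11E MUSICAL SYMBOL G CLEF is a supplementary character:
           one code point, but two UTF-16 code units (a surrogate pair),
           spelled out here as the explicit pair. */
        const wchar_t *clef = L"\xD834\xDD1E";

        printf("code units: %u\n", (unsigned)wcslen(clef));      /* 2 */
        printf("bytes per code unit: %u\n",
               (unsigned)sizeof(wchar_t));  /* 2 on Windows, typically 4
                                               on Unix-like systems */
        return 0;
    }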

    The question is relevant to the C and C++ languages.

    What is asked: which native data type do I use to make sure I end up
    with 16-bit code units.

    The usual way a _T macro is used is

    TCHAR c = _T('x');
    TCHAR * s = _T("x");

    that is, to wrap a string or character literal so that it can be used
    either as a Unicode literal or as a non-Unicode literal, depending on
    whether some global compile-time flag (usually UNICODE or _UNICODE) is
    set or not.

    The usual way a _T macro is defined is something like:

    #ifdef UNICODE
    #define _T(x) L##x
    #else
    #define _T(x) x
    #endif

    That definition relies on the compiler implementing L'x' or L"string"
    literals with UTF-16 code units.
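
    Put together, a minimal self-contained sketch looks like the
    following. The TCHAR typedef mirrors the Windows <tchar.h>
    convention but is written out here for illustration, not quoted
    from that header; the size check shows that the macro by itself
    guarantees nothing about the width of the code units, since that
    depends entirely on the compiler's wchar_t.

    #include <stdio.h>
    #include <wchar.h>

    #ifdef UNICODE
    typedef wchar_t TCHAR;    /* 16-bit UTF-16 code unit on Windows,
                                 but 32 bits with most Unix compilers */
    #define _T(x) L##x
    #else
    typedef char TCHAR;
    #define _T(x) x
    #endif

    int main(void)
    {
        const TCHAR *s = _T("x");
        (void)s;
        printf("sizeof(TCHAR) = %u\n", (unsigned)sizeof(TCHAR));
        return 0;
    }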

    A few years ago, there was a proposal to amend the C standard with a
    cross-platform way to guarantee that this is the case. I can't
    recall offhand what became of it.
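
    For what it's worth, the C1X and C++0x drafts do provide a facility
    along those lines: the char16_t type with u"..." literals, whose
    code units are 16 bits regardless of the platform's wchar_t.
    Whether that is the same proposal, I couldn't say. A quick sketch,
    assuming a compiler that already supports it:

    #include <stdio.h>
    #include <uchar.h>   /* char16_t in C; a built-in type in C++ */

    int main(void)
    {
        /* u"..." literals use 16-bit code units independent of wchar_t. */
        const char16_t *s = u"x";
        (void)s;
        printf("sizeof(char16_t) = %u\n",
               (unsigned)sizeof(char16_t));  /* 2 on typical platforms */
        return 0;
    }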

    A./


