From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 10 2004 - 08:05:19 EST
From: "Deepak Chand Rathore" <deepakr@aztec.soft.net>
> Hi all,
>
> The compiler's internal encoding might affect the encoding of the hardcoded
> literals within a source file.
> As a result, after compilation we might interpret wrong characters.
> If we have hardcoded only ASCII literals within the program (source file)
> and left the compiler encoding at its default,
> is there any possibility that, after compilation, the encoding of literals
> in the object file produced gets affected?
> (As far as I know, almost all compilers' default encoding (default locale "C"
> in C++) is ASCII-compatible.)
> I am referring to this problem with respect to C++.
> Are there any other issues related to this subject, or any useful links?
In C/C++, a char (or wchar_t) variable can also be used as an integer type.
This means that when you initialize it with an integer constant (or with a
character constant written as '\xHH' or '\ooo'), the compiler is not allowed
to store anything other than that exact numeric value.
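For example, here is a minimal sketch (assuming an 8-bit char) of that guarantee:

    #include <cstdio>

    int main() {
        char c1 = 0x41;    /* integer constant: stored value is 0x41, whatever the charsets */
        char c2 = '\x41';  /* hex escape: also exactly 0x41, no charset conversion applied */
        std::printf("%d %d\n", c1, c2);  /* prints "65 65" with any conforming compiler */
        return 0;
    }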
However, things are different for character or string constants (including
symbolic constants like '\n', but excluding occurrences of '\xHH' and '\ooo'
in strings): they are interpreted and compiled into integers using a
compile-time conversion from the source charset to the run-time charset
(neither of which is necessarily ASCII-based). Such literals are symbolic:
they represent unspecified integer values.
If you must have exact numeric identity at run-time (independently of the
source or run-time charset) for strings, then use constant arrays of integer
values, or strings encoded entirely with '\xNN' or '\ooo' escapes, instead
of symbolic literals like "Abcd" or L"ABCD" (i.e. encode them as
"\x41\x62\x63\x64" or L"\x41\x62\x63\x64").
Note that the '\uHHHH' and '\UHHHHHHHH' notations use a compile-time
conversion from the Unicode charset to the run-time charset. So there's no
guarantee that the following source-code assertions will be TRUE:
* ('\u0041' == 0x41) may be false
  if the runtime charset (as specified or inferred at compile-time) is
  EBCDIC, for example;
* (L'\U00000041' == 0x41) may be false,
  for the same reason.
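Here is a tiny run-time illustration of the first assertion (note that some
compilers reject a universal-character-name that designates a basic character
such as A inside a literal, so this sketch may not compile everywhere):

    #include <cstdio>

    int main() {
        /* Prints "true" on an ASCII-based run-time charset; with an EBCDIC
           run-time charset it could print "false" instead. */
        std::printf("%s\n", ('\u0041' == 0x41) ? "true" : "false");
        return 0;
    }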
Note that a C/C++ compiler may support the '\uHHHH' or '\UHHHHHHHH' symbolic
notations but may still refuse to compile them because of a conversion error
(a missing mapping) from Unicode to the runtime charset. This happens, for
example, on Windows when not compiling for UNICODE, with the symbolic
literals '\u0080' or L'\U00000080': they unambiguously designate the first
C1 control character by its Unicode hexadecimal code point, but that character
does not exist in the runtime ANSI or OEM charset (the runtime charset being
selected by a compiler option or by compiler-specific pragmas). So the
following may be FALSE:
* ('\u0041' == '\x41') may be false
  if the runtime charset (as specified or inferred at compile-time) is
  EBCDIC, for example;
* (L'\U00000041' == '\x41') may be false,
  for the same reason.
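As a purely hypothetical sketch of that Windows case (whether each line
compiles, and which value you get, depends entirely on the compiler and its
charset options):

    #include <cstdio>

    int main() {
        wchar_t wide = L'\u0080';     /* fine when wchar_t holds Unicode code units (e.g. UTF-16) */
        /* char narrow = '\u0080'; */ /* may be rejected at compile-time: no mapping
                                         from U+0080 to the ANSI/OEM run-time charset */
        std::printf("%d\n", (int)wide);  /* typically prints 128 on such a compiler */
        return 0;
    }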
And the following source-code assertions may be FALSE depending on compiler
capabilities and compilation options or pragmas (the source and runtime
charsets play no role here):
* ('\xFF' == 0xFF) and ('\377' == 0377) may be false
  if the plain char type is signed by default;
* ('\xFF' == -1) and ('\377' == -1) may be false
  if the plain char type is unsigned by default AND has no more than 8 bits;
* (L'\xFF' == 0xFF) and (L'\377' == 0377) may be false
  if the wchar_t type is signed AND has no more than 8 bits;
* (L'\xFF' == -1) and (L'\377' == -1) may be false
  if the wchar_t type is unsigned.
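A small probe like the following (just a sketch, nothing compiler-specific
assumed) shows which way your own implementation goes:

    #include <cstdio>

    int main() {
        /* With a signed 8-bit char, '\xFF' is -1, so the first line prints "false"
           and the second prints "true"; with an unsigned char it is the reverse. */
        std::printf("'\\xFF' == 0xFF : %s\n", ('\xFF' == 0xFF)  ? "true" : "false");
        std::printf("'\\xFF' == -1   : %s\n", ('\xFF' == -1)    ? "true" : "false");
        std::printf("L'\\xFF' == 0xFF: %s\n", (L'\xFF' == 0xFF) ? "true" : "false");
        return 0;
    }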
But the following source-code assertions will all be TRUE (again, the source
and runtime charsets play no role here):
* ('\x41' == 0x41) and (L'\x41' == 0x41) are true and will compile, if
the char datatype is at least 7 bits.
* ('\177' == 0177) and (L'\177' == 0177) are true and will compile, if
the char datatype is at least 8 bits.
* ('\0' == 0) and (L'\0' == 0) are always true and will always compile.
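If you want the compiler itself to verify such guaranteed relations, a crude
sketch (the typedef names are just placeholders) is the old negative-array-size
trick:

    /* Each typedef compiles only if the assertion inside it is true:
       a false assertion gives the array a negative size and compilation fails. */
    typedef char assert_nul_is_zero [('\0'   == 0)    ? 1 : -1];
    typedef char assert_x41_is_0x41 [('\x41' == 0x41) ? 1 : -1];
    typedef char assert_o177_is_0177[('\177' == 0177) ? 1 : -1];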
Note however that the following source-code assertions will all be TRUE
provided they compile:
* ('\u0041' == L'\U00000041') will always be true if it compiles.
* ('\u0041' == 'A') will always be true if it compiles.
* ('\U00000041' == 'A') will always be true if it compiles.
A source-code symbolic character literal like 'A' is not guaranteed to
compile (though it's unlikely that a runtime charset would lack LATIN CAPITAL
LETTER A), so be careful with characters like '[', which may not exist in all
ISO-646-compatible run-time charsets.