Re: Backslash n [OT] was Line Separator and Paragraph Separator

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Oct 24 2003 - 07:12:36 CST


From: <jon@hackcraft.net>

> > > Still, I stand by saying that \n is defined in C++ as LF and \r as CR,
> > > because that's sitting in front of me in black and white.
> >
> > Yes, true. But that does *not* mean that (int)'\n' can be counted on to
> > be 10.
>
> Of course, given that any of a variety of character encodings could be in
> use, any guarantee that (int)'\n' == 10 would violate the definition of \n
> as LF.

What matters in the standard is that the source author can assume that '\n'
has the desired effect of terminating a line in text files, i.e. the same
effect that LF produces in a Unix environment. There is no such requirement
for binary files (so it does not apply to streams opened in binary mode with
the standard C library, i.e. with the "b" flag), and only text streams are
required to perform whatever conversions are needed to preserve that effect
(see the sketch after the list below):

- in CP/M, DOS, OS/2 and Windows, this is done by the standard library linked
with the application, not by the OS;

- in MVS and VMS (and in some cases in NT, through its optional support for
pluggable foreign filesystems), this may be done by the OS itself;

- on Mac Classic, this is done by the compiler itself, which binds '\n' to
the line-terminating function required by the language standard; that
function is mapped to code 13 (CR) in the Macintosh character set.
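
A minimal sketch of the text/binary distinction, using only standard C calls
(the file names are just placeholders); which bytes actually reach the disk
depends on the platform's library:

    #include <stdio.h>

    int main(void)
    {
        /* Text stream: the library may translate '\n' into the platform's
           line ending (CR LF on Windows, LF on Unix, CR on Mac Classic). */
        FILE *text = fopen("demo.txt", "w");
        /* Binary stream ("b" flag): no translation, '\n' is written as-is. */
        FILE *bin = fopen("demo.bin", "wb");

        if (text) { fputs("one line\n", text); fclose(text); }
        if (bin)  { fputs("one line\n", bin);  fclose(bin); }
        return 0;
    }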

In all of these cases, the test "if ('\n' == 10)" is not necessarily true
even with a compiler that conforms to C99 or to ISO C++: this falls in the
gray area where characters are promoted to integers. The C and C++ languages
are not very clear here, because they use plain integer promotion rules to
represent characters as integers instead of separating the two notions
semantically. (This gray area does not exist in Java, where byte and char are
distinct datatypes and converting between them requires an explicit cast,
even though Java still allows chars to be treated numerically, with a defined
but limited arithmetic on them.)
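
For instance, this small test (plain standard C, nothing platform-specific
assumed) prints 10 on ASCII-based systems, but the standard never promises
that result; on an EBCDIC platform the newline character has a different
code:

    #include <stdio.h>

    int main(void)
    {
        /* The standard only says '\n' is the implementation's newline
           character; it never promises that its integer value is 10. */
        printf("'\\n' == %d, '\\r' == %d\n", (int)'\n', (int)'\r');
        return 0;
    }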

I just think it is a shame that the legacy use of char to mean a byte in
C/C++ was an initial design error, but we have to live with it, given the
huge number of programs written on that assumption. It is still consistent
with the original performance-oriented design of C/C++, in which a byte (as
an integer type) is not even required to have a fixed bit width.
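
To illustrate that last point, CHAR_BIT reports how many bits the
implementation's char/byte actually has; the standard only requires it to be
at least 8:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* sizeof(char) is 1 by definition, but the number of bits in that
           unit is implementation-defined (CHAR_BIT >= 8). */
        printf("CHAR_BIT = %d, sizeof(int) = %zu bytes\n",
               CHAR_BIT, sizeof(int));
        return 0;
    }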

This causes problems on systems such as 4-bit microcontrollers, where the
minimum addressable (and allocatable) memory unit is the nibble: there, a C
program would have to assume that a char occupies two nibbles, and thus two
memory cells, so an operation like c++ (where c is a char) would need to
advance the physical address by 2. Since that would break the use of char as
an integer type, the compiler instead handles conversions between integers
and char* with a multiplication factor of 2, and differences of char* include
a division by 2. The drawback of this scheme is that a single memory nibble
can no longer be addressed, except through another compiler-specific native
datatype smaller than a char, such as __int4 or __nibble.
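
A short sketch of the guarantee the compiler has to preserve on such a target
(ordinary standard C, nothing microcontroller-specific): pointer arithmetic
on char* is always measured in chars, whatever the size of the physical
memory cell, so the scaling by 2 described above must stay invisible to the
program:

    #include <stdio.h>

    int main(void)
    {
        char buf[8];
        char *p = buf;
        char *q = buf + 4;

        /* The difference is 4 chars, regardless of whether one char
           occupies one byte, two nibbles, or CHAR_BIT individual bits
           in the physical address space. */
        printf("%td\n", q - p);
        return 0;
    }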

The same problem occurs on systems where the memory or I/O space is
addressable in 1-bit units: to support them (most often microcontrollers),
the C compiler has to add a __bit datatype and handle the conversions between
char* and __bit* pointers, notably when computing pointer differences.
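
On implementations without such an extension, the same bit/char scaling can
be written out by hand in portable C; the factor is exactly CHAR_BIT, which
is what a compiler converting between char* and a (hypothetical) __bit* would
have to apply:

    #include <limits.h>
    #include <stdio.h>

    /* Address an individual bit within a char array: the bit index is split
       into a char offset (bit / CHAR_BIT) and a bit position within that
       char (bit % CHAR_BIT), i.e. the division/multiplication by CHAR_BIT
       mentioned above. */
    static int get_bit(const unsigned char *base, unsigned long bit)
    {
        return (base[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
    }

    int main(void)
    {
        unsigned char flags[2] = { 0x05, 0x80 };
        printf("%d %d %d\n",
               get_bit(flags, 0), get_bit(flags, 2), get_bit(flags, 15));
        return 0;
    }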

Whatever you think of it, all this should have been defined more precisely in
the C/C++ standards, by designing two separate sets of datatypes and
requiring explicit rather than implicit conversions and promotions between
them:

1) one set aimed at performance and system integration, which maps physically
addressable memory units but makes no promise about the supported value
range, including the native floating-point numbers (with their full range
and precision, even where that is a superset or subset of the standard IEEE
formats); for now C and C++ define (though not completely) only this set of
datatypes, with various portability issues (the standard C datatype 'char' is
among them, and, shamefully, so is the ANSI 'wchar_t' datatype);

2) one set aimed at semantics, which maps to however many addressable memory
units are needed to support the standard ranges, and in which a "character"
datatype (as defined by Unicode) could be designed, as well as the standard
IEEE floating-point numbers with all their expected values, so that it
becomes portable across systems. Java includes only this set of datatypes,
but most C/C++ compilers now come with header files that map these standard
types onto the native ones (a short sketch follows).
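
C99's <stdint.h> is an example of exactly that kind of header-level mapping:
fixed-range "semantic" types defined on top of the native ones. A small
sketch, using a Unicode code point since that is the context here:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Exact-width types from <stdint.h>: int32_t and uint16_t have the
           same range on every platform that provides them, unlike the
           native char/int/long types. */
        int32_t  code_point = 0x1D11E;             /* MUSICAL SYMBOL G CLEF */
        uint16_t utf16[2]   = { 0xD834, 0xDD1E };  /* its UTF-16 encoding   */

        printf("U+%05lX -> %04X %04X\n",
               (unsigned long)code_point, utf16[0], utf16[1]);
        return 0;
    }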


