From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 10 2004 - 16:59:04 EST
----- Original Message -----
From: "John Cowan" <cowan@mercury.ccil.org>
To: "Philippe Verdy" <verdy_p@wanadoo.fr>
Cc: <unicode@unicode.org>
Sent: Saturday, January 10, 2004 7:31 PM
Subject: Re: doubt
> Philippe Verdy scripsit:
>
> [much useful stuff snipped]
>
> > A source-code symbolic character literal like 'A' is not guaranteed to
> > compile (but it's unlikely that there's no character LATIN CAPITAL
> > LETTER A in the runtime charset), so be careful with some characters
> > like '[' which may not exist in all ISO-646 compatible run-time charsets.
>
> There is a concept of the minimal runtime charset: it must include the
> ASCII letters and digits and some others.
This is needed to support the ANSI C library, but it is not, I think, a
requirement of the language itself; by "language" I mean here its compiler on
a particular platform, which transforms a source file by interpreting it with
a source charset and converting its literals into a "runtime" charset.
Still, the term "runtime" charset is quite confusing, because it merely
designates the charset to which string and character constants in the source
file are converted in the binary file; it does not mean that the compiled
application will effectively use that charset, or that it will be the
charset of the environment in which the compiled application runs.
Not all platforms are able to represent every ASCII letter and digit (or the
other characters in the "invariant" subset of ISO-646 and EBCDIC) in a
single char, notably 4-bit microcontrollers, even though you could use C or
C++ to write software that works on such a limited platform.
I think that a more modern approach to supporting 4-bit controllers, or even
a new 32-bit or 64-bit processor with bit-addressable memory, would be to
port the compiler so that sizeof(char) still equals 1, without necessarily
meaning that all addressable memory needs to be aligned on char boundaries:
a pointer to char could just as well use a physical bit-address internally,
where incrementing a char pointer in fact adds 8 to the pointer.
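To illustrate, here is a rough sketch in portable C of what such a compiler
could do internally (the ram[] array and the load_char() helper are of
course invented for the illustration; a real bit-addressable CPU would do
this in hardware):

#include <stdio.h>
#include <stdint.h>

/* Emulated bit-addressed memory: a "char pointer" is a physical bit
   address; adding 1 char to it adds 8 bit positions, yet sizeof(char)
   still equals 1 at the C level. */
static uint8_t ram[32];

/* Load the 8-bit char whose first bit is at bit address 'a'. */
static uint8_t load_char(size_t a) {
    size_t   byte = a >> 3;
    unsigned off  = (unsigned)(a & 7);
    if (off == 0)
        return ram[byte];                 /* byte-aligned fast path */
    return (uint8_t)((ram[byte] << off) | /* high bits of the char  */
                     (ram[byte + 1] >> (8 - off))); /* low bits     */
}

int main(void) {
    ram[0] = 'A'; ram[1] = 'B';  /* two byte-aligned chars           */
    size_t p = 0;                /* bit address of the first char    */
    printf("%c\n", load_char(p));  /* prints A */
    p += 8;                      /* ++p on a char*: add 8 bit positions */
    printf("%c\n", load_char(p));  /* prints B */
    return 0;
}

When p is byte-aligned the fast path is taken; otherwise the char simply
straddles two physical bytes.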
The standard C/C++ libraries would work in such an environment, because the
required condition "sizeof(char) == 1" does not mean that the physical
address is incremented by 1; the only requirements are that the "char"
datatype be the minimum allocatable unit of memory when using
malloc()/free(), and that this datatype be large enough to store at least
the ASCII uppercase letters, digits and a few symbols (that repertoire alone
would fit in 6 bits, though ISO C actually mandates CHAR_BIT >= 8).
Nothing forbids the compiler from adding its own datatype for memory units
that are actually separately addressable but smaller than a char; for
example a "__bit" type which would in fact be only 1 bit wide, which could
not be allocated with malloc() and free(), and which would have these
properties:
sizeof(__bit) == 0 (not allocatable by malloc()/free()), but
__bitsizeof(__bit) == 1, and
__bitsizeof(char) == 8, and
(char*)(charArray + 1) - (char*)(charArray) == 1 as expected, but also
(__bit*)(charArray + 1) - (__bit*)(charArray) == 8;
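The __bit extension above is purely hypothetical, but its pointer arithmetic
is easy to emulate in today's C with a plain bit offset standing in for a
__bit* (again, all names below are invented):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Emulation of the hypothetical __bit* pointer: just a bit offset
   into a byte array, with single-bit load/store helpers. */
static uint8_t ram[16];
typedef size_t bitptr;               /* stands in for a __bit*      */

static int bit_load(bitptr p) {
    return (ram[p >> 3] >> (7 - (p & 7))) & 1;
}
static void bit_store(bitptr p, int v) {
    uint8_t mask = (uint8_t)(1u << (7 - (p & 7)));
    if (v) ram[p >> 3] |= mask;
    else   ram[p >> 3] &= (uint8_t)~mask;
}

int main(void) {
    bitptr p0 = 0 * 8, p1 = 1 * 8;   /* (__bit*)(charArray+0), (+1) */
    assert(p1 - p0 == 8);            /* the difference seen as __bit* */

    bit_store(p0 + 1, 1);            /* set bit 1 of the first char */
    printf("ram[0] = 0x%02X, bit = %d\n", ram[0], bit_load(p0 + 1));
    return 0;
}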
and with the possibility of creating arbitrary pointers to this "smaller
than char" datatype. For safety, the processor could require some memory
alignments when handling data larger than a single memory unit. If
necessary, the "__size_t" datatype could actually be a fixed-point number,
whose conversion from/to a standard integral type would include a shift
operation, but that would allow defining:
__sizeof(__bit) == (__size_t)0.125
__sizeof(char) == (__size_t)1.000 == 8 * __sizeof(__bit)
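Such a fixed-point __size_t could be emulated today by counting sizes in
bits and treating the low 3 bits as the fractional (sub-char) part, so that
conversions from/to byte counts are 3-bit shifts (the fixsize_t name and
its helpers below are invented):

#include <stdio.h>

/* Fixed-point "__size_t" counted in bits: the low 3 bits are the
   fractional (sub-char) part, so 1 char == 8 raw units. */
typedef unsigned long fixsize_t;      /* stands in for __size_t      */

#define SIZEOF_BIT  ((fixsize_t)1)    /* __sizeof(__bit) == 0.125    */
#define SIZEOF_CHAR ((fixsize_t)8)    /* __sizeof(char)  == 1.000    */

static fixsize_t from_bytes(unsigned long n) { return (fixsize_t)n << 3; }
static unsigned long to_bytes(fixsize_t s)   { return (unsigned long)(s >> 3); }

int main(void) {
    /* __sizeof(char) == 8 * __sizeof(__bit), i.e. 1.000 == 8 * 0.125 */
    printf("%d\n", SIZEOF_CHAR == 8 * SIZEOF_BIT);  /* prints 1   */
    printf("%lu\n", to_bytes(from_bytes(100)));     /* prints 100 */
    return 0;
}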
So the minimal 8-bit char needed by most programs and libraries would
continue to work without modification of the source code, even if it is an
artificial construct of the compiler which hides the details of the way
addresses are internally computed.
On such a platform, it would then still be possible to support an
ASCII-compatible "runtime" charset, as well as UTF-8 and the other classic
Unicode encodings, such as UTF-16 with "wchar_t" defined as a 16-bit
"short", or UTF-32 mapped to a 32-bit "long".
On a bit-addressable platform, the wchar_t datatype could just as well be
defined as a single 21-bit code unit, if there are no alignment constraints
for reading/writing words made of multiple memory units with distinct
addresses: incrementing a wchar_t pointer would physically add 21 to that
pointer...
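For example (still an emulation with invented names, packing the 21-bit
units back to back with shifts, MSB first):

#include <stdio.h>
#include <stdint.h>

/* Pack 21-bit "wchar_t" code units back to back in bit-addressed
   memory; incrementing a pointer to one adds 21 to the bit address. */
static uint8_t ram[32];

static void store21(size_t bitaddr, uint32_t cp) {
    for (int i = 0; i < 21; i++, bitaddr++) {
        int bit = (cp >> (20 - i)) & 1;           /* MSB first */
        uint8_t mask = (uint8_t)(1u << (7 - (bitaddr & 7)));
        if (bit) ram[bitaddr >> 3] |= mask;
        else     ram[bitaddr >> 3] &= (uint8_t)~mask;
    }
}

static uint32_t load21(size_t bitaddr) {
    uint32_t cp = 0;
    for (int i = 0; i < 21; i++, bitaddr++)
        cp = (cp << 1) | ((ram[bitaddr >> 3] >> (7 - (bitaddr & 7))) & 1);
    return cp;
}

int main(void) {
    size_t p = 0;
    store21(p, 0x1D11E);   /* MUSICAL SYMBOL G CLEF             */
    p += 21;               /* ++p on this wchar_t*: add 21 bits */
    store21(p, 0x0041);    /* LATIN CAPITAL LETTER A            */
    printf("U+%05X U+%05X\n", (unsigned)load21(0), (unsigned)load21(21));
    return 0;
}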
There are lots of solutions for a compiler to maintain the preconditions on
chars needed to support ANSI C and a minimal "runtime" charset, even if the
platform allows accessing units smaller than a char.