Off topic: C Language (was RE: Multibyte definition)

From: Marco.Cimarosti@icl.com
Date: Tue Mar 21 2000 - 11:10:14 EST


John Cowan wrote:
(C Type 'char'...)
> Must have at least 8 bits.

Right.

The standard header <limits.h> (http://www.dinkum.com/htm_cl/limits.html)
defines a constant 'CHAR_BIT' that evaluates to the number of bits in a
'char'.

8 is the smallest value the standard allows (although, in practice, I
would be very surprised if any compiler in the world used a value other
than 8).

> sizeof(char) is guaranteed to be 1.

Right, by definition.
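
A minimal sketch of how these two guarantees can be checked in code. The
helper names 'bits_in' and 'check_char_guarantees' are made up for this
example; only 'CHAR_BIT' and 'sizeof' come from the standard.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Width in bits of an object that occupies 'size' bytes (hypothetical helper). */
static size_t bits_in(size_t size)
{
    return size * CHAR_BIT;
}

/* Checks the two guarantees quoted above; aborts if either one fails. */
static void check_char_guarantees(void)
{
    assert(CHAR_BIT >= 8);        /* minimum allowed by the standard */
    assert(sizeof(char) == 1);    /* true by definition */
}
```

Calling 'bits_in(sizeof(long))', for instance, gives the width of a 'long'
without assuming 8-bit bytes.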

>Chars may be signed or unsigned, so
>the portable range is 0 to 127.
>Unsigned chars have a portable range
>of 0 to 255, fortunately.

To be really super portable, one should use the symbols in <limits.h>:

- 'signed char' is between 'SCHAR_MIN' and 'SCHAR_MAX';
- 'unsigned char' is between 0 and 'UCHAR_MAX';
- 'char' has the same range as either 'signed char' or 'unsigned char'
(although it remains a distinct type), and ranges between 'CHAR_MIN' and
'CHAR_MAX'.
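
The limits listed above could be used roughly as follows. This is only a
sketch: the function names are hypothetical, and the only standard
symbols used are the macros from <limits.h>.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if plain 'char' is signed on this implementation, 0 otherwise. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Returns 1 if 'value' can be stored in a plain 'char' without loss. */
static int fits_in_char(long value)
{
    return value >= CHAR_MIN && value <= CHAR_MAX;
}

/* Checks that plain 'char' shares its range with one of its two siblings. */
static void check_char_limits(void)
{
    if (plain_char_is_signed())
        assert(CHAR_MIN == SCHAR_MIN && CHAR_MAX == SCHAR_MAX);
    else
        assert(CHAR_MIN == 0 && CHAR_MAX == UCHAR_MAX);
}
```

Values from 0 to 127 pass 'fits_in_char' everywhere, which is the portable
range mentioned in the quoted text.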

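Similarly, the standard header <wchar.h>
(http://www.dinkum.com/htm_cl/wchar.html) specifies that:

- 'wchar_t' is between 'WCHAR_MIN' and 'WCHAR_MAX'. Its signedness is
implementation-defined, so 'WCHAR_MIN' may be 0 even where 'CHAR_MIN' is
negative; what is guaranteed is that 'wchar_t' can represent every member
of the largest character set among the supported locales.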
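
A short sketch of the same idea for wide characters. The function names
are invented for illustration; 'WCHAR_MIN' and 'WCHAR_MAX' are the macros
from <wchar.h>.

```c
#include <assert.h>
#include <wchar.h>

/* Returns 1 if 'wchar_t' is a signed type on this implementation. */
static int wchar_is_signed(void)
{
    return WCHAR_MIN < 0;
}

/* Checks that an ordinary letter lies inside the 'wchar_t' range. */
static void check_wchar_limits(void)
{
    assert(L'A' >= WCHAR_MIN && L'A' <= WCHAR_MAX);
    assert(L'\0' == 0);    /* the wide null character has the value zero */
}
```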

> > - "Multibyte string": [...]
> Terminated by a '\0'.
> > - "Wide string": [...]
> Terminated by a L'\0'.
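
To make the two terminators concrete, here is a sketch of what the
standard functions 'strlen' and 'wcslen' do. The names 'mb_units' and
'wide_units' are hypothetical; they count array elements, not characters
in the i18n sense.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Counts the 'char' elements before the terminating '\0', like strlen. */
static size_t mb_units(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Counts the 'wchar_t' elements before the terminating L'\0', like wcslen. */
static size_t wide_units(const wchar_t *ws)
{
    size_t n = 0;
    while (ws[n] != L'\0')
        n++;
    return n;
}

/* Compares both helpers with their standard library counterparts. */
static void check_terminators(void)
{
    assert(mb_units("abc") == strlen("abc"));
    assert(wide_units(L"abc") == wcslen(L"abc"));
}
```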

You are probably right. In this case, I must change my definition of
"Multibyte character", because it does not require a null terminator:

Old:
> - "Multibyte character": a multibyte string containing only one character
> (in i18n terms), composed by one or more bytes.

New:
- "Multibyte character": an array of type 'char' (e.g. 'char
mbchr[MB_LEN_MAX] = { 0xC2, 0xB1 };') containing only one character (in i18n
terms), composed of one or more bytes.

The null terminator is not required because all multibyte schemes have a way
to determine how many "trail bytes" follow a "lead byte". In the UTF-8
example above, the leftmost bits of the 0xC2 "lead byte" indicate that
exactly one trail byte follows, while the leftmost bits of 0xB1 confirm that
this value is a valid "trail byte".
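
The lead byte / trail byte test for UTF-8 can be sketched as below. All
function names are hypothetical, and the code assumes sequences of at most
four bytes; it does not reject overlong forms, so it is not a full
validator.

```c
#include <assert.h>
#include <stddef.h>

/* Trail bytes announced by a UTF-8 lead byte, or -1 if 'lead' is not one. */
static int utf8_trail_count(unsigned char lead)
{
    if (lead < 0x80)           return 0;   /* 0xxxxxxx: single byte */
    if ((lead & 0xE0) == 0xC0) return 1;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 2;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 3;   /* 11110xxx */
    return -1;                             /* 10xxxxxx or out of range */
}

/* Returns 1 if 'b' has the 10xxxxxx pattern of a trail byte. */
static int utf8_is_trail(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length in bytes of the character at 'p', or 0 if it is malformed
   or does not fit in the 'avail' bytes that are readable. */
static size_t utf8_char_len(const unsigned char *p, size_t avail)
{
    int trail, i;

    if (avail == 0)
        return 0;
    trail = utf8_trail_count(p[0]);
    if (trail < 0 || (size_t)trail >= avail)
        return 0;
    for (i = 1; i <= trail; i++)
        if (!utf8_is_trail(p[i]))
            return 0;
    return (size_t)trail + 1;
}

/* Runs the helpers on the two bytes from the example: 0xC2 0xB1. */
static void check_example_bytes(void)
{
    const unsigned char mbchr[] = { 0xC2, 0xB1 };

    assert(utf8_trail_count(mbchr[0]) == 1);
    assert(utf8_is_trail(mbchr[1]));
    assert(utf8_char_len(mbchr, sizeof mbchr) == 2);
}
```

No terminator is consulted anywhere: the length comes entirely from the
bit patterns of the bytes themselves.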

Finally, as this is an "off topic" posting, I'd like to attempt to
demonstrate that the C language is so "general purpose" that anything can
be written in it, including humor.

Find attached an *ASCII* implementation of ANSI C wide and multibyte
characters. I have submitted it to the ASCII Consortium
(http://www.ecs.soton.ac.uk/~rwb197/ascii), but it hasn't yet appeared on
their web site (the ASCII Editorial Committee is probably still balloting
on this :-).

Ciao.
Marco







This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT