RE: string vs. char [was Re: Java and Unicode]

From: Marco Cimarosti (marco.cimarosti@europe.com)
Date: Fri Nov 17 2000 - 04:39:06 EST


Addison P. Phillips wrote:
> I ended up deciding that the Unicode API for this OS will only work in
> strings. CTYPE replacement functions (such as isalpha) and character
> based replacement functions (such as strchr) will take and return
> strings for all of their arguments.
>
> Internally, my functions are converting the pointed character to its
> scalar value (to look it up in the database most efficiently).
>
> This isn't very satisfying. It goes somewhat against the grain of 'C'
> programming. But it's equally unsatisfying to use a 32-bit
> representation for a character and a 16-bit representation for a
> string, because in 'C', a string *is* an array of characters. Which is
> more natural? Which is more common? Iterating across an array of
> 16-bit values or

Actually, C does have different types for characters within strings and for
characters in isolation.

The type of a string literal (e.g. "Hello world!\n") is "array of char",
while the type of a character literal (e.g. 'H') is "int".

This distinction is generally also reflected in the C library, so you
don't get compiler warnings when passing character constants to functions.

E.g., compare the following functions from <stdio.h>:

int fputs(const char * s, FILE * stream);
int fputc(int c, FILE * stream);

The same convention is generally used throughout the C library, not only in
the I/O functions. E.g.:

int isalpha(int c);
int tolower(int c);

This distinction has also been retained in the newer "wide character
library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
wide equivalent of "int".

The wide version of the examples above is:

int fputws(const wchar_t * s, FILE * stream);
wint_t fputwc(wchar_t c, FILE * stream);

int iswalpha(wint_t c);
wint_t towlower(wint_t c);

In a Unicode implementation of the "wide character library" (wchar.h and
wctype.h), this difference may be exploited to use different UTFs for
strings and for isolated characters:

typedef unsigned short wchar_t;
/* UTF-16 code unit, used within strings. */

typedef unsigned long wint_t;
/* UTF-32 character, used for handling isolated characters. */

But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
characters in a couple of stupid APIs:

wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);

BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow"
ancestors: strchr() and strrchr().

But I think that changing those "wchar_t c" to "wint_t c" is a smaller
"violence" to the standards than changing them to "const wchar_t * c".
And you can also implement it in an elegant, quasi-standard way:

wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

#ifdef PEDANTIC_STANDARD
wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
#else
#define wcschr _wcschr_32
#define wcsrchr _wcsrchr_32
#define wcrtomb _wcrtomb_32
#endif

I would like to hear the opinion of C standardization experts (e.g. A.
Leca) about this bending of the C standard.

_ Marco.

______________________________________________
My e-mail is now:
>>> marco.cimarostiªeurope.com <<<
(Change "ª" to "@")




This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:15 EDT