Re: string vs. char [was Re: Java and Unicode]

From: Marco Cimarosti (marco.cimarosti@europe.com)
Date: Mon Nov 20 2000 - 09:50:44 EST

Next message: Michael (michka) Kaplan: "Re: string vs. char [was Re: Java and Unicode]"
Previous message: Antoine Leca: "Re: string vs. char [was Re: Java and Unicode]"
Maybe in reply to: addison@inter-locale.com: "string vs. char [was Re: Java and Unicode]"
Next in thread: Antoine Leca: "Re: string vs. char [was Re: Java and Unicode]"
Reply: Antoine Leca: "Re: string vs. char [was Re: Java and Unicode]"

Antoine Leca wrote:
> Marco Cimarosti wrote:
> > Actually, C does have different types for characters within strings and for
> > characters in isolation.
>
> That is not my point of view.
> There is a special case for 'H', that holds int type rather than char, for
> backward compatibility reasons (such as because the first versions of C were
> not able to deal correctly with to-be-promoted arguments). Similarly, a
> number of (old) functions use int for the character arguments.
> Then, there is the point of view that int represents _either_ a valid
> character, _or_ an error indication (EOF). This is the reason that makes
> int used for the return type of fgetc.

OK.

> Outside this, a string is clearly an array of characters, and characters are
> stored using the type char (or one of the sign alternatives). As a result,
> you can write 'H' either as such, or as "Hello, world!\n"[0].

OK.

> > The type of a string literal (e.g. "Hello world!\n") is "array of char",
> > while the type of a character literal (e.g. 'H') is "int".
> >
> > This distinction is generally reflected also in the C library, so that you
> > don't get compiler warnings when passing character constants to functions.
>
> You need not, since C considers character to be (small) integers, which eases
> passing of arguments. This is unrelated to the issue.

OK. I was just describing the background.
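
A minimal sketch of the background facts discussed so far, assuming a C (not C++) compiler. The function name check_literal_types is hypothetical and only serves the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Checks the typing rules discussed above. Compile as C: in C++ a
   character constant has type char, and the first assertion would fail. */
void check_literal_types(void)
{
    /* A character constant such as 'H' has type int in C. */
    assert(sizeof 'H' == sizeof(int));

    /* An element of a string literal has type char. */
    assert(sizeof "Hello, world!\n"[0] == sizeof(char));

    /* The two still compare equal: the char element is promoted to int
       before the comparison. */
    assert('H' == "Hello, world!\n"[0]);
}
```

Calling check_literal_types() returns silently when all three properties hold.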

> > This distinction has been retained also in the newer "wide character
> > library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
> > wide equivalent of "int".
>
> Not exactly. The wide versions has the same distinction as the narrow one for
> the second case above (finding errors), but not for the first one (promoting).

OK.

> > The wide version of the examples above is:
> >
> > int fputws(const wchar_t * c, FILE * stream);
> > wint_t fputwc(wint_t c, FILE * stream);
>                  ^^^^^^
> Instead, we have
> int fputws(const wchar_t * s, FILE * stream);
> wint_t fputwc(wchar_t c, FILE *stream);
>
> It shows clearly that c cannot hold the WEOF value. OTOH, the returned value
> _can_ be the error indication WEOF, so the type is wint_t.

Oops! Sorry.

I had two versions of "wchar.h" at hand: one (lovingly crafted by myself)
had «wchar_t c»; the other one (shipped with Microsoft Visual C++ 6.0) had:

_CRTIMP wint_t __cdecl fputwc(wint_t, FILE *);

I made the mistake of trusting the second one. :-)

> Similarly, the type of L'H' is wchar_t. You gave other examples in your "But".
>
> > int iswalpha(wint_t c);
>
> Here, the iswalpha is intended to be able to test valid characters as well
> as the error indication, so the type is wint_t; here WEOF is
> specifically allowed.

OK.

> > In an Unicode implementation of the "wide character library" (wchar.h and
> > wctype.h), this difference may be exploited to use different UTF's for
> > strings and characters:
>
> Ah, now we go into the interresting field.
> Please note that I left aside UTF-16, because I am not clear if 16-bit are
> adequate or not to code UTF-16 in wchar_t (in other words, if wchar_t can be
> a multiwide encoding).
>
> > typedef unsigned short wchar_t;
> > /* UTF-16 character, used within string. */
> >
> > typedef unsigned long wint_t;
> > /* UTF-32 character, used for handling isolated characters. */
>
> To date, no problem.
>
> > But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
> > character in a couple of stupid APIs:
>
> See above for another example: fputwc...
>
> > But I think that changing those "wchar_t c" to "wint_t c" is a smaller
> > "violence" to the standards than changing them to "const wchar_t * c".
>
> ;-)

OK, my trick is dirty as well, just a bit easier to hide. ;-)

> > And you can also implement it in an elegant, quasi-standard way:
> <corrected>
> > wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
>
> What is the point? You cannot pass to these anything other than values
> between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really
> "interesting" ways to extend the meaning of these functions outside
> this range.
> Or do I miss something?

_wcschr_32 and _wcsrchr_32 would return a pointer to the first (or last)
occurrence of the specified character in the string, just like their
standard counterparts.

But if «c >= 0x10000», then the character would be represented in «s» (a
UTF-16 string) by a surrogate pair, and the function would thus return the
address of the *high surrogate*.

E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and
«c» is 0x10005, both functions would return «&s[1]»: the address of the high
surrogate 0xD800.
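
A minimal sketch of how such a search could work. Everything here is an assumption made for illustration: the name wcschr_32 (without the leading underscore), and the typedefs utf16_unit and utf32_char, which stand in for a 16-bit wchar_t and a 32-bit wint_t so that the code does not depend on the compiler's own wchar_t width:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_unit;   /* stands in for a 16-bit wchar_t */
typedef uint32_t utf32_char;   /* stands in for a 32-bit wint_t  */

/* Finds the first occurrence of code point c in the zero-terminated
   UTF-16 string s. For c >= 0x10000 the result points at the high
   surrogate of the matching pair. Returns NULL if c is not found. */
const utf16_unit *wcschr_32(const utf16_unit *s, utf32_char c)
{
    assert(s != NULL);

    if (c > 0x10FFFF)
        return NULL;                /* not a Unicode code point */

    if (c < 0x10000) {
        /* BMP: search for a single unit; like wcschr, searching for
           0 finds the terminator itself. */
        for (;; ++s) {
            if (*s == c)
                return s;
            if (*s == 0)
                return NULL;
        }
    }

    /* Supplementary planes: split c into a surrogate pair. */
    utf32_char v = c - 0x10000;
    utf16_unit hi = (utf16_unit)(0xD800 + (v >> 10));
    utf16_unit lo = (utf16_unit)(0xDC00 + (v & 0x3FF));

    /* s[1] is safe to read here: s[0] is non-zero, so s[1] is at worst
       the terminator. */
    for (; *s != 0; ++s) {
        if (s[0] == hi && s[1] == lo)
            return s;
    }
    return NULL;
}
```

With the string from the example, wcschr_32(s, 0x10005) returns &s[1], and the unit found there is 0xD800, not the value that was searched for.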

Similarly for _wcrtomb_32(): assuming that «s» points into a UTF-8 string,
the function would insert in «s» the UTF-8 sequence corresponding to «c»
(4 octets for a character above U+FFFF).
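
A sketch of the conversion step, under stated assumptions: the name wcrtomb_32 is hypothetical, the mbstate_t argument is dropped because UTF-8 needs no shift state, and the special «s == NULL» behaviour of the standard wcrtomb is not reproduced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the UTF-8 encoding of code point c to s (which must have room
   for 4 octets) and returns the number of octets written, or (size_t)-1
   if c is a surrogate or lies beyond U+10FFFF. */
size_t wcrtomb_32(char *s, uint32_t c)
{
    unsigned char *p = (unsigned char *)s;

    assert(s != NULL);

    if (c < 0x80) {
        p[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        p[0] = (unsigned char)(0xC0 | (c >> 6));
        p[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return (size_t)-1;      /* isolated surrogate: not a character */
        p[0] = (unsigned char)(0xE0 | (c >> 12));
        p[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        p[0] = (unsigned char)(0xF0 | (c >> 18));
        p[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        p[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        p[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return (size_t)-1;
}
```

For U+10005, the character encoded by the surrogate pair 0xD800 0xDC05, this writes the four octets F0 90 80 85; for a BMP character such as U+2190 it writes three octets, E2 86 90.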

> > #ifdef PEDANTIC_STANDARD
> > wchar_t * wcschr(const wchar_t * s, wchar_t c);
> > wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> > size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> > #else
> > #define wcschr _wcschr_32
> > #define wcsrchr _wcsrchr_32
> > #define wcrtomb _wcrtomb_32
> > #endif

Of course, the APIs defined above are totally non-standard, and a
programmer who renames «_wcschr_32» to the standard name «wcschr» is
deliberately violating the standard.

Moreover, in this hypothetical implementation, even the APIs that keep
a standard "look & feel" would actually behave in a non-standard way.

E.g., «wint_t getwc(FILE * stream)» has a standard prototype but, assuming
«c = getwc(f)», it could be true that «c > WCHAR_MAX && c != WEOF». I.e.,
it could return a "valid" character that is greater than U+FFFF.
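
A sketch of the reading side, again under loud assumptions: getwc_32 is a hypothetical name, it reads from a memory buffer of UTF-16 units rather than from a FILE, and WCHAR_MAX_16 and WEOF_32 are made-up constants standing in for the WCHAR_MAX and WEOF of the hypothetical library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WCHAR_MAX_16 0xFFFFu       /* assumed WCHAR_MAX of a 16-bit wchar_t */
#define WEOF_32      0xFFFFFFFFu   /* assumed WEOF of a 32-bit wint_t       */

/* Reads one character from buf (n UTF-16 units), advancing *pos.
   A valid surrogate pair is combined into one code point above U+FFFF.
   Returns WEOF_32 when the buffer is exhausted. */
uint32_t getwc_32(const uint16_t *buf, size_t n, size_t *pos)
{
    assert(buf != NULL && pos != NULL);

    if (*pos >= n)
        return WEOF_32;

    uint32_t u = buf[(*pos)++];

    /* High surrogate followed by a low surrogate: combine the two. */
    if (u >= 0xD800 && u <= 0xDBFF && *pos < n) {
        uint32_t lo = buf[*pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++*pos;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }

    /* BMP character, or an unpaired surrogate passed through as-is. */
    return u;
}
```

Reading the units 0x2190, 0xD800, 0xDC05, 0x2192 yields 0x2190, 0x10005, 0x2192 and then WEOF_32; the second value is greater than WCHAR_MAX_16 without being the end-of-file indication.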

Such a non-standard implementation could only be used by programmers who
*know* the internals of the library and use it accordingly, because several
assumptions that are explicitly guaranteed by the standard are not guaranteed
by this implementation.

E.g., a programmer can usually assume that, after «p = wcschr(s, c)», it is
true that «p == NULL || *p == c». In our implementation, this basic
assumption fails when «c >= 0x10000»...

The only use for such a hackers' library would be to enable programmers to
implement *non*-standard UTF-32 applications that can be easily ported to a
*standard* *UCS-2* library (of course, losing Astral Planes support in the
porting).

It cannot be used for the opposite process: a UCS-2 application built
according to the standard cannot be painlessly extended to support UTF-32
just by porting it to our criminal implementation.

I think that I have been very clumsy in the last few paragraphs, so let me
try to make myself understood by imagining a possible "real-life" scenario.

Software house FooBar Inc. must write a Unicode application in C that has to
be shipped to several different markets.

They purchased a C compiler whose library is 100% standard and has good
support for Unicode *UCS-2*. This library is OK for all the languages
they have to support but one: Cantonese.

In their Cantonese version, in fact, they also need some new characters in
the Surrogate (aka "Astral") Planes.

So they decide to implement their own quick-and-dirty substitute for the
compiler's wchar.lib and wctype.lib, using the technique described above.

They will use their in-house library just for building the Cantonese
release, but they want to use the standard library to build the releases for
all other languages, because they plan to port the application to a
different OS and so they value portability very much.

Of course, as soon as a release of their favorite compiler with "Astral
planes" support hits the market, they will upgrade to it and send their
own library to the company's Museum of Hacks.

_ Marco

______________________________________________
La mia e-mail è ora: My e-mail is now:
>>> marco.cimarostiªeurope.com <<<
(Cambiare "ª" in "@") (Change "ª" to "@")


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:15 EDT