RE: What's in a wchar_t string ...

From: Winkler, Arnold F ([email protected])
Date: Thu Mar 04 2004 - 08:21:46 EST

Next message: Rick Cameron: "RE: What's in a wchar_t string on unix?"

Previous message: Edward H. Trager: "RE: SVG Fonts - Is it the Font Standard of the future?"
Next in thread: Antoine Leca: "Re: What's in a wchar_t string ..."
Reply: Antoine Leca: "Re: What's in a wchar_t string ..."

Folks,

Since "ISO/IEC 9899 - Programming Language C" was quoted, I wonder if
you are aware of the efforts of SC22/WG14 to develop a Technical Report
that deals with the problems discussed in this thread.

The document is ISO/IEC DTR 19769, "Extensions for the programming
language C to support new character data types".

The project is currently in DTR ballot and, once approved, will
certainly take some time to be implemented in C compilers and in
operating systems. But it gives a good indication of the direction in
which formal standardization of character data types in the C language
is going.

Here are some excerpts from the DTR 19769:

Quote:
3 The new typedefs
This Technical Report introduces the following two new typedefs,
char16_t and char32_t:
    typedef T1 char16_t;
    typedef T2 char32_t;
where T1 has the same type as uint_least16_t and T2 has the same type as
uint_least32_t.
The new typedefs guarantee certain widths for the data types, whereas
the width of wchar_t is implementation defined. The data values are
unsigned, while char and wchar_t could take signed values.
This Technical Report also introduces the new header:
    <uchar.h>
The new typedefs, char16_t and char32_t, are defined in <uchar.h>
4 Encoding
C99 subclause 6.10.8 specifies that the value of the macro
__STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL
(for example, 199712L), intended to indicate that values of type wchar_t
are the coded representations of the characters defined by ISO/IEC
10646, along with all amendments and technical corrigenda as of the
specified year and month." C99 subclause 6.4.5p5 specifies that wide
string literals are initialized with a sequence of wide characters as
defined by the mbstowcs function with an implementation-defined current
locale. Analogous to this macro, this Technical Report introduces two
new macros.

If the header <uchar.h> defines the macro __STDC_UTF_16__, values of
type char16_t shall have UTF-16 encoding. This allows the use of UTF-16
in char16_t even when wchar_t uses a non-Unicode encoding. In certain
cases the compile-time conversion to UTF-16 may be restricted to members
of the basic character set and universal character names (\Unnnnnnnn and
\unnnn) because for these the conversion to UTF-16 is defined
unambiguously.

If the header <uchar.h> defines the macro __STDC_UTF_32__, values of
type char32_t shall have UTF-32 encoding.

If the header <uchar.h> does not define the macro __STDC_UTF_16__, the
encoding of char16_t is implementation defined. Similarly, if the header
<uchar.h> does not define the macro __STDC_UTF_32__, the encoding of
char32_t is implementation defined.

An implementation may define other macros to indicate a different
encoding.
Unquote
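
To make the excerpt concrete, here is a minimal sketch of how a program
might probe the two typedefs and the two macros. It assumes a compiler
and C library that already ship <uchar.h> and accept the u"..." literal
prefix (both were later adopted by C11). All function names are
hypothetical helpers of mine, not part of the TR.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <uchar.h>    /* char16_t, char32_t */

/* Hypothetical helper: 1 if the implementation promises UTF-16 in char16_t. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;   /* encoding of char16_t is implementation defined */
#endif
}

/* Hypothetical helper: 1 if the implementation promises UTF-32 in char32_t. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;   /* encoding of char32_t is implementation defined */
#endif
}

/* Storage size in bits. The uint_leastN_t base types guarantee a minimum,
   not an exact width, so these may be larger than 16 and 32. */
size_t char16_bits(void)
{
    size_t bits = sizeof(char16_t) * CHAR_BIT;
    assert(bits >= 16);
    return bits;
}

size_t char32_bits(void)
{
    size_t bits = sizeof(char32_t) * CHAR_BIT;
    assert(bits >= 32);
    return bits;
}

/* Number of char16_t code units in a literal built only from universal
   character names: U+00E9 followed by U+10000. */
size_t sample_units(void)
{
    static const char16_t sample[] = u"\u00e9\U00010000";
    return sizeof sample / sizeof sample[0] - 1;   /* drop the terminator */
}
```

When __STDC_UTF_16__ is defined, sample_units() should be 3, because
U+10000 lies outside the BMP and needs a surrogate pair. Without the
macro, the count depends on the implementation.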

The document, which of course is copyrighted by ISO, starts with a nice
introduction that defines the problem. In addition to the excerpts
above, it also addresses the following subjects:
5 String literals and character constants
5.1 String literals and character constants notations
5.2 The string concatenation
6 Library functions
6.1 The mbrtoc16 function
6.2 The c16rtomb function
6.3 The mbrtoc32 function
6.4 The c32rtomb function
7 ANNEX A Unicode encoding forms: UTF-16, UTF-32
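
As an illustration of the library functions in clause 6, the following
sketch converts a multibyte string to char32_t with mbrtoc32 and back
with c32rtomb. It assumes the signatures as they later appeared in C11
<uchar.h>; the helper names, the buffer sizes, and the error convention
are my own choices. Both functions work in the current locale's
multibyte encoding, so anything beyond the basic character set depends
on the setlocale() call the program made.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <uchar.h>    /* mbrtoc32, c32rtomb, char32_t */
#include <wchar.h>    /* mbstate_t */

/* Hypothetical helper: decode a multibyte string into out[0..cap-1].
   Returns the number of char32_t values stored, or (size_t)-1 on an
   invalid sequence, a truncated sequence, or lack of space. */
size_t decode_to_c32(const char *mb, char32_t *out, size_t cap)
{
    assert(mb != NULL && out != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);            /* initial conversion state */
    size_t left = strlen(mb);
    size_t n = 0;
    while (left > 0) {
        char32_t c;
        size_t r = mbrtoc32(&c, mb, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete input */
        if (r == 0)
            break;                        /* decoded a null character */
        if (n == cap)
            return (size_t)-1;            /* output buffer full */
        out[n++] = c;
        if (r != (size_t)-3) {            /* -3: output without consuming input */
            mb += r;
            left -= r;
        }
    }
    return n;
}

/* Hypothetical helper: encode n char32_t values into a null-terminated
   multibyte string. Returns the byte count without the terminator, or
   (size_t)-1 on an unrepresentable character or lack of space. */
size_t encode_from_c32(const char32_t *in, size_t n, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        char buf[MB_LEN_MAX];
        size_t r = c32rtomb(buf, in[i], &st);
        if (r == (size_t)-1 || used + r >= cap)
            return (size_t)-1;
        memcpy(out + used, buf, r);
        used += r;
    }
    out[used] = '\0';
    return used;
}

/* 1 if the string survives a decode/encode round trip unchanged. */
int roundtrips(const char *mb)
{
    char32_t wide[64];
    char back[64 * MB_LEN_MAX + 1];
    size_t n = decode_to_c32(mb, wide, 64);
    if (n == (size_t)-1)
        return 0;
    if (encode_from_c32(wide, n, back, sizeof back) == (size_t)-1)
        return 0;
    return strcmp(mb, back) == 0;
}
```

Plain ASCII input such as roundtrips("abc") should succeed in any
locale; UTF-8 input would additionally need a UTF-8 locale to be
selected first.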

Best regards
Arnold

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Nelson H. F. Beebe
Sent: Wednesday, March 03, 2004 1:49 PM
To: [email protected]
Cc: [email protected]
Subject: Re: What's in a wchar_t string ...

"Frank Yung-Fong Tang" <[email protected]> asks on Wed, 3 Mar 2004
12:38:49 -0500:

>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined?
>> or does it only mean wchar_t hold the character in ISO_10646
>> (which mean it could be 2 bytes, 4 bytes or more than that?)

Here is the exact text from

        INTERNATIONAL ISO/IEC STANDARD 9899
        Second edition
        1999-12-01
        Programming languages -- C

>> ...
>> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>> example, 199712L), intended to indicate
>> that values of type wchar_t are the coded
>> representations of the characters defined
>> by ISO/IEC 10646, along with all amendments
>> and technical corrigenda as of the
>> specified year and month.
>> ...

It says nothing more about the size of wchar_t, or what encodings are
used: note the vague language "coded representations...". This means
effectively that the implementation, not the Standard, decides.

Very few current Unix C or C++ compilers even define the symbol
__STDC_ISO_10646__; the C/C++ feature test package at

ftp://ftp.math.utah.edu/pub/features
http://www.math.utah.edu/pub/features

probes that macro value, and many others.

My logs of its runs in about 90 build environments show definitions
with values 200009 for GNU gcc versions 3.x (all platforms), Intel icc
versions 7.x and 8.0 (Intel IA-32 and IA-64), and Portland Group pgcc
versions 4.x and 5.x (Intel IA-32). On all of these, it reports that
sizeof(wchar_t) = 4, but of course, that says nothing whatever about
the encoding.
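
A rough sketch of the kind of probe described above follows. It is not
taken from the features package; the helper names are hypothetical, and
it only reports what the compiler and headers claim about wchar_t.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, wchar_t */
#include <wchar.h>    /* WCHAR_MAX */

/* Hypothetical helper: the yyyymm value of __STDC_ISO_10646__, or 0 if
   the implementation does not define the macro. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    long v = __STDC_ISO_10646__;
    assert(v > 0);          /* expected form: yyyymmL */
    return v;
#else
    return 0L;
#endif
}

/* Size of wchar_t in bytes; says nothing about the encoding. */
size_t wchar_bytes(void)
{
    return sizeof(wchar_t);
}

/* 1 if a single wchar_t can hold every code point up to U+10FFFF. */
int wchar_holds_all_code_points(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

A nonzero iso10646_version() together with a wide enough wchar_t is the
combination reported in the logs above; a zero result leaves the
encoding entirely up to the implementation.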

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: [email protected]  -
- 155 S 1400 E RM 233                       [email protected]  [email protected] -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe  -
-------------------------------------------------------------------------------


This archive was generated by hypermail 2.1.5 : Thu Mar 04 2004 - 12:36:46 EST