From: Karlsson Kent - keka [keka@im.se]
Sent: Friday, January 26, 2001 2:12 PM
Subject: Draft changes for wchar_t and Conformance sections
[Suggested new text at the end of this message]
> -----Original Message-----
> From: Sandra O'donnell USG [mailto:odonnell@zk3.dec.com]
...
> These are all optional types in C99, and I think many people
No, they are not quite that. Clause 7.18.1.1 says for <stdint.h>:
"if an implementation provides integer types with widths of 8, 16,
32, or 64 bits, it **shall** define the corresponding typedef names."
(my emphasis). And the typedef names with the word "least" in them
are **required** for least widths of 8, 16, 32, and 64 bits (clause
7.18.1.2), irrespective of architecture (though you may then (very
rarely) find 9 or more bits in a uint_least8_t value representation).
> (including me!) would object to recommending such encoding-specific
> types. The code I have that currently uses char and cleanly handles
> UTF-8, Latin-1, eucJP, etc., would have to be revised to
> special-case UTF-8 using the uint8_t type, for example.
Though "char" (or better "unsigned char") can handle multiple
encodings, by locale, still:

1. There is special-casing: if the locale is a "UTF-8 locale", some
   special functions can be used (like Markus Kuhn's wcwidth[_cjk]())
   which work only for UTF-8.

2. Even though some things work with "agnostic" datatypes like char
   and wchar_t, not everything can be controlled from POSIX locale
   data, for a more ambitious Unicode implementation. Which is why
   some nail wchar_t to UCS-2/UTF-16, and others have other names for
   Unicode-specific datatypes.

3. There is one such recommendation there already: UNICHAR; it would
   be more neutral to have UTF8_t, UTF16_t, and UTF32_t. I don't
   think it would be a good idea to have a UNICHAR which could
   sometimes be UTF-16, sometimes UTF-32, since those encodings need
   to be treated differently.
> I know we have differing opinions about the desirability of
> encoding-specific types, but I think this is waaaaay too
> controversial to put into a small section of the Unicode standard
> that's trying to describe wchar_t.
>
> . . .
>
There is only one suggested "typedefed" name: UNICHAR, for UTF-16
*code units*, not *characters*. Suggestion:

    typedef uint8_t  UTF8_t;
    typedef uint16_t UTF16_t;
    typedef uint32_t UTF32_t;

(I'm not really suggesting to write out those typedefs as C code in
the text, even though I did so here.)
>
> Again, I disagree. However, here I was trying simply to use as
> much of the existing text as possible. It refers to UNICHAR, so my
> revision did as well.
>
I'm not sure wchar_t can be used for UTF-16 and still fully conform
to C99. I have very much the impression that wchar_t may be UCS-2,
or UCS-4, or whatever else can hold any character in the coded
character set in a *single* code unit, but not UTF-8 code units, nor
UTF-16 code units. (I cannot quote the C standard, since I don't
have my (only, paper, FDIS) copy handy.)
>
> I almost mentioned that in my original message, but thought it
> might bring the wrath of Microsoft/IBM down upon me. I agree that
> UTF-16 is not useable as a wchar_t encoding because wchar_t is a
> FIXED WIDTH encoding and UTF-16 clearly is not. UCS-2 is okay in
> wchar_t, of course.
>
> It's just that I know that it's a tricky subject for the vendors
> who chose 16 bits as their wchar_t. I mean, if wchar_t is 16 bits
> and wc happens to contain the first half of a surrogate pair, what
> are functions like iswalpha(), wcwidth(), wcstombs(), towupper(),
> etc., supposed to do?
Well, the C (and COBOL?) restrictions are broken anyway, as Mark
says. to[w]upper('ß') should return "SS", but it can't. More
examples in SpecialCasing.txt. I see no problem in MS having wchar_t
be UTF-16 code units, even though that strictly speaking does not
follow the C99 standard.
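The constraint can be made concrete: since towupper() returns a
single wint_t, a full case mapping per SpecialCasing.txt needs a
string-to-string interface instead. A minimal hypothetical sketch of
mine (the function name is invented; only the 'ß' -> "SS" case is
handled, with plain toupper() for everything else):

```c
#include <ctype.h>

/* Hypothetical sketch of a string-to-string uppercase function for
 * UTF-8 input; unlike towupper(), it can map one character to two.
 * Only the U+00DF 'ß' (UTF-8 bytes 0xC3 0x9F) -> "SS" special case
 * from SpecialCasing.txt is handled; ASCII goes through toupper()
 * and all other bytes are copied unchanged.  'out' must have room
 * for up to twice the input length, plus the terminator. */
static void utf8_toupper_sketch(const char *in, char *out)
{
    while (*in) {
        if ((unsigned char)in[0] == 0xC3 &&
            (unsigned char)in[1] == 0x9F) {
            *out++ = 'S';      /* one character becomes two */
            *out++ = 'S';
            in += 2;
        } else {
            *out++ = (char)toupper((unsigned char)*in++);
        }
    }
    *out = '\0';
}
```

With this, "straße" comes out as "STRASSE", which the
one-code-unit-in, one-code-unit-out towupper() interface cannot
produce.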
> This may be worth discussing, but I don't want to lose sight of
> the existing inaccuracy in R3.0 that says it is NOT
> Unicode-conformant to use a 32-bit wchar_t. I want to make sure
> that gets fixed.
I don't think it is all that good an idea to have UTF-32 for
strings, though it is useful for isolated characters, and for
interrogation functions on isolated characters. Even for toupper on
an isolated character, it still does not work properly.
> I'm just not sure we can have a productive discussion about UTF-16
> as wchar_t; it tends to cause a lot of heat.
>
> Regards,
>
> -- Sandra
> -----------------------
> Sandra Martin O'Donnell
> Compaq Computer Corporation
> sandra.odonnell@compaq.com
> odonnell@zk3.dec.com
======================================================================
[The following is VERY brief; but I don't want to make it any longer
than the current 5.2. I've deleted a lot of old text that I find
tangential in such a short piece, or that I find to be moot. I've
added some text about 'char' since that is what Linux folks prefer
for UTF-8, as well as more about typedefs and conditional
compilation. I hope I have covered what Sandra wanted to cover, even
though I have made some rather thorough changes to Sandra's
suggested text.]
Suggested new text for 5.2:
----------------------------------------------------------------
5.2 Datatypes for Unicode
=========================

Unicode code units (that singly or in sequence represent a Unicode
character) need to be represented in some datatype in programming
languages. Some programming languages may also have predefined types
(or classes) for Unicode strings.
___Java___

The datatype 'char' in Java is for representing UTF-16 code units,
though initially only for the BMP. The datatype 'int' is sometimes
used to represent a UTF-16 code unit. [This may be a UTF-32 code
unit later on...; I don't know what the plans are for Java.]

A 'char' array, 'char[]', can be used to represent a null-terminated
UTF-16 string.

The Java class String can also represent a Unicode string.
___C99 (ISO/IEC 9899:1999) and C++ (ISO/IEC 14882:1998)___

The datatype 'unsigned char' (or less stringently, 'char') can be
used for various byte-oriented character encodings, including
multibyte character encodings like UTF-8. However, functions such as
'isalpha' will work only where a character can be represented in a
single code unit, as will functions like 'toupper', which in
addition give a proper result only if the result is a single
character that fits in a single byte. The datatype 'int' is
sometimes used to represent a value of type 'char'.
The datatype 'wchar_t' (in '<wchar.h>') can be used for various
"wide" character encodings; C and C++ leave the semantics of
'wchar_t' to the specific implementation. 'wchar_t' may be for
Unicode in some compilers, e.g. for UTF-16 or for UTF-32. The width
of 'wchar_t' is compiler-specific and can be as little as 8 bits,
and even if wider it need not be Unicode. Consequently, programs
that need to be portable across any C or C++ compiler should not use
'wchar_t' for storing Unicode text. The datatype 'wint_t' is
sometimes used to represent a value of type 'wchar_t'.
However, programmers can use (one or more) typedefs for Unicode code
units. E.g., one can define 'UTF8_t' to be 'uint8_t', 'UTF16_t' to
be 'uint16_t', or 'UTF32_t' to be 'uint32_t'. The last one is
particularly useful for single code point property interrogation
functions. 'uint[N]_t' for N being 8, 16, 32, or 64 is defined in
'<stdint.h>' for C99 for all computer architectures that natively
have those data widths. Or one can instead use the 'uint_least[N]_t'
or 'uint_fast[N]_t' datatypes, which are provided in all C99
implementations.
Further, programmers can use conditional compilation to choose
between different 'typedef's for the same Unicode code unit name,
depending on platform.
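As a sketch of that idea (my example, not part of the suggested
text): C99 defines UINT16_MAX in '<stdint.h>' exactly when the
exact-width type exists, so it can serve as the condition:

```c
#include <stdint.h>

/* Choose the underlying type for a Unicode code unit name at compile
 * time: use the exact-width type where the platform has one, and
 * fall back to the least-width type, which C99 always provides. */
#ifdef UINT16_MAX
typedef uint16_t       UTF16_t;
#else
typedef uint_least16_t UTF16_t;
#endif

#ifdef UINT32_MAX
typedef uint32_t       UTF32_t;
#else
typedef uint_least32_t UTF32_t;
#endif
```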
===================================================================
I agree with Sandra on the suggested changes to the conformance
section.

    /kent k