L2/01-070

 

From: Karlsson Kent - keka [keka@im.se]
Sent: Friday, January 26, 2001 2:12 PM

Subject: Draft changes for wchar_t and Conformance sections

[Suggested new text at the end of this message]

 

> -----Original Message-----
> From: Sandra O'donnell USG [mailto:odonnell@zk3.dec.com]
...
> These are all optional types in C99, and I think many people

No, they are not quite that.  Clause 7.18.1.1 says for <stdint.h>: "if an
implementation provides integer types with widths of 8, 16, 32, or 64 bits,
it **shall** define the corresponding typedef names." (my emphasis).  And
the typedef names containing the word "least" are **required** for least
widths of 8, 16, 32, and 64 bits (clause 7.18.1.2), irrespective of
architecture (though you may then (very rarely) find 9 or more bits in a
uint_least8_t value representation).
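
To illustrate (my example, not text from the standard): the "least"
types are present in every C99 <stdint.h>, while the exact-width types,
and their limit macros, exist exactly where the implementation provides
those widths:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint_least8_t u = 200;   /* always available; at least 8 bits */
    #ifdef UINT8_MAX             /* defined iff uint8_t is defined    */
        printf("uint8_t exists, max %u\n", (unsigned)UINT8_MAX);
    #endif
        printf("u = %u\n", (unsigned)u);
        return 0;
    }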

> (including
> me!) would object to recommending such encoding-specific types. The
> code I have that currently uses char and cleanly handles
> UTF-8, Latin-1,
> eucJP, etc., would have to be revised to special-case UTF-8 using the
> uint8_t type, for example.

Though "char" (or better "unsigned char") can handle multiple encodings,
by locale, still:

1. There is special-casing: if the locale is a "UTF-8 locale", some
   special functions can be used (like Markus Kuhn's wcwidth[_cjk]()),
   which work only for UTF-8; see the sketch after this list.

2. Even though some things work with "agnostic" datatypes like char
   and wchar_t, not everything a more ambitious Unicode implementation
   needs can be controlled from POSIX locale data.  That is why some
   nail wchar_t to UCS-2/UTF-16, and others use separate names for
   Unicode-specific datatypes.

3. There is one such recommendation there already: UNICHAR; it would
   be more neutral to have UTF8_t, UTF16_t, and UTF32_t.  I don't
   think it would be a good idea to have a UNICHAR which could sometimes
   be UTF-16, sometimes UTF-32, since those encodings need to be treated
   differently.
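
A minimal sketch of such special-casing, assuming a POSIX system where
the locale name "en_US.UTF-8" exists (Markus Kuhn's
mk_wcwidth()/mk_wcwidth_cjk() have the same interface shape as POSIX
wcwidth()):

    #define _XOPEN_SOURCE 500  /* for POSIX wcwidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Only meaningful in a UTF-8 (or other Unicode) locale;
           the locale name is an assumption about this system. */
        setlocale(LC_ALL, "en_US.UTF-8");
        /* U+4E2D is a CJK ideograph; expect a width of 2 columns. */
        printf("wcwidth(U+4E2D) = %d\n", wcwidth(L'\u4E2D'));
        return 0;
    }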

 

> I know we have differing opinions about the desirability of
> encoding-specific
> types, but I think this is waaaaay too controversial to put
> into a small
> section of the Unicode standard that's trying to describe wchar_t.
>
>    . . .
>          There is only one suggested "typedefed" name: UNICHAR,
>    for UTF-16 *code units*, not *characters*.  Suggestion:
>    typedef uint8_t UTF8_t;
>    typedef uint16_t UTF16_t;
>    typedef uint32_t UTF32_t;
>    (I'm not really suggesting to write out those typedefs as
>    C code in the text, even though I did so here.)
>
> Again, I disagree. However, here I was trying simply to use
> as much of the
> existing text as possible. It refers to UNICHAR, so my
> revision did as well.
>   
>          I'm not sure wchar_t can be used for UTF-16 and still
>    fully conform to C99.  I have very much the impression that
>    wchar_t may be UCS-2, or UCS-4, or whatever else can hold
>    any character in the coded character set in a *single* code
>    unit, but not UTF-8 code units, nor UTF-16 code units.
>    (I cannot quote the C standard, since I don't have my
>    (only, paper, FDIS) copy handy).
>
> I almost mentioned that in my original message, but thought it might
> bring the wrath of Microsoft/IBM down upon me. I agree that UTF-16 is
> not useable as a wchar_t encoding because wchar_t is a FIXED WIDTH
> encoding and UTF-16 clearly is not. UCS-2 is okay in wchar_t,
> of course.
>
> It's just that I know that it's a tricky subject for the vendors
> who chose 16-bits as their wchar_t. I mean, if wchar_t is 16 bits
> and wc happens to contain the first half of a surrogate pair, what
> are functions like iswalpha(), wcwidth(), wcstombs(), towupper(),
> etc., supposed to do?

Well, the C (and COBOL?) restrictions are broken anyway, as Mark says:
to[w]upper('ß') should return "SS", but it cannot, since it returns a
single character.  There are more examples in SpecialCasing.txt.  I see
no problem in MS having wchar_t be UTF-16 code units, even though that,
strictly speaking, does not follow the C99 standard.
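
A minimal sketch of the problem (the expected result is an assumption
about typical implementations, not mandated behaviour):

    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* towupper() maps one wide character to one wide character,
           so U+00DF (LATIN SMALL LETTER SHARP S) cannot become the
           two-character sequence "SS"; implementations typically
           return it unchanged. */
        wint_t c = towupper(L'\u00DF');
        printf("towupper(U+00DF) -> U+%04X\n", (unsigned)c);
        return 0;
    }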

> This may be worth discussing, but I don't want to lose sight of the
> existing inaccuracy in R3.0 that says it is NOT Unicode-conformant to
> use a 32-bit wchar_t. I want to make sure that gets fixed.

I don't think it is all that good an idea to use UTF-32 for strings,
though it is useful for isolated characters, and for interrogation
functions on isolated characters.  Even for toupper on an isolated
character, it still does not work properly, since the result may be
more than one character.
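
For the interrogation case, a hypothetical sketch (the function name
and the typedef are mine, for illustration only):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint_least32_t UTF32_t;  /* one UTF-32 code unit */

    /* Interrogating a property of a single, isolated code point
       works fine with UTF-32; here: is it outside the BMP? */
    static int is_supplementary(UTF32_t cp)
    {
        return cp > 0xFFFFu && cp <= 0x10FFFFu;
    }

    int main(void)
    {
        printf("%d %d\n", is_supplementary(0x00DF),
                          is_supplementary(0x1D11E));
        return 0;
    }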

> I'm just not
> sure we can have a productive discussion about UTF-16 as
> wchar_t; it tends
> to cause a lot of heat.
>
>               Regards,
>               -- Sandra
> -----------------------
> Sandra Martin O'Donnell
> Compaq Computer Corporation
> sandra.odonnell@compaq.com
> odonnell@zk3.dec.com



======================================================================
[The following is VERY brief; but I don't want to make it any longer
than the current 5.2. I've deleted a lot of old text that I find
tangential in such a short piece, or that I find to be moot.  I've
added some text about 'char' since that is what Linux folks prefer
for UTF-8, as well as more about typedefs and conditional
compilation.  I hope I have covered what Sandra wanted to cover,
even though I have made some rather thorough changes to Sandra's
suggested text.]

Suggested new text for 5.2:
----------------------------------------------------------------

5.2 Datatypes for Unicode
=========================

Unicode code units (which singly or in sequence represent a Unicode
character) need to be represented by some datatype in programming
languages.  Some programming languages may also have predefined types
(or classes) for Unicode strings.

___Java___

The datatype 'char' in Java represents UTF-16 code units, though
initially it covered only the BMP.  The datatype 'int' is sometimes
used to represent a UTF-16 code unit. [This may be a UTF-32 code unit
later on...; I don't know what the plans are for Java.]

A 'char' array, 'char[]', can be used to represent a null-terminated
UTF-16 string.

The Java class String can also represent a Unicode string.

 

___C99 (ISO/IEC 9899:1999) and C++ (ISO/IEC 14882:1998)___

The datatype 'unsigned char' (or, less stringently, 'char') can be
used for various byte-oriented character encodings, including
multibyte character encodings like UTF-8.  However, functions such as
'isalpha' work only where a character is represented in a single code
unit; functions like 'toupper' in addition give a proper result only
when the character and its uppercase form each fit in a single byte.
The datatype 'int' is sometimes used to represent a value of type
'char'.
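
To make the single-byte limitation concrete (my illustration, not part
of the proposed 5.2 wording; the output described assumes the default
"C" locale):

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        /* "é" (U+00E9) in UTF-8 is the two bytes 0xC3 0xA9; isalpha()
           sees each byte in isolation and, in the "C" locale,
           classifies neither as alphabetic. */
        unsigned char e_acute_utf8[] = { 0xC3, 0xA9 };
        for (int i = 0; i < 2; i++)
            printf("byte 0x%02X: isalpha -> %d\n",
                   e_acute_utf8[i], isalpha(e_acute_utf8[i]));
        return 0;
    }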

The datatype 'wchar_t' (in '<wchar.h>') can be used for various "wide"
character encodings; C and C++ leave the semantics of 'wchar_t' to the
specific implementation.  'wchar_t' may be used for Unicode in some
compilers, e.g. for UTF-16 or for UTF-32.  The width of 'wchar_t' is
compiler-specific and can be as little as 8 bits; even where it is
wider, its encoding need not be Unicode.  Consequently, programs that
need to be portable across any C or C++ compiler should not use
'wchar_t' for storing Unicode text.  The datatype 'wint_t' is
sometimes used to represent a value of type 'wchar_t'.
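
An illustration of the varying width (again mine, not part of the
proposed 5.2 wording):

    #include <limits.h>
    #include <stdio.h>
    #include <wchar.h>   /* wchar_t (also in <stddef.h>) */

    int main(void)
    {
        /* Commonly 16 bits (e.g. Windows, UTF-16 code units) or
           32 bits (e.g. many Unix systems, UTF-32); the standards
           guarantee neither. */
        printf("wchar_t: %zu bits\n", sizeof(wchar_t) * CHAR_BIT);
        return 0;
    }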

However, programmers can use (one or more) typedefs for Unicode
code units.  E.g., one can define 'UTF8_t' to be 'uint8_t', 'UTF16_t'
to be 'uint16_t', and 'UTF32_t' to be 'uint32_t'.  The last one is
particularly useful for single-code-point property interrogation
functions.  The 'uint[N]_t' types, for N being 8, 16, 32, or 64, are
defined in '<stdint.h>' in C99 for all computer architectures that
natively have those data widths.  Alternatively, one can use the
'uint_least[N]_t' or 'uint_fast[N]_t' datatypes, which are provided in
all C99 implementations.  Further, programmers can use conditional
compilation to choose between different 'typedef's for the same
Unicode code unit name depending on platform.
===================================================================
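
As an illustration of that last point (my sketch, outside the proposed
5.2 text): conditional compilation can pick the exact-width type where
the platform provides it, with the always-available "least" type as
the fallback:

    #include <stdint.h>

    /* UINT16_MAX is defined exactly when uint16_t is; likewise for
       the other exact-width types (C99 7.18.2.1). */
    #ifdef UINT16_MAX
    typedef uint16_t       UTF16_t;  /* one UTF-16 code unit */
    #else
    typedef uint_least16_t UTF16_t;  /* >= 16 bits, always available */
    #endif

    #ifdef UINT32_MAX
    typedef uint32_t       UTF32_t;  /* one UTF-32 code unit */
    #else
    typedef uint_least32_t UTF32_t;
    #endif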

 

I agree with Sandra on the suggested changes to the conformance
section.

                /kent k