From: Antoine Leca (Antoine10646@Leca-Marti.org)
Date: Tue Mar 02 2004 - 05:50:55 EST
Rick Cameron asked:
> It seems that most flavours of unix define wchar_t to be 4 bytes.
As your "most" suggests, this is not universal. What if it is 8-byte? ;-)
> If the locale is set to be Unicode,
That part is highly suspect.
Since you write that, you already know that the wchar_t encoding (as well as
the char one) depends on the locale setting. Few people get this right. So you
then also know that wchar_t is implementation-defined in all the relevant
standards (ANSI, C99, POSIX, SUS). In other words, the answer is in
the documentation for YOUR implementation.
Now, we can try to guess. But these are only guesses.
> what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units
zero-extended to 4 bytes?
The latter is a heresy. Nobody should be foolish enough to do this. UCS-2
with the code units zero-extended to 4 bytes might be an option, but if an
implementor has support for UTF-16, why would she store extended UTF-16 (in
whatever form, i.e. split or joined, 4 or 8 bytes) in wchar_t? Any evidence
of this would be a severe bug, IMHO.
Back to your original question, and assuming "the locale is set to be
Unicode", you are as likely to encounter UTF-32 values (which would mean the
implementation does have Unicode 3.1 support) as zero-extended UCS-2 (the
case of a pre-3.1 Unicode implementation). Other values would be very
strange, IMHO.
Recent standards have a feature test macro, __STDC_ISO_10646__, which, if
defined, will tell you the answer: defined to be greater than 1999xxL means
UTF-32 values. Defined but less than 1999xxL probably means no surrogate
support, hence zero-extended UCS-2. Undefined does not tell you anything.
Unfortunately, undefined is also the most common setup.
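For instance, a minimal probe, as a sketch only (the 199912L cut-off merely
mirrors the 1999xxL figure above; check your implementation's documentation
for the exact value it defines):

    #include <stdio.h>

    int main(void)
    {
    #if defined(__STDC_ISO_10646__) && __STDC_ISO_10646__ > 199912L
        printf("wchar_t holds UTF-32 (ISO 10646 as of %ldL)\n",
               (long)__STDC_ISO_10646__);
    #elif defined(__STDC_ISO_10646__)
        printf("wchar_t holds ISO 10646 (%ldL), probably without "
               "surrogate support: zero-extended UCS-2\n",
               (long)__STDC_ISO_10646__);
    #else
        printf("__STDC_ISO_10646__ undefined: no guarantee at all\n");
    #endif
        return 0;
    }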
Frank Yung-Fong Tang answered:
> The more interesting question is, why do you need to know the
> answer of your question. And the ANSI/C wchar_t model basically
> suggest, if you ask that question, you are moving to a wrong direction....
I am not so sure. I agree that the wchar_t model is basically a dead end
nowadays. But until the new model (char16_t, char32_t) gets formalized and
implemented, it is better than nothing, since implementers did try to get it
right. Depending on the degree of conformance you require, and also on how
much heavy machinery you are willing to bring in (which could rule out ICU,
for instance), the minimalistic wchar_t support might help.
Philippe Verdy wrote:
> What's in a wchar_t string on unix? What you'll put or find in wchar_t
> is application dependant.
Disagree. The result of mbtowc is NOT application-dependent. It is rather
implementation-dependent, which might be even more disturbing...
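To make that concrete, a minimal sketch (it assumes the environment selects
a UTF-8 locale; the printed value is precisely what is implementation-dependent):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        wchar_t wc;
        const char *s = "\xC3\xA9";    /* U+00E9 in UTF-8 */

        setlocale(LC_CTYPE, "");       /* use the environment's locale */
        if (mbtowc(&wc, s, 2) > 0)     /* same call, same input... */
            printf("0x%lX\n", (unsigned long)wc);  /* ...but this value is
                                                      up to the implementation */
        return 0;
    }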
> But there's only a guarantee to find a single
> code unit (not necessarily a codepoint) for characters encoded in the
> source and compiled with the appropriate source charset.
Can't parse that.
> But this charset is not necessarily Unicode.
This you know at the moment you are compiling (which is not the same as the
result of using the library functions, by the way).
> At run-time, functions in the standard libraries that work with or
> return wide strings only expect these strings to be encoded
> according to the current locale (not necessarily Unicode).
> So if you run your program in an environment where the locale is
> ISO-8859-2,
... you are answering something completely opposite to what he asked, since
it specified:
: > If the locale is set to be Unicode,
> you'll find code units whose value between 0 and 255 match their
> position in the ISO-8859-2 standard,
That is wrong. When "your locale is ISO-8859-2" (whatever that may really
mean), you know next to nothing about the encoding used for wchar_t. It might
be ISO-8859-2 (the degenerate case where wchar_t == char), it might be
Unicode (the best probability on Unix if wchar_t is 4 bytes), or it might
even be something very different, like a flat EUC-XX (on some East-Asian
flavours of Unix). The only thing you know for sure is that it is not EBCDIC!
> A wchar_t can then be used with any charset whose minimum code unit size
> is lower than or equal to the size of the wchar_t type.
Wrong again. "Any" is too strong. There are many charsets that, while being
"smaller" than some others, cannot be shoe-horned into the encoding of the
wider form. For example, if wchar_t is 2 bytes and holds values according to
EUC-JP, you cannot encode Big-5 or ISCII with it, even though their minimum
code unit size is equal or even smaller: this is because not all the needed
codepoints are defined in EUC-JP.
Unicode, among its properties, has the one of encompassing all existing
charsets, so it aims at satisfying the property you spelled out. But the mere
fact that this is an objective of Unicode should show that all the other
existing charsets do not satisfy it.
> wchar_t is then only convenient for Unicode,
I cannot see what you are inferring this from.
> However a "wide" string constant (of type wchar_t*) should be able
> to store and represent any Unicode character or codepoint,
> possibly by mapping one codepoint to several wchar_t code units...
This is specifically prohibited.
The very point of wchar_t was to avoid the multibyte stuff. So if you
support Unicode 3.1 (surrogates), you are required to have a wchar_t of 21
bits or more. A 16-bit wchar_t limits you ipso facto to 3.0 support.
I confirmed this several times with the C committee, because I wanted, if at
all possible, to qualify existing 16-bit wchar_t implementations so that they
could use the __STDC_ISO_10646__ feature (to indicate e.g. Philippine
script support). The committee made it very clear that this is not possible.
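A compile-time consequence you can exploit, as a sketch (it relies on
WCHAR_MAX from C99's <stdint.h>, which is required to be usable in #if):

    #include <stdint.h>    /* C99: defines WCHAR_MAX */

    #if !defined(__STDC_ISO_10646__)
    #error "wchar_t encoding unspecified: consult your implementation's docs"
    #elif WCHAR_MAX < 0x10FFFF
    #error "wchar_t too narrow for Unicode 3.1: surrogate pairs not allowed"
    #endif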
> Unlike Java's "char" type which is always an unsigned 16-bit integer
> on all platforms, there's no standard size for wchar_t in C and C++...
After all, this is correct.
Rick Cameron then wrote:
> OK, I guess I need to be more precise in my question.
> For each of the popular unices (Solaris, HP-UX, AIX, and - if
> possible - linux), can anyone answer the following question:
>
> Assuming that the locale is set to Unicode
What do you mean by "the locale is set to Unicode"? This:
    setlocale(LC_ALL, "Unicode");     /* the result is garbage, for all I know */
this:
    setlocale(LC_CTYPE, "qq_XX.utf8");
or something else? Does it include specially enabled Japanese and Chinese
versions? (Those may use some EUC encoding for wchar_t, in order to ease
some compatibility.)
And of course, this depends highly on the release number (mainly of libc).
>, what is in a wchar_t string? Is it UTF-32 or pseudo-UTF-16
> (i.e. UTF-16 code units, zero-extended to 32 bits)?
See above about pseudo-UTF-16.
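Coming back to the locale names: the only portable way I know to find out
what your system accepts is to probe. A minimal sketch ("en_US.UTF-8" is
merely one plausible name; the set of accepted names is itself
implementation-defined):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        const char *name = setlocale(LC_CTYPE, "en_US.UTF-8");
        printf("%s\n", name ? name : "not supported; try another name");
        return 0;
    }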
> I'm trying to find out what the O/S expects to be in a
> wchar_t string.
By the way, a Unix OS does not expect anything in a wchar_t[]. It does not
care about them at any single point I can think of.
There are libc functions that do process them: the mb*towc*/wc*tomb*
series, the wcs* series, the w*/f*ws versions of the <stdio.h> functions,
some features of classic printf and scanf. But none of this (as opposed
to Windows NT) is passed down to the OS, at least not without a chance to
inspect the result first.
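For illustration, the whole conversion below happens inside libc, not the
kernel (a sketch; the values that end up in buf are, once more,
implementation-dependent):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        wchar_t buf[8];
        size_t n;

        setlocale(LC_CTYPE, "");          /* locale from the environment */
        n = mbstowcs(buf, "abc", 8);      /* char[] -> wchar_t[], purely
                                             in user space */
        if (n != (size_t)-1)
            printf("converted %lu wide characters\n", (unsigned long)n);
        return 0;
    }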
> The reason we want to know this is that we want to be able to write a
> function that converts from UTF-8 (stored in a char []) to wchar_t []
> properly. Obviously the function may need to behave differently on
> different flavours of unix.
OK, thanks for explaining your problem.
Basically, if wchar_t encodes UTF-32, you are free of any problem. Clearly,
this is (or should be) what all current releases do. So your problem is
how to handle those old versions (which ones?) that do not know about
surrogates, and will expect a surrogate pair to be stored in two wchar_t
cells, and then will handle this "correctly", as far as that means
anything.
Did I reformulate your question correctly?
A way to see this is: what happens to some old Unix (or anything else) when
fed plane 1 characters? I would say (assuming it is not outright broken),
well, nothing special: before Unicode 3.1, the standard was the ISO 10646
31-bit form, which says every value up to 0x7FFFFFFF may be used, and even
says that the greater values may indeed be used (for private use: this is,
by the way, the biggest incompatibility introduced by the limitation to
U+10FFFD). So your >0xFFFF values should be handled correctly by the OS,
which will not do anything special with them. In particular, of course, it
will not print them, since it does not have any clue about such characters,
whatever the encoding used!
So I think the bottom line is: who cares about the encoding of the upper
planes? (Provided it *is* Unicode for the lower groups: as you can see, even
this is difficult to say for sure.)
On the other hand, encoding surrogate characters as two wchar_t cells is
very, very likely to bring you a lot of problems, for no real benefit I can
envision. Furthermore, it only matters on old platforms that are fading
away, so it adds maintenance difficulties on top.
Go ahead, encode as UTF-32, whatever libc really expects. Ultimately, the
only one who will use the data is you, anyway!
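To be concrete, here is a minimal sketch of such a decoder (the function
name and interface are mine, not any standard API; it emits UTF-32 scalar
values into a uint32_t, which you can then store into your 4-byte wchar_t;
overlong forms and surrogate code points are rejected):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence from s (len bytes available) into *out.
       Returns the number of bytes consumed, or 0 on malformed input. */
    static size_t utf8_decode(const unsigned char *s, size_t len,
                              uint32_t *out)
    {
        if (len == 0)
            return 0;
        if (s[0] < 0x80) {                       /* 1 byte: ASCII */
            *out = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0) {             /* 2 bytes: U+0080..U+07FF */
            if (len < 2 || (s[1] & 0xC0) != 0x80)
                return 0;
            *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return *out >= 0x80 ? 2 : 0;         /* reject overlong forms */
        }
        if ((s[0] & 0xF0) == 0xE0) {             /* 3 bytes: U+0800..U+FFFF */
            if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return 0;
            *out = ((uint32_t)(s[0] & 0x0F) << 12)
                 | ((uint32_t)(s[1] & 0x3F) << 6)
                 |  (uint32_t)(s[2] & 0x3F);
            if (*out < 0x800 || (*out >= 0xD800 && *out <= 0xDFFF))
                return 0;                        /* overlong, or surrogate */
            return 3;
        }
        if ((s[0] & 0xF8) == 0xF0) {             /* 4 bytes: plane 1 and up */
            if (len < 4 || (s[1] & 0xC0) != 0x80
                        || (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
                return 0;
            *out = ((uint32_t)(s[0] & 0x07) << 18)
                 | ((uint32_t)(s[1] & 0x3F) << 12)
                 | ((uint32_t)(s[2] & 0x3F) << 6)
                 |  (uint32_t)(s[3] & 0x3F);
            return (*out >= 0x10000 && *out <= 0x10FFFF) ? 4 : 0;
        }
        return 0;                                /* invalid lead byte */
    }

The reverse encoder is the same bit-shuffling read backwards.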
Hope it helps,
Antoine
PS: if you write the UTF-8 to UTF-32 decoder, you should also write the
reverse encoder: leaving the OS to do the coding back to UTF-8 won't give
you useful results.