From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Mar 02 2004 - 11:33:05 EST
Philippe Verdy wrote on 3/1/2004, 4:10 PM:
> What's in a wchar_t string on unix?What you'll put or find in wchar_t is
> application dependant.
Absolutely not true. wchar_t is COMPILER and C LIB implementation
dependent, not "application dependant".
Why is it COMPILER dependent? Because the ANSI C syntax L"string"
needs to be converted into a wchar_t array by the compiler.
Why is it C LIB implementation dependent? Because the C LIB
implementation needs to know how to handle those wchar_t values inside the
standard ANSI C mbtowc and mbstowcs routines.
It is NOT "application dependant"!!!
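To make the two dependencies concrete, here is a minimal sketch (not part of the original message). The helper names wchar_bits and literal_matches_library are hypothetical. It assumes the program runs in the default "C" locale, where the basic characters a, b and c convert to the same wide values the compiler emits.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Width of wchar_t in bits. This is fixed by the compiler/ABI,
   not by the application. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Compare a wide literal produced by the COMPILER with the result of
   converting the same text at run time through the C LIBRARY.
   Returns 1 if both agree, 0 otherwise. */
static int literal_matches_library(void)
{
    static const wchar_t from_compiler[] = L"abc"; /* encoded at compile time */
    wchar_t from_library[4];

    /* mbstowcs interprets the bytes according to the current locale. */
    size_t n = mbstowcs(from_library, "abc", 4);
    if (n == (size_t)-1)
        return 0; /* invalid multibyte sequence */

    return wcscmp(from_compiler, from_library) == 0;
}
```

Calling wchar_bits() on different platforms typically gives 16 or 32. The application has no say in either that width or the values checked by literal_matches_library().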
> But there's only a guarantee to find a single
> code unit
> (not necessarily a codepoint) for characters encoded in the source and
> compiled
> with the appropriate source charset. But this charset is not necessarily
> Unicode.
> At run-time, functions in the standard libraries that work with or
> return wide
> strings only expect these strings to be encoded according to the
> current locale
> (not necessarily Unicode).
How to stuff the locale encoding into a wchar_t is also not necessarily
straightforward. I once defined an algorithm to stuff 7 planes (two
bytes each, ranging from 0x2121 to 0x7e7e) of CNS 11643 into a 2-byte
wchar_t (94 x 94 x 7 = 61852 < 2^16 = 65536) while I worked for III on
UNIX Traditional Chinese support on SVR4. In that case, what was stored in
wchar_t was neither Unicode nor euc_tw, but a code sequence agreed on
between mbtowc and wctomb.
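The arithmetic above can be illustrated with a sketch of one possible packing. This is NOT the algorithm actually used on SVR4, which the message does not describe. The names cns_pack, cns_unpack and the CNS_BASE offset of 0x80 (leaving the low values free for ASCII and the terminating null) are assumptions made for illustration.

```c
#include <stdint.h>

enum {
    CNS_PLANES = 7,                    /* planes 1..7 */
    CNS_LO     = 0x21,                 /* lowest byte value in a plane */
    CNS_HI     = 0x7e,                 /* highest byte value in a plane */
    CNS_SPAN   = CNS_HI - CNS_LO + 1,  /* 94 values per byte */
    CNS_BASE   = 0x80                  /* hypothetical offset, keeps 0..0x7f free */
};

/* Pack (plane, byte1, byte2) into one 16-bit code.
   Returns 0 on success, -1 if any input is out of range. */
static int cns_pack(unsigned plane, unsigned b1, unsigned b2, uint16_t *out)
{
    if (plane < 1 || plane > CNS_PLANES ||
        b1 < CNS_LO || b1 > CNS_HI || b2 < CNS_LO || b2 > CNS_HI)
        return -1;

    /* Linear index in 0 .. 94*94*7-1 = 61851, then shifted by CNS_BASE. */
    unsigned idx = ((plane - 1) * CNS_SPAN + (b1 - CNS_LO)) * CNS_SPAN
                   + (b2 - CNS_LO);
    *out = (uint16_t)(idx + CNS_BASE);
    return 0;
}

/* Inverse of cns_pack. Returns 0 on success, -1 if code is not a packed value. */
static int cns_unpack(uint16_t code, unsigned *plane, unsigned *b1, unsigned *b2)
{
    unsigned idx = code;

    if (idx < CNS_BASE || idx - CNS_BASE >= CNS_PLANES * CNS_SPAN * CNS_SPAN)
        return -1;
    idx -= CNS_BASE;

    *b2 = idx % CNS_SPAN + CNS_LO;
    idx /= CNS_SPAN;
    *b1 = idx % CNS_SPAN + CNS_LO;
    *plane = idx / CNS_SPAN + 1;
    return 0;
}
```

In this sketch mbtowc would play the role of cns_pack and wctomb the role of cns_unpack. The only requirement is that the two agree with each other.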
> So if you run your program in an environment where the locale is
> ISO-8859-2,
> you'll find code units whose value between 0 and 255 match their
> position in the
> ISO-8859-2 standard,
That may be true for a specific version of a specific implementation. But
it is not necessarily true for all implementations.
> but you won't find the corresponding character
> codepoints
> as defined by Unicode.
> A wchar_t can then be used with any charset whose minimum code unit
> size is
> lower than or equal to the size of the wchar_t type. This may be an
> Unicode
> encoding form, or any other encoding (except UTF-32 if wchar_t is
> defined as a
> 16-bit integer type, which is not enough to represent every single
> Unicode
> codepoint).
> wchar_t is then only convenient for Unicode, as it is generally larger
> than
> char,
100% disagree with the above statement. In fact, wchar_t was NOT
originally designed with Unicode in mind at all. It was mainly designed to
make iterating over text in multibyte character set locales (Shift_JIS,
euc_jp, euc_tw, gb2312, euc_kr, etc) easier.
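The kind of iteration meant here can be sketched as follows (an illustration, not code from the original message). The helper name mb_char_count is hypothetical. It walks a multibyte string one character at a time with the standard mbtowc, whatever the encoding of the current locale is.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Count characters (not bytes) in a multibyte string encoded in the
   current locale. Returns (size_t)-1 on an invalid sequence. */
static size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t left = strlen(s);
    wchar_t wc;

    mbtowc(NULL, NULL, 0); /* reset any shift state */

    while (left > 0) {
        /* n is the number of bytes consumed by one character. */
        int n = mbtowc(&wc, s, left);
        if (n < 0)
            return (size_t)-1; /* invalid or incomplete sequence */
        if (n == 0)
            break;             /* reached a null character */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```

Without wchar_t and mbtowc, each application would have to know the lead-byte rules of Shift_JIS, euc_jp and so on to find character boundaries. With them, the loop stays the same and only the locale changes.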
> but its presence does not mean it will support UTF-16 or UTF-32
> (in ANSI
> C, wchar_t is allowed to represent the same type as char). [...]
Same "size" as char, not same "type" as char.
> Unlike Java's "char" type which is always an unsigned 16-bit integer
> on all
> platforms, there's no standard size for wchar_t in C and C++...
Agree.