Re: wchar_t (was RE: 32'nd bit & UTF-8)

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Mon Jan 24 2005 - 03:53:37 CST

    Lars Kristan wrote:
    > What is wchar_t?

    Historically, it is a way that the Unix vendors (Sun, Apollo/HP, etc.) found
    around 1986-87 (work which led to the WPI in 1990) to deal with the
    double-byte character sets used in East Asia with the same algorithms used
    with 8-bit char.

    > Yes, it is a Unicode-related type.

    Unicode has one property that breaks with the historical wchar_t: the
    encoding is the same whatever the locale, that is, hanzi "one" is always the
    same character (I think it is U+4E00).
    Of course, being wider than 8 bits, some implementations built their Unicode
    support on top of wchar_t; others kept the two separate. The relationship is
    not tight.
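    A minimal C sketch of that distinction (what mbtowc() stores, and whether it
    succeeds at all, depends on the current locale, which is exactly the point;
    the bytes below are the UTF-8 form of U+4E00):

        #include <stdio.h>
        #include <stdlib.h>
        #include <locale.h>

        int main(void)
        {
            /* hanzi "one" is always U+4E00 in Unicode; these UTF-8 bytes
               encode exactly that character. */
            const char *mb = "\xE4\xB8\x80";

            /* What ends up in a wchar_t, however, is decided by the current
               locale and the implementation: it may be 0x4E00, an EUC- or
               Shift-JIS-derived value, or the call may simply fail. */
            setlocale(LC_CTYPE, "");
            wchar_t wc;
            if (mbtowc(&wc, mb, 3) > 0)
                printf("wchar_t value in this locale: 0x%lX\n",
                       (unsigned long)wc);
            else
                printf("this locale cannot decode those bytes at all\n");
            return 0;
        }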

    > It does not imply Unicode.

    How can something "invented" around 1987 imply another thing "invented"
    three years later?

    > Back to wchar_t. Let's introduce wchar32_t.

    Actually, the move is toward char32_t.
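    As a rough sketch of what that looks like, assuming the <uchar.h> header and
    the U'' / U"" literals of the char16_t/char32_t proposal (as later
    standardized in C11/C++11), so purely illustrative here:

        #include <uchar.h>   /* char16_t, char32_t */
        #include <stdio.h>

        int main(void)
        {
            /* One code unit == one code point, whatever the locale. */
            char32_t one    = U'\u4E00';          /* hanzi "one" */
            char32_t text[] = U"\u4E00\u4E8C";    /* "one two"   */
            printf("U+%04lX, %zu code units\n",
                   (unsigned long)one,
                   sizeof text / sizeof text[0] - 1);
            return 0;
        }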

    > Most Unicode
    > functions can be implemented using that type. But it may also be
    > useful to define some of those functions for UTF-8 strings.

    Yes. See ICU.

    > Do we need a new type for that? In C, one would get away
    > with the char type, but for C++ it would be useful to introduce
    > the wchar8_t type.

    I may agree with your reasoning, but I am not qualified enough to discuss it
    accurately.

    > Now notice that while you can implement some functions for
    > wchar32_t type with characters, the same function for wchar8_t type
    > must (well, should) operate on strings:

    Many people who work on implementations of Unicode have already noticed that
    the APIs should be based on strings, even when operating with UTF-32 units.
    The typical example is the uppercase of ß.
    Of course, the result is that very often you are wasting resources; that is
    the price to pay for the change of abstraction level. Whether you can afford
    this price, or only part of it, depends on your project: some can, and
    others cannot.
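    A minimal sketch of why, using ICU's u_strToUpper() (error handling and
    buffer sizing trimmed):

        #include <stdio.h>
        #include <unicode/ustring.h>   /* ICU: u_strToUpper() */

        int main(void)
        {
            UChar      src[] = { 0x0073, 0x00DF, 0 };  /* "sß", 2 code units */
            UChar      dst[8];
            UErrorCode err   = U_ZERO_ERROR;

            /* Uppercasing ß yields "SS": the result ("SSS", 3 units) is
               longer than the input, so a one-character-in,
               one-character-out towupper()-style interface cannot
               express it. */
            int32_t n = u_strToUpper(dst, 8, src, -1, NULL, &err);
            printf("input: 2 units, output: %d units\n", (int)n);
            return 0;
        }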

    A separate issue is the lack of a widely accepted C API for Unicode. I
    believe the reason is the lack of an accepted basis for the handling
    (creation, destruction) of strings in the standard C library. C++ and Java
    (and about every widely used language except Fortran) are different in this
    respect.

    > And, finally, to get back to the text vs binary distinction. On
    > UNIX, (wchar8_t *) would equal (char *).

    This is only true if you restrict yourself to a UTF-8 locale, or to a locale
    whose character set is a subset of UTF-8 (such as US-ASCII). It will not be
    true with, e.g., a Latin-1 locale.
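    A small POSIX-flavoured sketch of the check this implies, assuming
    nl_langinfo() is available:

        #include <langinfo.h>
        #include <locale.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            setlocale(LC_CTYPE, "");               /* adopt the user's locale */
            const char *cs = nl_langinfo(CODESET); /* "UTF-8", "ISO-8859-1", ... */

            if (strcmp(cs, "UTF-8") == 0)
                printf("char* strings can be treated as UTF-8 here\n");
            else
                printf("codeset is %s; char* is NOT UTF-8 (the Latin-1 case)\n",
                       cs);
            return 0;
        }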

    > The other problem is that (wchar8_t *) based processing might not
    > be possible, for example if a platform does not provide even the
    > (wchar8_t *) wrappers.

    What do you mean here?
    What does "does not provide the wrappers" mean?
    The classic behaviour for any library is to provide the wrappers even if they
    are sometimes unnecessary (in which case they are simply not used, or they
    are no-ops). If you are saying that in certain cases (but not always) there
    is a need for additional wrappers, then a correct implementation should
    provide them, or it should state that it cannot support the other platform,
    which may in turn restrict the acceptance of the new library/paradigm.

    > But there could be an incurred cost if you need to
    > constantly convert from UTF-8 (wchar8_t *) each time you want to
    > call system APIs.

    In any case, there is a cost on Windows when you are using unadapted
    programs: since the (NT) kernel operates with UTF-16 strings everywhere,
    there is an additional cost for about every call to a system API. Whether
    the conversion starts from the ACP codepage or from UTF-8 is probably not
    relevant; it is just a bunch of work that is not yet done (or not finished,
    since I believe ICU provides a fair share of your proposal, although with a
    different presentation).
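    As a sketch of that per-call cost for a program keeping UTF-8 internally
    (DeleteFileW merely stands in for any wide NT API):

        #include <windows.h>

        /* Convert a UTF-8 argument to UTF-16 and hand it to a wide API.
           The MultiByteToWideChar step is the per-call conversion cost
           discussed above; it exists whether the source is UTF-8 or the
           ACP codepage (CP_ACP instead of CP_UTF8). */
        static BOOL delete_file_utf8(const char *utf8_path)
        {
            wchar_t wide[MAX_PATH];
            int n = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1,
                                        wide, MAX_PATH);
            if (n == 0)
                return FALSE;              /* invalid UTF-8 or too long */
            return DeleteFileW(wide);
        }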

    > Both problems

    Sorry: I paid attention to your post, but I did not find any "problem"
    there, and certainly not something which can be solved with such a _simple_
    addition as 128 new codepoints (it reminds me a lot of the language tags,
    BTW; not that I intend a parallel, though).

    Antoine


