Re: wchar_t (was RE: 32'nd bit & UTF-8)

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Mon Jan 24 2005 - 03:53:37 CST

    Lars Kristan wrote:
    > What is wchar_t?

    Historically, it is a way that the Unix vendors (Sun, Apollo/HP, etc.) found
    around 1986-87 (work which led to the WPI in 1990) to deal with the
    double-byte character sets used in East Asia with the same algorithms used
    with 8-bit char.

    > Yes, it is a Unicode-related type.

    Unicode has one property that breaks with the historical wchar_t: the
    encoding is the same whatever the locale, that is, hanzi "one" is always the
    same character (I think it is U+4E00).
    Of course, being wider than 8 bits, some implementations built their Unicode
    support on top of wchar_t; others kept the two separate. The relationship is
    not tight.
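    A minimal C sketch of that distinction (what mbtowc() stores, and whether it
    succeeds at all, depends on the current locale, which is exactly the point;
    the bytes below are the UTF-8 form of U+4E00):

        #include <stdio.h>
        #include <stdlib.h>
        #include <locale.h>

        int main(void)
        {
            /* hanzi "one" is always U+4E00 in Unicode; these UTF-8 bytes
               encode exactly that character. */
            const char *mb = "\xE4\xB8\x80";

            /* What ends up in a wchar_t, however, is decided by the current
               locale and the implementation: it may be 0x4E00, an EUC- or
               Shift-JIS-derived value, or the call may simply fail. */
            setlocale(LC_CTYPE, "");
            wchar_t wc;
            if (mbtowc(&wc, mb, 3) > 0)
                printf("wchar_t value in this locale: 0x%lX\n",
                       (unsigned long)wc);
            else
                printf("this locale cannot decode those bytes at all\n");
            return 0;
        }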

    > It does not imply Unicode.

    How can something "invented" around 1987 imply another thing "invented"
    three years later?

    > Back to wchar_t. Let's introduce wchar32_t.

    Actually, the move is toward char32_t.
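    As a rough sketch of what that looks like, assuming the <uchar.h> header and
    the U'' / U"" literals of the char16_t/char32_t proposal (as later
    standardized in C11/C++11), so purely illustrative here:

        #include <uchar.h>   /* char16_t, char32_t */
        #include <stdio.h>

        int main(void)
        {
            /* One code unit == one code point, whatever the locale. */
            char32_t one    = U'\u4E00';          /* hanzi "one" */
            char32_t text[] = U"\u4E00\u4E8C";    /* "one two"   */
            printf("U+%04lX, %zu code units\n",
                   (unsigned long)one,
                   sizeof text / sizeof text[0] - 1);
            return 0;
        }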

    > Most Unicode
    > functions can be implemented using that type. But it may also be
    > useful to define some of those functions for UTF-8 strings.

    Yes. See ICU.

    > Do we need a new type for that? In C, one would get away
    > with the char type, but for C++ it would be useful to introduce
    > the wchar8_t type.

    I may agree with your reasoning, but I am not qualified enough to discuss it
    accurately.

    > Now notice that while you can implement some functions for
    > wchar32_t type with characters, the same function for wchar8_t type
    > must (well, should) operate on strings:

    Many people who work on implementations of Unicode have already noticed that
    the APIs should be based on strings, even when operating with UTF-32 units.
    The typical example is the uppercase of ß.
    Of course, the result is that very often you are wasting resources; that is
    the price to pay for the change of abstraction level. Whether you can afford
    this price, or only part of it, depends on your project: some can, and
    others cannot.
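    A minimal sketch of why, using ICU's u_strToUpper() (error handling and
    buffer sizing trimmed):

        #include <stdio.h>
        #include <unicode/ustring.h>   /* ICU: u_strToUpper() */

        int main(void)
        {
            UChar      src[] = { 0x0073, 0x00DF, 0 };  /* "sß", 2 code units */
            UChar      dst[8];
            UErrorCode err   = U_ZERO_ERROR;

            /* Uppercasing ß yields "SS": the result ("SSS", 3 units) is
               longer than the input, so a one-character-in,
               one-character-out towupper()-style interface cannot
               express it. */
            int32_t n = u_strToUpper(dst, 8, src, -1, NULL, &err);
            printf("input: 2 units, output: %d units\n", (int)n);
            return 0;
        }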

    A separate issue is the lack of a widely accepted C API for Unicode. I
    believe the reason is the lack of an accepted basis for the handling
    (creation, destruction) of strings in the standard C library. C++ and Java
    (and about every widely used language except Fortran) are different in this
    respect.

    > And, finally, to get back to the text vs binary distinction. On
    > UNIX, (wchar8_t *) would equal (char *).

    This is only true if you restrict yourself to a UTF-8 locale, or to a locale
    whose character set is a subset of UTF-8 (such as US-ASCII). It will not be
    true with, e.g., a Latin-1 locale.
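    A small POSIX-flavoured sketch of the check this implies, assuming
    nl_langinfo() is available:

        #include <langinfo.h>
        #include <locale.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            setlocale(LC_CTYPE, "");               /* adopt the user's locale */
            const char *cs = nl_langinfo(CODESET); /* "UTF-8", "ISO-8859-1", ... */

            if (strcmp(cs, "UTF-8") == 0)
                printf("char* strings can be treated as UTF-8 here\n");
            else
                printf("codeset is %s; char* is NOT UTF-8 (the Latin-1 case)\n",
                       cs);
            return 0;
        }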

    > The other problem is that (wchar8_t *) based processing might not
    > be possible, for example if a platform does not provide even the
    > (wchar8_t *) wrappers.

    What do you mean here?
    What does "does not provide the wrappers" mean?
    The classic behaviour for any library is to provide the wrappers even if they
    are sometimes unnecessary (in which case they are simply not used, or they
    are no-ops). If you are saying that in certain cases (but not always) there
    is a need for additional wrappers, then a correct implementation should
    provide them, or it should state that it cannot support the other platform,
    which may in turn restrict the acceptance of the new library/paradigm.

    > But there could be an incurred cost if you need to
    > constantly convert from UTF-8 (wchar8_t *) each time you want to
    > call system APIs.

    In any case, there is a cost on Windows when you are using unadapted
    programs: since the (NT) kernel operates with UTF-16 strings everywhere,
    there is an additional cost for about every call to a system API. Whether
    the conversion starts from the ACP codepage or from UTF-8 is probably not
    relevant; it is just a bunch of work that is not yet done (or not finished,
    since I believe ICU provides a fair share of your proposal, although with a
    different presentation).
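    As a sketch of that per-call cost for a program keeping UTF-8 internally
    (DeleteFileW merely stands in for any wide NT API):

        #include <windows.h>

        /* Convert a UTF-8 argument to UTF-16 and hand it to a wide API.
           The MultiByteToWideChar step is the per-call conversion cost
           discussed above; it exists whether the source is UTF-8 or the
           ACP codepage (CP_ACP instead of CP_UTF8). */
        static BOOL delete_file_utf8(const char *utf8_path)
        {
            wchar_t wide[MAX_PATH];
            int n = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1,
                                        wide, MAX_PATH);
            if (n == 0)
                return FALSE;              /* invalid UTF-8 or too long */
            return DeleteFileW(wide);
        }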

    > Both problems

    Sorry: I paid attention to your post, but I did not find any "problem"
    there, and certainly not something which can be solved with such a _simple_
    addition as 128 new codepoints (it reminds me a lot of the language tags,
    BTW; not that I intend a parallel, though).

    Antoine


