Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 18:09:33 CST

    On 2005/01/18 21:25, Jon Hanna at jon@hackcraft.net wrote:

    >> Under C/C++, one will use a wchar_t which is always of exactly 32-bit,
    >> regardless what internal word structure the CPU is using in
    >> its memory bus.
    >
    > wchar_t can be 7bits in size or more than 128bits.

    Whatever the standard permits, modern platforms such as GNU have decided
    that it will be 32 bits. See
    <http://www.cl.cam.ac.uk/~mgk25/unicode.html>.
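    Since the width is implementation-defined, a program can simply check it
    on the platform at hand. A minimal sketch in C; the function names
    wchar_bits and wchar_holds_any_code_point are hypothetical, chosen only
    for illustration:

```c
#include <limits.h>  /* CHAR_BIT */
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

/* Width of wchar_t in bits on the platform compiling this file. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if a single wchar_t can hold every code point up to U+10FFFF. */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

    Code that relies on a 32-bit wchar_t could call wchar_bits() once at
    startup and refuse to run if the result is smaller than expected.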

    >>> Not sure if I understand you correctly. What about 00 vs.
    >> C0.80, E0.80.80,
    >>> FE.80.80.80.80.80.80 etc.?
    >>
    >> I have added functions that admit creating regular
    >> expressions also for the
    >> overloaded UTF-BSS ("UTF-8") multibytes. This way, a lexer can provide
    >
    > They aren't "overloaded", they are invalid.

    You probably mean that the overloaded UTF-BSS (or whatever the correct name
    is) multibytes are illegal under UTF-8.
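
    To make the distinction concrete: a sequence such as C0.80 or E0.80.80
    decodes to the same value as the single byte 00, only with more bytes
    than necessary. The following is a rough sketch in C, not the lexer
    functions mentioned above; the name decode_seq and the table min_for_len
    are hypothetical. It handles the 1- to 6-byte forms only, so a lead byte
    of FE or FF is rejected as malformed.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest value that actually needs a sequence of n bytes (index n). */
static const uint32_t min_for_len[7] = {
    0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/*
 * Decode one multibyte sequence starting at s, with n bytes available.
 * Returns the sequence length (1..6) and stores the decoded value in *out,
 * or returns 0 if the sequence is malformed or truncated.
 * *overlong is set to 1 when the value would have fit in fewer bytes.
 */
size_t decode_seq(const unsigned char *s, size_t n,
                  uint32_t *out, int *overlong)
{
    size_t len, i;
    uint32_t v;

    if (n == 0) return 0;
    if (s[0] < 0x80)      { len = 1; v = s[0]; }
    else if (s[0] < 0xC0) return 0;            /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; v = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; v = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { len = 4; v = s[0] & 0x07; }
    else if (s[0] < 0xFC) { len = 5; v = s[0] & 0x03; }
    else if (s[0] < 0xFE) { len = 6; v = s[0] & 0x01; }
    else return 0;                             /* FE, FF: not handled here */

    if (len > n) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *out = v;
    *overlong = v < min_for_len[len];
    return len;
}
```

    A strict UTF-8 decoder would reject any sequence for which *overlong
    comes back set; a lexer that wants to accept the longer forms can use
    the same flag to tell them apart.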

      Hans Aberg


