Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:17 CST

    On 2005/01/18 13:00, Antoine Leca at Antoine10646@leca-marti.org wrote:
    >> With a full 32-bit encoding, one can also use UTF-8 to encode
    >> binary data.
    >
    > Why?

    UTF-BSS ("UTF-8") is not sensitive to the big-/little-endian issue. And
    perhaps people might invent other creative uses.
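
    For illustration, here is a minimal encoder sketch in C (my own, not
    taken from any existing library): each byte is produced by shifting and
    masking, so the output is identical on big- and little-endian hosts. The
    32nd bit would require a seventh byte, with lead byte FE, under the
    extension discussed in this thread.

        #include <stdio.h>

        /* Sketch: encode a value of up to 31 bits in the multibyte scheme.
           Each byte is produced by shifting and masking, so the output is
           independent of the host byte order.  (The 32nd bit would need a
           seventh byte with lead byte FE, as discussed in this thread.) */
        static void put_mb(unsigned long c, FILE *out)   /* c < 0x80000000 */
        {
            int n;                              /* continuation byte count */
            if (c < 0x80UL) { putc((int)c, out); return; }
            if      (c < 0x800UL)      n = 1;   /* 11 payload bits */
            else if (c < 0x10000UL)    n = 2;   /* 16 */
            else if (c < 0x200000UL)   n = 3;   /* 21 */
            else if (c < 0x4000000UL)  n = 4;   /* 26 */
            else                       n = 5;   /* 31 */
            /* Lead byte: n+1 high one-bits, then the payload's top bits. */
            putc((int)(((0xFF00 >> (n + 1)) & 0xFF) | (c >> (6 * n))), out);
            while (n--)                     /* continuations, high to low */
                putc((int)(0x80 | ((c >> (6 * n)) & 0x3F)), out);
        }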

    > Look, I have two computers.
    > One runs mostly DOS software programmed with Turbo Pascal, and dealing
    > with 32-bit unsigned data is a nightmare (no built-in data type): I have
    > to go back to assembly for every operation on such "binary data", or
    > else use the 64-bit signed data type via the FPU, but with a noticeable
    > performance hit.

    This is easily resolved by switching to, say, C/C++ and a better OS. :-)
    Or to a Pascal compiler that supports a 32-bit integral data type -- it
    does not matter whether it is signed or unsigned.

    > I very much prefer having "16-bit binary data" with it ;-).

    If you look at <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, it looks as
    though, in the UNIX world, only UTF-8 and UTF-32 will be used. A
    significant point is that GNU now reserves wchar_t for 32-bit integral
    types only. UTF-16 is probably there only for backwards compatibility;
    one will probably be better off converting UTF-16 into UTF-8 or UTF-32.
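
    The conversion itself is mechanical; the only real work is combining
    surrogate pairs, sketched below in C99 (the function name is my own):

        #include <stdint.h>

        /* Combine a UTF-16 surrogate pair into a single code point,
           suitable for re-encoding as UTF-8 or UTF-32.  Assumes hi is in
           [D800, DBFF] and lo in [DC00, DFFF]; validation omitted. */
        uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
        {
            return 0x10000u + (((uint32_t)(hi - 0xD800u) << 10)
                             |  (uint32_t)(lo - 0xDC00u));
        }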

    > Of course, in the real world I am using streams (including counted
    > strings) of 8-bit data, like anybody else.

    Under C/C++, one can actually use, apart from byte streams, other streams
    such as wchar_t streams. But these are then sensitive to the
    big-/little-endian issue.
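
    A small sketch of the problem (my own illustration, assuming a 32-bit
    wchar_t as under GNU): dumping a wchar_t buffer as raw bytes bakes the
    host byte order into the file, which a UTF-8 byte stream avoids.

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            wchar_t buf[1] = { L'A' };   /* 0x00000041 with 32-bit wchar_t */
            FILE *f = fopen("wide.bin", "wb");
            if (!f) return 1;
            /* Written bytes are 41 00 00 00 on a little-endian host but
               00 00 00 41 on a big-endian one; the file is not portable. */
            fwrite(buf, sizeof buf[0], 1, f);
            fclose(f);
            return 0;
        }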

    > The other has a 64-bit architecture. I have difficulty matching your
    > proposition above (about "full") to it. In fact, I am already entangled
    > with software that was designed as a "unified architecture" and only
    > foresaw the use of 32-bit integers and pointers.
    > So I beg your pardon, but I feel a bit angry about your proposal.

    Under C/C++, one would use a wchar_t that (under GNU, at least) is exactly
    32 bits, regardless of what internal word structure the CPU uses on its
    memory bus. So your concerns here are a non-issue. Moreover, the latest
    edition of C, C99, has exact-width integral types in <stdint.h>, which a
    compiler can support, and whose sizes are fixed by name. So if you choose
    such a 32-bit type, it will again be 32 bits regardless of what the CPU
    uses on its memory bus.
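
    A minimal sketch of such a type in C99 (the typedef name here is just an
    example):

        #include <stdint.h>
        #include <stdio.h>

        typedef uint32_t codepoint;   /* exactly 32 bits wherever provided */

        int main(void)
        {
            codepoint c = 0xFFFFFFFFu;      /* full 32-bit range available */
            /* sizeof c is 4 on any platform with 8-bit bytes and uint32_t,
               whatever the CPU's word size or bus width. */
            printf("%u bytes\n", (unsigned)sizeof c);
            return 0;
        }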

    >> It also simplifies somewhat the implementation of
    >> Unicode in lexer generators (such as Flex): The leading byte then
    >> covers all 256 combinations. All 2^32 numbers should probably be
    >> there for generating proper lexer error messages.
    >
    > Not sure if I understand you correctly. What about 00 vs. C0.80, E0.80.80,
    > FE.80.80.80.80.80.80 etc.?

    I have added functions that allow creating regular expressions also for
    the overlong UTF-BSS ("UTF-8") multibyte sequences. This way, a lexer can
    provide proper error handling in those cases. Otherwise, Flex currently
    has no official Unicode support, so it is unknown what using it to create
    Unicode lexers will look like. That was the reason I brought these
    questions up here: to get some input.
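
    The functions themselves work at the regular-expression level, but the
    underlying check can be sketched in C (an illustration of the idea, not
    the actual code): a decoded value below the minimum for its sequence
    length arrived in an overlong form, and a lexer error rule can report it.

        #include <stdint.h>

        /* Smallest value that genuinely needs n bytes; e.g. C0.80 decodes
           to 0, but 0 needs only one byte, so the sequence is overlong. */
        static const uint32_t min_for_len[7] =
            { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

        int is_overlong(uint32_t decoded, int nbytes)
        {
            return nbytes >= 2 && nbytes <= 6
                && decoded < min_for_len[nbytes];
        }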

      Hans Aberg


