From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:17 CST
On 2005/01/18 13:00, Antoine Leca at Antoine10646@leca-marti.org wrote:
>> With a full 32-bit encoding, one can also use UTF-8 to encode
>> binary data.
>
> Why?
The UTF-BSS ("UTF-8") encoding is defined as a sequence of bytes, so it is
not sensitive to the big-endian/little-endian issue. And perhaps
people might invent other, creative uses.
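
To illustrate the point about byte order, here is a minimal sketch in C of an
encoder for such a "full 32-bit" scheme. It assumes the original multibyte
pattern is simply extended with a 7-byte form whose lead byte is FE (the form
that appears in the FE.80.80.80.80.80.80 example later in this thread); the
name ext_utf8_encode is hypothetical. The output is computed arithmetically
from the value, one byte at a time, so it comes out the same on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v into out (room for 7 bytes); returns the number of bytes written.
 * Assumption: lead byte 0xFE introduces a 7-byte form with six continuation
 * bytes, which is enough to carry all 32 bits. */
size_t ext_utf8_encode(uint32_t v, unsigned char out[7])
{
    /* Lead-byte marker for each sequence length 1..7 (index 0 unused). */
    static const unsigned char lead[8] =
        {0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE};
    size_t n, i;

    assert(out != NULL);

    if (v < 0x80u)            n = 1;
    else if (v < 0x800u)      n = 2;
    else if (v < 0x10000u)    n = 3;
    else if (v < 0x200000u)   n = 4;
    else if (v < 0x4000000u)  n = 5;
    else if (v < 0x80000000u) n = 6;
    else                      n = 7;

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = n - 1; i > 0; --i) {
        out[i] = (unsigned char)(0x80u | (v & 0x3Fu));
        v >>= 6;
    }
    /* Whatever bits remain go into the lead byte (none for the FE form). */
    out[0] = (unsigned char)(lead[n] | v);
    return n;
}
```

For example, the value 0x20AC yields the three bytes E2 82 AC, and 0xFFFFFFFF
yields FE 83 BF BF BF BF BF under the stated assumption.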
> Look, I have two computers.
> One generally runs DOS software programmed with TurboPascal, and dealing
> with 32-bit unsigned data is a nightmare (no built-in data type); I have to go
> back to assembly for every operation on such "binary data", or else use
> the 64-bit signed data type via the FPU, but with a noticeable performance
> hit.
This is easily resolved by switching to, say, C/C++ and a better OS. :-) Or a
Pascal compiler that supports a 32-bit integral data type -- it does not
matter if it is signed or unsigned.
> I very much prefer having "16-bit binary data" with it ;-).
If you look at <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, it looks as
though only UTF-8 and UTF-32 will be used in the UNIX world. A
significant factor is that GNU now reserves wchar_t for use with 32-bit
integral types only. UTF-16 is probably there only for backward
compatibility. One will probably be better off converting UTF-16 into UTF-8
or UTF-32.
> Of course, in the real world I am using streams (including counted strings)
> of 8-bit data, like anybody.
Under C/C++ one can actually use, apart from byte streams, other streams such
as wchar_t streams. But these are then sensitive to the
big-endian/little-endian issue.
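
A small sketch in C of why a stream of 32-bit units is byte-order sensitive
while an explicitly serialized one is not. The helper names
host_is_little_endian and store_be32 are hypothetical, and uint32_t stands in
for a 32-bit wchar_t here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Serialize v most significant byte first, whatever the host byte order.
 * Writing the unit this way, instead of copying its memory image, gives
 * the same four bytes on every machine. */
void store_be32(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}
```

Copying the memory image of 0x0001F600 with memcpy gives 00 F6 01 00 on a
little-endian host, whereas store_be32 gives 00 01 F6 00 everywhere.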
> The other has a 64-bit based architecture. I have difficulty matching your
> proposition (about "full") above with it. In fact, I am already entangled
> with software that was designed as "Unified architecture" and only
> anticipated the use of 32-bit integers and pointers.
> So I beg your pardon, but I feel a bit angry about your proposal.
Under C/C++ with GNU, one will use a wchar_t that is always exactly 32 bits
wide, regardless of the word size the CPU uses on its memory bus.
So your concerns here are a non-issue. Moreover, the latest edition of C,
C99, has integral types whose sizes are indicated in their names (such as
uint32_t in <stdint.h>), which the compiler can support. So if you choose
such a 32-bit type, it will again be 32 bits wide regardless of what the CPU
uses on its memory bus.
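
As a minimal sketch of the C99 point, assuming the compiler provides the
optional exact-width type uint32_t (the typedef name code_unit and the two
helper functions are hypothetical):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the 32-bit unit used to hold one encoded number. */
typedef uint32_t code_unit;

/* Width of code_unit in bits; 32 wherever uint32_t exists at all. */
size_t code_unit_bits(void)
{
    return sizeof(code_unit) * CHAR_BIT;
}

/* Largest value a code_unit can hold, obtained by unsigned wraparound. */
code_unit code_unit_max(void)
{
    return (code_unit)-1;
}
```

On a 64-bit machine, sizeof(long) or sizeof(void *) may well be 8, but
code_unit_bits() still reports 32.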
>> It also simplifies somewhat the implementation of
>> Unicode in lexer generators (such as Flex): The leading byte then
>> covers all 256 combinations. All 2^32 numbers should probably be
>> there for generating proper lexer error messages.
>
> Not sure if I understand you correctly. What about 00 vs. C0.80, E0.80.80,
> FE.80.80.80.80.80.80 etc.?
I have added functions that allow creating regular expressions also for the
overlong UTF-BSS ("UTF-8") multibyte sequences. This way, a lexer can provide
proper error handling in those cases. Otherwise, Flex currently has no
official Unicode support, so it is unknown what using it for creating Unicode
lexers will look like. This was the reason I brought the questions up here,
to get some input.
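
For concreteness, here is a hand-written classifier in C for the sequences
you list. It is not the regular-expression approach described above, only a
sketch of the distinction a lexer would need to make. It assumes a 7-byte
form with lead byte FE and treats a lead byte of FF as malformed; the names
seq_status and classify_sequence are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum seq_status { SEQ_OK, SEQ_OVERLONG, SEQ_MALFORMED };

/* Classify the multibyte sequence starting at s, with len bytes available.
 * On SEQ_OK or SEQ_OVERLONG, *used receives the sequence length and *value
 * the decoded number. On SEQ_MALFORMED both are left untouched. */
enum seq_status classify_sequence(const unsigned char *s, size_t len,
                                  size_t *used, uint32_t *value)
{
    /* Smallest value that genuinely needs n bytes, indexed by n. */
    static const uint32_t min_for_len[8] =
        {0, 0, 0x80u, 0x800u, 0x10000u, 0x200000u, 0x4000000u, 0x80000000u};
    unsigned char b;
    uint64_t v;   /* 64 bits: the 7-byte form can carry up to 36 bits */
    size_t n, i;

    assert(s != NULL && used != NULL && value != NULL);
    if (len == 0)
        return SEQ_MALFORMED;

    b = s[0];
    if (b < 0x80)       { n = 1; v = b; }
    else if (b < 0xC0)  return SEQ_MALFORMED;   /* stray continuation byte */
    else if (b < 0xE0)  { n = 2; v = b & 0x1F; }
    else if (b < 0xF0)  { n = 3; v = b & 0x0F; }
    else if (b < 0xF8)  { n = 4; v = b & 0x07; }
    else if (b < 0xFC)  { n = 5; v = b & 0x03; }
    else if (b < 0xFE)  { n = 6; v = b & 0x01; }
    else if (b == 0xFE) { n = 7; v = 0; }       /* assumed 7-byte form */
    else                return SEQ_MALFORMED;   /* 0xFF: assumed invalid */

    if (len < n)
        return SEQ_MALFORMED;                   /* truncated sequence */

    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return SEQ_MALFORMED;               /* not a continuation byte */
        v = (v << 6) | (uint64_t)(s[i] & 0x3F);
    }
    if (v > 0xFFFFFFFFu)
        return SEQ_MALFORMED;                   /* does not fit in 32 bits */

    *used = n;
    *value = (uint32_t)v;
    return v < min_for_len[n] ? SEQ_OVERLONG : SEQ_OK;
}
```

With this sketch, 00 is reported as SEQ_OK, while C0.80, E0.80.80 and
FE.80.80.80.80.80.80 all decode to 0 and are reported as SEQ_OVERLONG, so a
lexer could issue a specific error message for them.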
Hans Aberg