From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 13:27:18 CST
On 2005/01/18 03:31, Philippe Verdy at verdy_p@wanadoo.fr wrote:
>>> The old RFC you're referring to is not designating UTF-8, but UTF-BSS,
>>> which is a transformation format,
>>
>> OK. Fine, so we have a name for it.
>
> I was not sure about the name of it when writing the message.
According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, UTF is short for
UCS Transformation Format, where UCS stands for Universal Character Set.
The extensions I am describing should certainly have a separate name.
Perhaps UTF-8X for extended, or BTF-8 for "bit (byte) transformation
format".
I should mention that in the first version of the UTF-8 and UTF-32 regular
expression generator functions that I wrote for Unicode character classes, I
excluded the illegal Unicode numbers: the overloaded ones as well as
U+D800-U+DFFF and U+FFFE-U+FFFF. But it turns out that the lexer generator
then becomes more complicated. So I felt it prudent to add regular
expression generator functions for the overloaded UTF-8 numbers as well, so
as to make it convenient to generate error handling.
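To illustrate the kind of byte-level patterns meant here, the following is a
minimal sketch in Python, not the generator functions themselves. It assumes
that "overloaded" corresponds to what the standard calls non-shortest
(overlong) forms; the names WELL_FORMED, OVERLOADED, SURROGATE, NONCHAR and
classify are made up for the example.

```python
import re

# One continuation byte (10xxxxxx).
CONT = rb"[\x80-\xBF]"

# Well-formed UTF-8 byte sequences, following the table in the Unicode
# standard (surrogates and overlong forms are excluded by construction).
WELL_FORMED = re.compile(b"|".join([
    rb"[\x00-\x7F]",
    rb"[\xC2-\xDF]" + CONT,
    rb"\xE0[\xA0-\xBF]" + CONT,
    rb"[\xE1-\xEC]" + CONT + CONT,
    rb"\xED[\x80-\x9F]" + CONT,
    rb"[\xEE-\xEF]" + CONT + CONT,
    rb"\xF0[\x90-\xBF]" + CONT + CONT,
    rb"[\xF1-\xF3]" + CONT + CONT + CONT,
    rb"\xF4[\x80-\x8F]" + CONT + CONT,
]))

# Assumption: "overloaded" is read as overlong (non-shortest) forms.
OVERLOADED = re.compile(b"|".join([
    rb"[\xC0-\xC1]" + CONT,
    rb"\xE0[\x80-\x9F]" + CONT,
    rb"\xF0[\x80-\x8F]" + CONT + CONT,
]))

# U+D800-U+DFFF encoded as if they were ordinary three-byte characters.
SURROGATE = re.compile(rb"\xED[\xA0-\xBF]" + CONT)

# U+FFFE and U+FFFF.
NONCHAR = re.compile(rb"\xEF\xBF[\xBE\xBF]")


def classify(seq: bytes) -> str:
    """Name the category of a single encoded sequence (hypothetical helper)."""
    if OVERLOADED.fullmatch(seq):
        return "overloaded"
    if SURROGATE.fullmatch(seq):
        return "surrogate"
    if NONCHAR.fullmatch(seq):
        return "noncharacter"
    if WELL_FORMED.fullmatch(seq):
        return "ok"
    return "other"
```

With separate patterns for the wrong cases, a lexer rule can match a bad
sequence as a unit and report it, instead of failing on the first byte.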
The Unicode standard is like Big Brother in George Orwell's "1984", making
it possible to speak only about what is right, but not about what is wrong.
The lexer generator needs to be able to speak about what is wrong as well,
in order to handle it properly.
Besides, even though Unicode has declared that it will never use more than
21 bits, its track record shows that it has reneged on such promises before.
It might be prudent to nail down a full 32-bit encoding, declaring UTF-8/32
to be subsets of that.
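As a rough sketch of the subset relation, here is the original multi-byte
scheme extended to six bytes, as in the older UTF-8 RFC. This is only an
assumption about what such an encoding could look like: it reaches 31 bits,
not 32, so a full 32-bit version would need a further form that is not shown.
The name encode_extended is made up for the example.

```python
def encode_extended(n: int) -> bytes:
    """Encode n (0 .. 0x7FFFFFFF) with the 1-to-6 byte UTF-8 bit pattern."""
    if n < 0 or n > 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range of this sketch")
    if n < 0x80:
        return bytes([n])
    # Smallest length whose payload can hold n: 11, 16, 21, 26, 31 bits.
    for length, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if n < limit:
            break
    # Lead byte prefix: 'length' one-bits followed by a zero bit.
    prefix = (0xFF << (8 - length)) & 0xFF
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))  # low six bits per continuation byte
        n >>= 6
    return bytes([prefix | n] + tail[::-1])
```

For every number that standard UTF-8 can encode, this gives the same bytes,
for example encode_extended(0xE9) == "\u00e9".encode("utf-8"), which is the
sense in which UTF-8 would be a subset.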
Hans Aberg