From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 08:19:12 CST
Hans Aberg wrote:
> The situation is the same as that the values > 0x7F are illegal in
> ASCII. When people made ASCII, they fantasized it was the end of it,
> and that the full 8 bits would never be used. At least Don Knuth says
> so. Now the Unicode people evidently wants people to pretend that the
> values > 0x10FFFF don't exist.
There are good reasons for both. I won't go into why ASCII was 7 bit. But
the 0x10FFFF limitation is there because UTF-16 can't handle more. Indeed,
UTF-16 is the least fortunate of the UTFs, but it won't go away any time
soon.
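To see why the limit sits exactly at 0x10FFFF: a UTF-16 surrogate pair carries 10 + 10 = 20 bits above U+10000, so the highest encodable value is 0x10000 + 0xFFFFF = 0x10FFFF. A small sketch (function name is mine):

```python
def to_surrogate_pair(cp):
    """Encode a code point above U+FFFF as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (v >> 10)        # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

U+10FFFF maps to the pair D8FF..DFFF's very last slot (0xDBFF, 0xDFFF); one more code point and the 20 bits overflow.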
I would say that the current limit is enough for many years. By the time we
run out, not only will UTF-16 be gone, but perhaps also UTF-8. Text will be
a drop in the ocean and will be transmitted and stored in UTF-128 :)
If by any chance UTF-8 survives, it can be extended, probably more or less
the way it was proposed. But it doesn't need to be extended today.
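For the record, the original (pre-RFC 3629) UTF-8 pattern already ran to 6 bytes and 31 bits; the scheme extends mechanically. A sketch of that older pattern (values above 0x10FFFF are illegal in UTF-8 as defined today, this only shows the mechanics):

```python
def utf8_extended(cp):
    """Encode cp (< 2**31) using the original 1..6-byte UTF-8 pattern."""
    if cp < 0x80:
        return bytes([cp])
    for nbytes, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if cp < limit:
            break
    lead_mask = (0xFF00 >> nbytes) & 0xFF   # 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))      # continuation bytes, 6 bits each
        cp >>= 6
    out.append(lead_mask | cp)              # lead byte carries the rest
    return bytes(reversed(out))

print(utf8_extended(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

For code points up to U+10FFFF this produces exactly the same bytes as standard UTF-8; beyond that it simply keeps following the same lead-byte/continuation pattern.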
I am not sure what lexers are, but I gather you want to convert all
Unicode data to UTF-8 and process it in UTF-8, possibly even processing any
8-bit stream directly.
This is a good approach. In your case it came naturally since it simplifies
whatever you are doing. But it is a very good approach in general. You would
have far more problems if you wanted to convert everything to UTF-16 or
UTF-32. Then you would face the problem of invalid sequences, which Unicode
says is not its problem. Unfortunately, invalid sequences cannot be handled
efficiently and unambiguously without dedicating new codepoints. So it
cannot be done without cooperation from Unicode. You, on the other hand, can
use an extended definition of UTF-32 to UTF-8 conversion if you choose so
and need no approval from Unicode. All you need is to be careful to select
the best algorithm. I think at least three variants already emerged. But
even that decision will probably not be crucial. Persuading Unicode to
recognise your algorithm is doomed to fail. Unicode has no need to define it
and will not define it until there is a need for it. Until then, they will
observe what is going on and learn from your and other people's mistakes.
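One such variant of the round-trip idea, which Python later standardized as the 'surrogateescape' error handler, maps each byte of an invalid sequence to a dedicated code point in U+DC80..U+DCFF so the original byte stream can always be reconstructed. A sketch (the sample bytes are mine):

```python
# 0xFF and 0xFE can never appear in well-formed UTF-8.
raw = b'valid \xe2\x82\xac then bad \xff\xfe bytes'

# Decode leniently: invalid bytes become lone surrogates U+DC80..U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')
assert '\udcff' in text           # the invalid 0xFF survived as U+DCFF

# Encoding with the same handler restores the exact original bytes.
back = text.encode('utf-8', 'surrogateescape')
assert back == raw                # lossless round trip
```

The cost is exactly the one described above: it only works because a range of codepoints is reserved, by convention, for carrying the invalid bytes.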
Not that I completely agree with that attitude. In some cases, yes, as in
your case and the case of the UTF-8 BOM. But there are other cases where it
would indeed be useful if Unicode sometimes addressed issues that fall
slightly outside its domain: handling invalid sequences, actions on invalid
sequences or invalid characters and noncharacters, transformation of such
characters where a transformation can be done, and so on. Specifically, I
think transforming an unpaired surrogate should be defined. On the other
hand, I think it is a bit early to define a transformation for a 32-bit value.
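To illustrate why a defined transformation would matter: strict UTF-8 codecs simply refuse an unpaired surrogate, so without an agreed mapping the data is lost rather than carried through. For example:

```python
# A lone high surrogate with no trailing partner.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as e:
    # Strict UTF-8 rejects it outright; there is no standard fallback.
    print('lone surrogate rejected:', e.reason)
```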
Lars
This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 08:20:19 CST