From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 15 2004 - 05:58:49 CST
Marcin 'Qrczak' Kowalczyk wrote:
> But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
> awkward way which would happen to exclude those subsequences of
> non-characters which would form a valid UTF-8 fragment.
NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16 was never a goal. Nor was UTF-16 ->
NOT-UTF-8 -> UTF-16, or NOT-UTF-16 -> UTF-8 -> NOT-UTF-16.
UTF-16 -> UTF-8 -> UTF-16 is preserved and that keeps the goals of UTF
intact.
The goal, BTW, is: NOT-UTF-8 -> UTF-16 -> NOT-UTF-8.
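To make the direction of that guarantee concrete, here is a minimal sketch. It does not use the escape scheme discussed in this thread. It swaps in Python's built-in `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF, and it lets a Python `str` stand in for the UTF-16 side. The helper names `to_text` and `to_bytes` are made up for illustration.

```python
def to_text(raw: bytes) -> str:
    # Valid UTF-8 decodes normally; each invalid byte 0xNN becomes U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # Inverse mapping: each escape code point U+DCNN becomes the raw byte 0xNN.
    return text.encode("utf-8", errors="surrogateescape")


# Goal direction: arbitrary bytes -> text -> the same bytes.
raw = b"caf\xe9 \xc3\xa9"  # a stray Latin-1 byte, then valid UTF-8 for U+00E9
assert to_bytes(to_text(raw)) == raw

# Non-goal direction: text -> bytes -> text can change the text, because
# two escape code points may spell out a valid UTF-8 sequence.
escaped = "\udcc3\udca9"
assert to_bytes(escaped) == b"\xc3\xa9"
assert to_text(to_bytes(escaped)) != escaped
```

The second half is the case Marcin describes: the escaped text does not survive the trip through bytes, and under this scheme nothing requires it to.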
> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings? Argument for
> allowing them: otherwise they are completely useless, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.
My opinion:
It should allow them and process them usefully. Furthermore, what 'usefully'
means should not be left for developers to discover. It should be researched,
described and, in the end, perhaps even standardized. IMHO, the UTC should
consider leading this process, even if it does not end with anything
standardized in the Unicode Standard.
Validation should be completely separated from processing. IMHO.
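One way to picture that separation, as a rough sketch only: a processing function that passes noncharacters through untouched, and an independent check that a caller may run when it needs to, for example before interchange. The names `is_noncharacter`, `find_noncharacters` and `shout` are hypothetical. The ranges are the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.

```python
from typing import List


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> List[int]:
    # Validation: report offending positions, but do not alter the text.
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


def shout(text: str) -> str:
    # Processing: works on any string and never rejects its input.
    return text.upper()


sample = "a\uffffb"
assert shout(sample) == "A\uffffB"        # processing succeeds regardless
assert find_noncharacters(sample) == [1]  # validation is a separate question
```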
Lars