From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 11:34:19 CST
Kenneth Whistler wrote:
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.
Technically, I am not asking for anything. I am just trying to discuss an
approach that I think can be used to solve certain problems. This approach
does not need to be conformant at this point. If someone finds it suitable
to make it conformant, even better, but for now that is irrelevant to the
discussion, unless it is proven that it cannot be made conformant (by
changing or amending the standard) because I have missed an important fact.
So far, I have not seen such a proof.
But suppose I were asking, and therefore proposing. It would come down to
several separate items:
1 - To assign codepoints for 128 (or 256) new surrogates(*), used for:
1.1 - Representing unassigned values when converting from an encoding to
Unicode (optional).
1.2 - Representing invalid sequences when interpreting UTF-8 (optional).
The use of these would not be mandatory. Existing handling is still an
option and can be preserved wherever it suits the needs, or changed where
the new behavior is beneficial.
The representation of these codepoints in UTF-8 would be as per the current
standard.
2 - An alternative conversion from Unicode to, say, UTF-8E (UTF-8E is _NOT_
Unicode(*)).
This conversion would reconstruct the original byte sequence from a Unicode
string obtained by 1.2. Together with 1.2, it forms a conversion pair that is
intended for use at platform or interface boundaries, if and where it is
determined to be suitable. For example, interfacing a UNIX filesystem with a
UTF-8 pipe would require UTF-8E<=>UTF-8 conversion. Interfacing a UNIX
filesystem with a Windows filesystem would require UTF-8E<=>UTF-16
conversion.
(*) If proposal #2 is not accepted, then the codepoints in proposal #1
would not actually be surrogates, but simply codepoints and nothing else.
Even if proposal #2 is accepted, it is still not clear whether they should
really be called surrogates, since they would convert among all UTFs just
as any other codepoint does, and only their representation in UTF-8E would
differ. Note that UTF-8E is not Unicode, but would be standardized in
Unicode. If the U in UTF is a problem, then any other name can be chosen.
Consider it a working name and be aware of what it is and is not.
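To make items 1.2 and 2 concrete, here is a minimal Python sketch of the
round trip. It is only an illustration under stated assumptions: the block
location ESCAPE_BASE (a Private Use Area range picked as a stand-in, since no
codepoints have been assigned), the error handler name "escape128", and the
function names decode_utf8_escaping and encode_utf8e are all hypothetical and
not part of any standard.

```python
import codecs

# Hypothetical location of the 128 escape codepoints. Nothing has been
# assigned; a BMP Private Use Area range is used here purely as a stand-in.
ESCAPE_BASE = 0xEE00
ESCAPE_END = ESCAPE_BASE + 128  # exclusive upper bound


def _escape_invalid_bytes(err):
    """Decode error handler: map each invalid byte to one escape codepoint."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes below 0x80 are valid ASCII on their own, so they should never be
    # reported here; refuse to map them rather than guess.
    if any(b < 0x80 for b in bad):
        raise err
    escaped = "".join(chr(ESCAPE_BASE + (b - 0x80)) for b in bad)
    return escaped, err.end  # resume decoding right after the bad bytes


codecs.register_error("escape128", _escape_invalid_bytes)


def decode_utf8_escaping(raw: bytes) -> str:
    """Item 1.2: interpret bytes as UTF-8, keeping invalid bytes as escapes."""
    return raw.decode("utf-8", errors="escape128")


def encode_utf8e(text: str) -> bytes:
    """Item 2: like UTF-8, except escape codepoints become the original byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp < ESCAPE_END:
            out.append(cp - ESCAPE_BASE + 0x80)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A Latin-1 filename that is not valid UTF-8 survives the round trip.
raw_name = b"caf\xe9.txt"
text = decode_utf8_escaping(raw_name)
assert encode_utf8e(text) == raw_name
```

Encoding the same string with the ordinary UTF-8 encoder treats the escape
codepoints like any other codepoint, as item 1 requires; only encode_utf8e
turns them back into raw bytes.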
3 - If the UTC cannot agree that the BMP should be used for proposal #1, I
would advise against a decision to assign non-BMP codepoints for the purpose.
I believe less damage would be done by postponing the decision than by making
a wrong decision. It is not just about how much disk space or bandwidth is
used. For example, if both filesystems have a 256-character limit for a
filename, the limits are consistent (at least in one direction) if the BMP is
used, and not if any other plane is used.
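The filename-limit point can be checked with a little arithmetic. The sketch
below assumes that the UNIX limit is counted in bytes and the Windows limit in
UTF-16 code units, which is my reading of the example rather than something
stated above. BMP_BASE and SUPPLEMENTARY_BASE are hypothetical stand-in
locations (both in Private Use Areas), and the helper names are made up for
the illustration.

```python
# Hypothetical block locations, used only to compare code unit counts.
BMP_BASE = 0xEE00              # stand-in block inside the BMP
SUPPLEMENTARY_BASE = 0xF0000   # stand-in block in a supplementary plane


def escaped_name(raw: bytes, base: int) -> str:
    """Map every byte (all assumed invalid, i.e. >= 0x80) to one codepoint."""
    return "".join(chr(base + (b - 0x80)) for b in raw)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units the string needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# Worst case: a 256-byte name in which no byte is valid UTF-8.
worst_case = bytes([0xE9]) * 256

print(utf16_units(escaped_name(worst_case, BMP_BASE)))            # 256
print(utf16_units(escaped_name(worst_case, SUPPLEMENTARY_BASE)))  # 512
```

With a BMP block, a name that fits on the byte side still fits on the UTF-16
side. With a supplementary-plane block, each escaped byte needs a surrogate
pair, so the same name can need twice the limit.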
4 - If neither of the proposals is accepted, it would be beneficial if the
UTC managed to keep at least one suitable block of 256 codepoints (for
example U+A4xx or U+ABxx) intact to facilitate a future decision.
Lars Kristan