From: Arcane Jill (arcanejill@ramonsky.com)
Date: Wed Dec 15 2004 - 05:11:50 CST
I followed (and understood) Lars's explanation as to why the NOT-xxxx
solution wouldn't work for him. Shame really - but here's another bash at a
solution, again without breaking the Unicode model. If I have understood
this correctly, these are Lars' requirements:
1) There exists a function, f(), which maps an arbitrary octet stream to a
sequence of Unicode characters
2) A required property of f() is that, if any substring of its input is
valid UTF-8, then f() must convert that substring to the sequence of Unicode
characters which would have been obtained by UTF-8 itself.
3) There exists an inverse function, g(), such that g(a) == b if and only if
f(b) == a.
As Unicoders have pointed out, these goals appear to be mutually
contradictory, unless we assume the following corollary, which I shall call
"requirement 4".
4) A second required property of f() is that, if any octet of its input is
not part of a valid UTF-8 substring, then f() must convert that octet to a
Unicode character string /which cannot possibly appear in Unicode plain
text/.
It is for reasons of requirement (4) that Lars proposes the introduction of
128 BMP codepoints. His intention is that they be marked as "reserved - do
not use", so that requirement 4 is met. Naturally, this proposal has met
with a lot of resistance, and almost certainly would never get approved by
the UC. Therefore, I propose an alternative solution, as follows:
DEFINITION - "f" is a function which maps an arbitrary octet stream to a
sequence of Unicode characters, such that (1) any substring which happens to
be valid UTF-8 is mapped to the sequence of Unicode characters which would
have been produced by UTF-8, and (2) all remaining single octets, xx (with xx
necessarily such that 0x80 <= xx <= 0xFF) are each mapped to the sequence:
{ U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09,
U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } (I got those
numbers from a true random number generator).
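A minimal Python sketch of f as defined above, for illustration only. It assumes a greedy left-to-right scan: at each offset it tries to decode one well-formed UTF-8 sequence, and if that fails it escapes just that one octet and advances by one. The names ESCAPE_PREFIX, _sequence_length and f are chosen here and do not come from any library.

```python
# The twelve fixed code points from the definition of f.
ESCAPE_PREFIX = "".join(chr(cp) for cp in (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
))


def _sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead octet announces, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def f(octets: bytes) -> str:
    """Map an arbitrary octet stream to Unicode, escaping octets that are not valid UTF-8."""
    out = []
    i = 0
    while i < len(octets):
        n = _sequence_length(octets[i])
        chunk = octets[i:i + n]
        if n and len(chunk) == n:
            try:
                # Strict decoding rejects overlong forms, surrogates and bad continuations.
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass
        # Assumption: escape exactly one octet, then rescan from the next one.
        out.append(ESCAPE_PREFIX + chr(octets[i]))
        i += 1
    return "".join(out)
```

For example, f(b"a\xffb") gives "a", then the twelve prefix characters followed by U+00FF, then "b": 15 code points in all.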
OBSERVATION - Requirement (4) is not met absolutely, however, the
probability of the UTF-8 encoding of this sequence occurring "accidentally" at
an arbitrary offset in an arbitrary octet stream is approximately one in
2^384; the probability of its occurring in /plain text/ is even smaller. This
means that if your application were capable of processing one terabyte of
data per second, you would expect to encounter this sequence by accident
once every 2^319 years or so. (For reference, the Universe is somewhere around
2^34 years old). This means that requirement 4 is "effectively met", even if
not actually met.
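The arithmetic behind this estimate can be rechecked in a few lines of Python. This is a rough sketch under stated assumptions: the octets are uniformly random, one terabyte means 10^12 octets (one candidate offset per octet), a year is 365.25 days, and the Universe is taken to be about 1.38 x 10^10 years old (that last figure is an assumption added here, not something from the proposal). Under those assumptions the fixed prefix is 384 bits long, the waiting time comes out near 2^319 years, and the age of the Universe near 2^34 years.

```python
import math

# The twelve fixed code points of the escape sequence.
prefix = "".join(chr(cp) for cp in (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
))

# Each code point is above U+FFFF, so it takes 4 octets in UTF-8.
prefix_bits = 8 * len(prefix.encode("utf-8"))

# Assumptions: 10**12 candidate offsets per second, 365.25-day years.
offsets_per_year = 10**12 * 365.25 * 24 * 3600

# Expected wait, as a power of two, before a random match of the prefix.
log2_years_to_collision = prefix_bits - math.log2(offsets_per_year)

# Assumption: age of the Universe of roughly 1.38e10 years.
log2_universe_age = math.log2(1.38e10)

print(prefix_bits, round(log2_years_to_collision), round(log2_universe_age))
```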
DEFINITION - "g" is the inverse function of f. By the observation above, f
is effectively injective (though not bijective), so in the event of ambiguity, the sequence {
U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09,
U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } is /always/
assumed to map to the single octet xx. The probability of this choice being
wrong is as stated above.
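A matching Python sketch of g, again only illustrative. It assumes g walks the text one code point at a time, treats the twelve-character prefix followed by a character in U+0080..U+00FF as an escaped octet, and re-encodes everything else as ordinary UTF-8. The names ESCAPE_PREFIX and g are chosen here for the example. The usage lines also pass the escaped text through UTF-16 first, to show that an application which knows nothing about the escape can carry it unchanged.

```python
# The twelve fixed code points from the definition of f.
ESCAPE_PREFIX = "".join(chr(cp) for cp in (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
))


def g(text: str) -> bytes:
    """Undo the escaping: prefix + U+00xx becomes the single octet xx."""
    out = bytearray()
    n = len(ESCAPE_PREFIX)
    i = 0
    while i < len(text):
        tail = text[i + n:i + n + 1]  # the character right after a possible prefix
        if text.startswith(ESCAPE_PREFIX, i) and tail and 0x80 <= ord(tail) <= 0xFF:
            # Always read this as an escaped octet, as the definition requires.
            out.append(ord(tail))
            i += n + 1
        else:
            # Ordinary character: plain UTF-8 (f never produces lone surrogates).
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


# Escaped form of the octets b"a\xffb", built by hand for the example.
escaped = "a" + ESCAPE_PREFIX + "\u00ff" + "b"

# A round trip through UTF-16 by an "unaware" application leaves it intact.
via_utf16 = escaped.encode("utf-16-le").decode("utf-16-le")
original = g(via_utf16)
```

The one case g gets wrong is an input octet stream that already contained the UTF-8 encoding of the prefix followed by U+00xx; that is exactly the collision whose probability is discussed above.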
Now everything will work. Unicode is not broken. All UTFs are
interchangeable as before; Lars's "escape aware" applications can use the
functions f() and g() instead of UTF-8 transformations; all other Unicode
applications will retain Lars's data uncorrupted, and he can "unescape" it
(that is, apply function g()) at the appropriate time to recover the
original data.
That do?
Jill