RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 05:30:54 CST


    > Ken is absolutely right. It would be theoretically possible to add 128
    > code points that would allow one to roundtrip a bytestream after passing
    > through a UTF-8 <=> UTF-32 conversion. (For that matter, it would be
    > possible to add 2048 code points that would allow the same for a 16-bit
    > data stream.)
    You don't really need to add anything for 16-bit <=> UTF-32. There is no
    real-life need to have that roundtrip guaranteed. For 8-bit data there is a
    real-life need. And even for 16-bit <=> UTF-32, you can do it simply by
    defining how surrogates should be processed. I'm not saying it should be
    done, just showing that it could be done. But for UTF-8 <=> UTF-32 it cannot
    be done without 128 new codepoints, which is why I often compare these 128
    codepoints to the surrogates. With one difference: they should be valid
    characters.
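
A minimal sketch of the 8-bit roundtrip discussed above, in Python. It does not use 128 newly assigned code points (none exist). As a stand-in it assumes Python's built-in `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF, so unlike the proposal here the escapes are not valid characters. The function names are hypothetical.

```python
def bytes_to_text(raw: bytes) -> str:
    # Valid UTF-8 decodes normally; every byte that is not part of a
    # well-formed sequence becomes one escape code point (U+DC80..U+DCFF).
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    # Ordinary characters are encoded as UTF-8; each escape code point
    # is turned back into the single original byte.
    return text.encode("utf-8", errors="surrogateescape")


# Well-formed UTF-8 followed by two "odd" bytes that are not valid UTF-8.
raw = "caf\u00e9 ".encode("utf-8") + b"\xe9\xff"

text = bytes_to_text(raw)
escapes = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
restored = text_to_bytes(text)

print(escapes)          # the code points standing in for the odd bytes
print(restored == raw)  # whether the byte stream survived the roundtrip
```

Exactly 128 escape values are needed because only bytes 0x80..0xFF can ever be invalid in UTF-8; bytes below 0x80 always decode as ASCII.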

    > However, these new code points would really be no better than private
    > use code points, since their interpretation would depend entirely
    Oh yes they would be better. Anyone might be using those same codepoints in
    the PUA for something completely different.

    > on whatever was assumed to be the interpretation of the original
    > bytestream. If X converted a bytestream that was assumed to be a mixture
    > of 8859-7 with UTF-8 into Unicode with these new characters, and handed
    > it off to Y, who converted the bytestream back assuming that the odd
    > bytes were to be iso-8859-9, you would get data corruption. X and Y
    > would have
    Nope. No data corruption. You just get the odd bytes back, and achieve
    exactly the same result as if X had passed the data directly to Y. Y doesn't
    convert from UTF-8 to iso-8859-9, nor does it convert the odd bytes to
    iso-8859-9. It converts UTF-8 back to the original byte stream and ONLY THEN
    interprets it as iso-8859-9. So the result is the same as if it had got the
    data directly.
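
The X-to-Y handoff described above can be sketched in Python as follows. This is an illustration under assumptions, not the proposed mechanism itself: Python's `surrogateescape` error handler stands in for the 128 escape code points, and the sample bytes and variable names are made up for the example.

```python
# Bytes as they exist on disk: some ASCII plus three bytes that are not
# valid UTF-8. Whoever wrote them had some 8-bit charset in mind.
original = b"abc " + b"\xf0\xfd\xfe"

# X: converts the byte stream to Unicode, escaping the odd bytes,
# and hands the resulting string to Y.
handed_to_y = original.decode("utf-8", errors="surrogateescape")

# Y, step 1: converts the Unicode string back to the original bytes.
# No charset assumption about the odd bytes is involved here.
recovered = handed_to_y.encode("utf-8", errors="surrogateescape")

# Y, step 2: only now interprets the recovered bytes as iso-8859-9.
via_unicode = recovered.decode("iso-8859-9")

# Reference path: Y receives the bytes directly from X.
direct = original.decode("iso-8859-9")

print(recovered == original)  # were the bytes preserved?
print(via_unicode == direct)  # does Y see the same text either way?
```

Whether iso-8859-9 is the right interpretation of those bytes is a separate question. The sketch only shows that the detour through Unicode does not change what Y ends up with.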

    Lars


