Re: Roundtripping in Unicode

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Dec 15 2004 - 12:04:18 CST


    > Nope. No data corruption. You just get the odd bytes back. And achieve

    I see more of what you are trying to do; let me try to be more clear.
    Suppose that the conversion is defined in the following way, between Unicode
    strings (D29a-d, page 74) and UTFs using your proposed new characters, for
    now with private use code points U+E080..U+E0FF.

    U8-UTF32. To convert a Unicode 8-bit string to UTF-32:
    1. Set the pointer to the start
    2. If the sequence starting at the pointer is a valid UTF-8 sequence
    (checking of course to make sure it doesn't go off the end of the string),
    convert it and emit.
    3. Otherwise take the byte B at the pointer, and emit [E000 + B].
        - Note that because all single bytes 00..7F are valid UTF-8, #3
          is never invoked on anything but 80..FF.
    4. Advance the pointer past what was used and repeat until done
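
The U8-UTF32 steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `u8_to_utf32` is hypothetical, the UTF-32 result is modelled as a list of integer code points, and "valid UTF-8 sequence" is taken to mean whatever Python's strict UTF-8 decoder accepts.

```python
def u8_to_utf32(data: bytes) -> list[int]:
    """Convert an 8-bit string to a list of code points (hypothetical helper).

    Valid UTF-8 sequences are decoded normally; any other byte B
    is emitted as the private use code point E000 + B.
    """
    out = []
    i = 0
    while i < len(data):
        # Step 2: look for a valid UTF-8 sequence of 1 to 4 bytes at the pointer.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(ord(ch))
            i += n  # Step 4: advance past the sequence
            break
        else:
            # Step 3: no valid sequence starts here; escape the single byte.
            out.append(0xE000 + data[i])
            i += 1  # Step 4: advance past the byte
    return out
```

For instance, `u8_to_utf32(b"\xC2\xA0\xC2")` would give `[0xA0, 0xE0C2]`: the first two bytes form a valid sequence, and the trailing C2 has no continuation byte, so it is escaped.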

    UTF32-U8. To convert a UTF-32 string to a Unicode 8-bit string:
    1. Set the pointer to the start
    2. If the code point C at the pointer is from E080 to E0FF, emit a single
    byte, [C - E000]
    3. Otherwise convert to the UTF-8 sequence and emit.
    4. Advance the pointer past what was used and repeat until done
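
A corresponding sketch of the UTF32-U8 direction, again only as an illustration: the name `utf32_to_u8` is hypothetical, and the UTF-32 input is modelled as an iterable of integer code points.

```python
def utf32_to_u8(code_points) -> bytes:
    """Convert code points to an 8-bit string (hypothetical helper).

    Code points E080..E0FF are emitted as the single byte C - E000;
    everything else is emitted as its ordinary UTF-8 sequence.
    """
    out = bytearray()
    for c in code_points:
        if 0xE080 <= c <= 0xE0FF:
            # Step 2: unescape to the original odd byte.
            out.append(c - 0xE000)
        else:
            # Step 3: normal UTF-8 encoding (raises for surrogates,
            # which have no UTF-8 form).
            out += chr(c).encode("utf-8")
    return bytes(out)
```

Applied to the code points A0, E0C2, E0A0, this sketch would produce the four bytes C2 A0 C2 A0.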

    Any byte string would roundtrip when applying U8-UTF32 then UTF32-U8.
    However, the reverse would not be true; UTF-32 strings would not
    roundtrip through U8. For example,

    start with UTF32: 000000A0 0000E0C2 0000E0A0
    applying UTF32-U8, goes to: C2 A0 C2 A0
    applying U8-UTF32, goes to: 000000A0 000000A0
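
The same asymmetry can be observed with an analogous, existing mechanism. Note that this is not the scheme proposed in this thread: Python's built-in `surrogateescape` error handler escapes undecodable bytes 80..FF as lone surrogates U+DC80..U+DCFF rather than as E080..E0FF. It is shown here only as a sketch of the same behaviour.

```python
# Direction 1: arbitrary bytes -> text -> bytes roundtrips.
raw = bytes([0x41, 0xC2, 0xFF, 0xC2, 0xA0])
text = raw.decode("utf-8", errors="surrogateescape")
roundtripped = text.encode("utf-8", errors="surrogateescape")

# Direction 2: text -> bytes -> text does not always roundtrip.
# U+00A0 followed by the escapes for bytes C2 and A0:
start = "\u00a0\udcc2\udca0"
as_bytes = start.encode("utf-8", errors="surrogateescape")
back = as_bytes.decode("utf-8", errors="surrogateescape")

print(roundtripped == raw)       # True
print(as_bytes.hex(" ").upper()) # C2 A0 C2 A0
print(back == start)             # False: back is U+00A0 U+00A0
```

The two escaped bytes combine into a valid UTF-8 sequence on the way back, just as in the example above.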

    Of course, a UTF32-UTF8 transformation would preserve these code points:

         000000A0 0000E0C2 0000E0A0 <=> C2 A0 EE 83 82 EE 82 A0

    so it would behave differently than the UTF32-U8 conversion.
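
The standard UTF-8 mapping quoted above can be checked with Python's built-in codec. This is just a quick verification sketch; the variable names are illustrative.

```python
code_points = [0x00A0, 0xE0C2, 0xE0A0]

# Ordinary UTF-8 encoding: private use code points get 3-byte sequences.
utf8 = "".join(chr(c) for c in code_points).encode("utf-8")
print(utf8.hex(" ").upper())  # C2 A0 EE 83 82 EE 82 A0

# Ordinary UTF-8 decoding recovers the same code points.
decoded = [ord(ch) for ch in utf8.decode("utf-8")]
print(decoded == code_points)  # True
```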

    Of course, one could apply this process between the Unicode 8-bit strings
    and UTFs of other widths. And the same thing applies; one direction would
    roundtrip and the other wouldn't.

    start with UTF8: C2 A0 EE 83 82 EE 82 A0
    applying UTF8-U8, goes to: C2 A0 C2 A0
    applying U8-UTF8, goes to: C2 A0 C2 A0

    (I realize that some of this may duplicate what others have said -- I
    haven't had the time to follow this thread in any detail.)

    Mark

    ----- Original Message -----
    From: Lars Kristan
    To: 'Mark Davis' ; Kenneth Whistler
    Cc: unicode@unicode.org
    Sent: Tuesday, December 14, 2004 03:30
    Subject: RE: Roundtripping in Unicode

    > Ken is absolutely right. It would be theoretically possible to add 128 code
    > points that would allow one to roundtrip a bytestream after passing through
    > a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
    > 2048 code points that would allow the same for a 16-bit data stream.)
    You don't really need to add anything for 16-bit <=> UTF-32. There is no
    real-life need to have that roundtrip guaranteed. For 8-bit data there is
    real-life need. And even for 16-bit <=> UTF-32 you can do it simply by
    defining how surrogates should be processed. Not saying it should be done,
    but showing it could be done. But for UTF-8 <=> UTF-32 it cannot be done
    without 128 new codepoints. Which is why I am often comparing these 128
    codepoints to the surrogates. With one difference, they should be valid
    characters.
    >
    > However, these new code points would really be no better than private use
    > code points, since their interpretation would depend entirely
    Oh yes they would. Anyone might be using those same codepoints in PUA for
    something completely different.
    > on whatever was assumed to be the interpretation of the original
    > bytestream. If X converted a bytestream that was assumed to be a mixture
    > of 8859-7 with UTF-8 into Unicode with these new characters, and handed it
    > off to Y, who converted the bytestream back assuming that the odd bytes
    > were to be iso-8859-9, you would get data corruption. X and Y would have
    Nope. No data corruption. You just get the odd bytes back. And achieve
    exactly the same as if X passed the data directly to Y. Y doesn't convert
    from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9.
    It converts UTF-8 to the original byte stream and ONLY THEN interprets it
    as iso-8859-9. So, the same as if it got the data directly.

    Lars


