Re: Roundtripping in Unicode

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Dec 15 2004 - 12:04:18 CST

Next message: Peter Kirk: "Re: Roundtripping Solved"

Previous message: Mike Ayers: "RE: Roundtripping in Unicode"
In reply to: Lars Kristan: "RE: Roundtripping in Unicode"
Next in thread: Arcane Jill: "RE: Roundtripping in Unicode"

> Nope. No data corruption. You just get the odd bytes back. And achieve

I see more of what you are trying to do; let me try to be more clear.
Suppose that the conversion is defined in the following way, between Unicode
strings (D29a-d, page 74) and UTFs using your proposed new characters, for
now with private use code points U+E080..U+E0FF.

U8-UTF32. To convert a Unicode 8-bit string to UTF-32:
1. Set the pointer to the start
2. If the sequence starting at the pointer is a valid UTF-8 sequence
(checking of course to make sure it doesn't go off the end of the string),
convert it and emit.
3. Otherwise take the byte B at the pointer, and emit [E000 + B].

- Note that because single bytes 00..7F are all valid UTF-8, #3
doesn't get invoked on anything but 80..FF.
4. Advance the pointer past what was used and repeat until done
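
The U8-UTF32 steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the names `ESCAPE_BASE`, `valid_utf8_length` and `u8_to_utf32` are made up for the sketch, "valid UTF-8 sequence" is taken to mean whatever Python's strict decoder accepts, and the result is a list of code points rather than serialized UTF-32.

```python
ESCAPE_BASE = 0xE000  # stand-in base from the message: byte B maps to U+E000 + B


def valid_utf8_length(data: bytes, i: int) -> int:
    """Return the length (1..4) of the valid UTF-8 sequence starting at
    index i, or 0 if none starts there. Hypothetical helper for step 2."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:  # would run off the end of the string
            break
        try:
            chunk.decode("utf-8")  # strict: rejects overlongs and surrogates
        except UnicodeDecodeError:
            continue
        return n
    return 0


def u8_to_utf32(data: bytes) -> list[int]:
    """U8-UTF32: convert an arbitrary 8-bit string to a list of code points."""
    out = []
    i = 0  # step 1: pointer at the start
    while i < len(data):
        n = valid_utf8_length(data, i)
        if n:
            # step 2: valid sequence, convert it and emit
            out.append(ord(data[i:i + n].decode("utf-8")))
        else:
            # step 3: emit E000 + B for the single byte at the pointer
            out.append(ESCAPE_BASE + data[i])
            n = 1
        i += n  # step 4: advance past what was used
    return out
```

For instance, `u8_to_utf32(b"A\xff")` gives `[0x41, 0xE0FF]`, since FF can never start a valid UTF-8 sequence.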

UTF32-U8. To convert a UTF-32 string to a Unicode 8-bit string:
1. Set the pointer to the start
2. If the code point C at the pointer is from E080 to E0FF, emit a single
byte, [C - E000]
3. Otherwise convert to the UTF-8 sequence and emit.
4. Advance the pointer past what was used and repeat until done
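
A matching sketch of the UTF32-U8 direction, again in Python and again only illustrative. The function name `utf32_to_u8` and the constant names are assumptions of the sketch, and the input is taken to be a sequence of integer code points rather than serialized UTF-32.

```python
ESCAPE_FIRST = 0xE080  # first stand-in code point from the message
ESCAPE_LAST = 0xE0FF   # last stand-in code point from the message
ESCAPE_BASE = 0xE000


def utf32_to_u8(code_points) -> bytes:
    """UTF32-U8: convert a sequence of code points to an 8-bit string."""
    out = bytearray()
    for c in code_points:  # steps 1 and 4: walk the string front to back
        if ESCAPE_FIRST <= c <= ESCAPE_LAST:
            # step 2: emit the single byte C - E000 (always in 80..FF)
            out.append(c - ESCAPE_BASE)
        else:
            # step 3: ordinary UTF-8; Python raises UnicodeEncodeError
            # here if c is a surrogate code point
            out += chr(c).encode("utf-8")
    return bytes(out)
```

Note that U+E000..U+E07F are not special in this scheme: `utf32_to_u8([0xE07F])` is the ordinary three-byte sequence EE 81 BF.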

Taking any byte string, it would roundtrip when applying U8-UTF32 then
UTF32-U8. However, the reverse would not be true; UTF-32 strings would not
roundtrip through U8. For example,

start with UTF32: 000000A0 0000E0C2 0000E0A0
applying UTF32-U8, goes to: C2 A0 C2 A0
applying U8-UTF32, goes to: 000000A0 000000A0

Of course, a standard UTF32-UTF8 transformation would preserve these code points:

000000A0 0000E0C2 0000E0A0 <=> C2 A0 EE 83 82 EE 82 A0

so it would behave differently than the UTF32-U8 conversion.
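
The two examples above can be checked mechanically. The following self-contained Python sketch re-implements both conversions compactly and replays the byte values from the message; the function names are hypothetical, and Python's strict UTF-8 codec stands in for "valid UTF-8".

```python
ESCAPE_BASE = 0xE000  # stand-in base from the message


def u8_to_utf32(data):
    """U8-UTF32: valid UTF-8 decodes normally; any other byte B -> E000 + B."""
    out, i = [], 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ord(ch))  # shortest valid prefix is exactly one character
            i += n
            break
        else:
            out.append(ESCAPE_BASE + data[i])  # no valid sequence starts here
            i += 1
    return out


def utf32_to_u8(code_points):
    """UTF32-U8: E080..E0FF -> single byte; everything else -> UTF-8."""
    out = bytearray()
    for c in code_points:
        if 0xE080 <= c <= 0xE0FF:
            out.append(c - ESCAPE_BASE)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


def utf32_to_utf8(code_points):
    """Plain UTF-8 encoding, with no special handling of E080..E0FF."""
    return "".join(map(chr, code_points)).encode("utf-8")


start = [0xA0, 0xE0C2, 0xE0A0]

# UTF-32 does not roundtrip through U8: two escapes fuse into one character.
as_u8 = utf32_to_u8(start)
assert as_u8 == bytes.fromhex("C2A0C2A0")
assert u8_to_utf32(as_u8) == [0xA0, 0xA0]

# Plain UTF-8 keeps the three code points apart.
assert utf32_to_utf8(start) == bytes.fromhex("C2A0EE8382EE82A0")

# A byte string with stray non-UTF-8 bytes does roundtrip.
raw = b"caf\xe9 \xff"
assert u8_to_utf32(raw) == [0x63, 0x61, 0x66, 0xE0E9, 0x20, 0xE0FF]
assert utf32_to_u8(u8_to_utf32(raw)) == raw
```

One thing the sketch makes visible, as an observation about the steps as written rather than something stated in the message: a byte string that already contains the UTF-8 form of one of the stand-in code points (EE 83 82, say) is accepted as valid in step 2 of U8-UTF32 and comes back from UTF32-U8 as the single byte C2.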

Of course, one could apply this process between the Unicode 8-bit strings and
UTFs of other widths. And the same thing applies; one direction would
roundtrip and the other wouldn't. For example, with UTF-8:

start with UTF8: C2 A0 EE 83 82 EE 82 A0
applying UTF8-U8, goes to: C2 A0 C2 A0
applying U8-UTF8, goes to: C2 A0 C2 A0
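
The UTF-8 variant can be sketched the same way, staying in bytes throughout. The names `utf8_to_u8` and `u8_to_utf8` are hypothetical, and `utf8_to_u8` assumes its input is well-formed UTF-8 (Python raises `UnicodeDecodeError` otherwise).

```python
ESCAPE_BASE = 0xE000  # stand-in base from the message


def utf8_to_u8(data: bytes) -> bytes:
    """UTF8-U8: collapse UTF-8 encoded E080..E0FF into single bytes."""
    out = bytearray()
    for ch in data.decode("utf-8"):  # assumes well-formed UTF-8 input
        cp = ord(ch)
        if 0xE080 <= cp <= 0xE0FF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def u8_to_utf8(data: bytes) -> bytes:
    """U8-UTF8: copy valid UTF-8 through; escape any other byte B as the
    UTF-8 encoding of U+E000 + B."""
    out = bytearray()
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            out += chunk  # shortest valid prefix: copy it unchanged
            i += n
            break
        else:
            out += chr(ESCAPE_BASE + data[i]).encode("utf-8")
            i += 1
    return bytes(out)


start = bytes.fromhex("C2A0EE8382EE82A0")
collapsed = utf8_to_u8(start)
assert collapsed == bytes.fromhex("C2A0C2A0")
assert u8_to_utf8(collapsed) == bytes.fromhex("C2A0C2A0")  # not equal to start
```

In the other direction, a lone C2 byte becomes EE 83 82 under `u8_to_utf8` and collapses back to C2 under `utf8_to_u8`.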

(I realize that some of this may duplicate what others have said -- I
haven't had the time to follow this thread in any detail.)

Mark

----- Original Message -----
From: Lars Kristan
To: 'Mark Davis' ; Kenneth Whistler
Cc: unicode@unicode.org
Sent: Tuesday, December 14, 2004 03:30
Subject: RE: Roundtripping in Unicode

> Ken is absolutely right. It would be theoretically possible to add 128 code
> points that would allow one to roundtrip a bytestream after passing through
> a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
> 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit <=> UTF-32. There is no
real-life need to have that roundtrip guaranteed. For 8-bit data there is
real-life need. And even for 16-bit <=> UTF-32 you can do it simply by
defining how surrogates should be processed. Not saying it should be done,
but showing it could be done. But for UTF-8 <=> UTF-32 it cannot be done
without 128 new codepoints. Which is why I am often comparing these 128
codepoints to the surrogates. With one difference, they should be valid
characters.
>
> However, these new code points would really be no better than private use
> code points, since their interpretation would depend entirely
Oh yes they would. Anyone might be using those same codepoints in PUA for
something completely different.
> on whatever was assumed to be the interpretation of the original bytestream.
> If X converted a bytestream that was assumed to be a mixture of 8859-7 with
> UTF-8 into Unicode with these new characters, and handed it off to Y, who
> converted the bytestream back assuming that the odd bytes were to be
> iso-8859-9, you would get data corruption. X and Y would have
Nope. No data corruption. You just get the odd bytes back. And achieve
exactly the same as if X passed the data directly to Y. Y doesn't convert
from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9.
It converts UTF-8 to the original byte stream and ONLY THEN interprets it
as iso-8859-9. So, the same as if it got the data directly.

Lars


This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 12:13:41 CST