RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 05:30:54 CST

> Ken is absolutely right. It would be theoretically possible to add 128 code
> points that would allow one to roundtrip a bytestream after passing through
> a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
> 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit <=> UTF-32. There is no
real-life need to have that roundtrip guaranteed. For 8-bit data there is a
real-life need. And even for 16-bit <=> UTF-32, you could do it simply by
defining how surrogates should be processed. I'm not saying it should be done,
just showing that it could be. But for UTF-8 <=> UTF-32 it cannot be done
without 128 new codepoints, which is why I often compare these 128
codepoints to the surrogates. With one difference: they should be valid
characters.
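
To illustrate the 16-bit case, here is a minimal Python sketch. It assumes Python's `surrogatepass` error handler as one possible definition of "how surrogates should be processed"; the function names are made up for this example, and nothing here is mandated by the Unicode Standard.

```python
import struct

def units_to_utf32(raw16: bytes) -> bytes:
    """16-bit little-endian units -> UTF-32LE, passing lone surrogates through."""
    text = raw16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-32-le", errors="surrogatepass")

def utf32_to_units(raw32: bytes) -> bytes:
    """UTF-32LE -> 16-bit little-endian units, the inverse of the above."""
    text = raw32.decode("utf-32-le", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")

# 'A', a lone high surrogate, 'B', then a valid pair (U+1F600).
units = [0x0041, 0xD800, 0x0042, 0xD83D, 0xDE00]
raw16 = struct.pack("<5H", *units)

raw32 = units_to_utf32(raw16)
code_points = list(struct.unpack("<4I", raw32))

# The lone surrogate survives as a single 32-bit value; the pair is combined.
assert code_points == [0x41, 0xD800, 0x42, 0x1F600]
# The original 16-bit stream comes back unchanged.
assert utf32_to_units(raw32) == raw16
```

No new codepoints are needed here because every 16-bit unit already has a code point value of its own. Bytes 0x80..0xFF in a broken UTF-8 stream have no such counterpart.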

>
> However, these new code points would really be no better than private use
> code points, since their interpretation would depend entirely
Oh yes, they would be better. Anyone might already be using those same PUA
codepoints for something completely different.

> on whatever was assumed to be the interpretation of the original bytestream.
> If X converted a bytestream that was assumed to be a mixture of 8859-7 with
> UTF-8 into Unicode with these new characters, and handed it off to Y, who
> converted the bytestream back assuming that the odd bytes were to be
> iso-8859-9, you would get data corruption. X and Y would have
Nope. No data corruption. You just get the odd bytes back, and achieve
exactly the same result as if X had passed the data directly to Y. Y doesn't
convert from UTF-8 to iso-8859-9, nor does it convert the odd bytes to
iso-8859-9. It converts UTF-8 back to the original byte stream and ONLY THEN
interprets it as iso-8859-9. So it is the same as if Y had got the data directly.
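
The X-to-Y handoff can be sketched in Python. This is only an illustration under an assumption: it uses Python's `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF. That stands in for the 128 codepoints discussed here but is not the same thing, since lone surrogates are not valid characters. The names `x_to_unicode` and `y_to_bytes` are made up for this example.

```python
def x_to_unicode(data: bytes) -> str:
    """X: decode as UTF-8, keeping each invalid byte as one escape code point."""
    return data.decode("utf-8", errors="surrogateescape")

def y_to_bytes(text: str) -> bytes:
    """Y: turn the escape code points back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 (Greek alpha beta gamma) mixed with two stray bytes that X
# believes are ISO 8859-7, followed by plain ASCII.
original = "αβγ".encode("utf-8") + b"\xe1\xe2" + b" ok"

text = x_to_unicode(original)
escaped = [c for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
assert [hex(ord(c)) for c in escaped] == ["0xdce1", "0xdce2"]

# Y first restores the byte stream, and only then applies its own assumption.
restored = y_to_bytes(text)
assert restored == original
assert restored.decode("iso-8859-9") == original.decode("iso-8859-9")
```

Whatever X believed about the odd bytes never enters the Unicode text. Only the byte values do, so Y ends up with exactly the input it would have had without X in the middle.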

Lars
