Re: Roundtripping in Unicode

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Dec 13 2004 - 16:04:19 CST


    Ken is absolutely right. It would be theoretically possible to add 128 code
    points that would allow one to roundtrip a bytestream after passing through
    a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
    2048 code points that would allow the same for a 16-bit data stream.)
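
A minimal sketch of the idea in Python, not part of the original thread.
It uses Python's `surrogateescape` error handler, which maps each
undecodable byte 0x80..0xFF to one of the 128 code points U+DC80..U+DCFF.
That choice of range is Python's own convention and is only an assumption
here; the proposal under discussion did not name it. The sample bytes are
made up for illustration.

```python
# Sketch: carry invalid bytes through a decode/encode cycle by mapping each
# undecodable byte 0x80..0xFF to a reserved code point (U+DC80..U+DCFF here).

# Hypothetical input: two stray bytes followed by valid UTF-8 ("caf\u00e9").
raw = b"abc\xe9\xff caf\xc3\xa9"

# Decoding keeps the stray bytes as escape code points instead of failing.
text = raw.decode("utf-8", errors="surrogateescape")

# The code points that stand in for the bytes that were not valid UTF-8.
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler turns the escapes back into the raw bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(restored == raw)
```

The byte stream survives the round trip, but only because both directions
use the same handler and the same 128 code points.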

    However, these new code points would really be no better than private use
    code points, since their interpretation would depend entirely on whatever
    was assumed to be the interpretation of the original bytestream. If X
    converted a bytestream that was assumed to be a mixture of ISO 8859-7 and
    UTF-8 into Unicode with these new characters, and handed it off to Y, who
    converted it back assuming that the odd bytes were ISO 8859-9, you would
    get data corruption. X and Y would have to agree on
    the interpretation of these odd bytes to avoid that corruption, so it is
    really no different than private use (where they also have to agree on the
    interpretation).
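
The X and Y scenario can be sketched in Python as follows. This is an
illustration under stated assumptions, not anything proposed in the thread:
the escape range U+DC80..U+DCFF is borrowed from Python's `surrogateescape`
handler, and `interpret_odd_bytes` is a hypothetical helper written for this
example.

```python
def interpret_odd_bytes(text: str, assumed_charset: str) -> str:
    """Replace each escaped byte (U+DC80..U+DCFF) with the character that
    byte stands for under assumed_charset. Hypothetical helper."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte value and decode it on its own.
            out.append(bytes([cp - 0xDC00]).decode(assumed_charset))
        else:
            out.append(ch)
    return "".join(out)


# Made-up input: one stray byte 0xE1, then valid UTF-8 for GREEK SMALL ALPHA.
raw = b"\xe1 = \xce\xb1"
carried = raw.decode("utf-8", errors="surrogateescape")

x_view = interpret_odd_bytes(carried, "iso8859_7")  # X assumes Greek
y_view = interpret_odd_bytes(carried, "iso8859_9")  # Y assumes Turkish Latin

print(x_view)
print(y_view)
```

The escaped code point itself carries no meaning: X reads the stray byte as
a Greek alpha, Y reads it as a Latin a with acute, and nothing in the data
says which one is right.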

    Mark

    ----- Original Message -----
    From: "Kenneth Whistler" <kenw@sybase.com>
    To: <lars.kristan@hermes.si>
    Cc: <unicode@unicode.org>
    Sent: Monday, December 13, 2004 13:04
    Subject: RE: Roundtripping in Unicode

    > Lars Kristan stated:
    >
    > > I said, the choice is yours. My proposal does not prevent you from
    > > doing it your way. You don't need to change anything and it will
    > > still work the way it worked before. OK? I just want 128 codepoints
    > > so I can make my own choice.
    >
    > You have them: U+EE80..U+EEFF, which are yours to use (or abuse)
    > in an application as you see fit. Just don't expect others outside
    > your application to interpret them as you do.
    >
    > > And once and for all, you can treat those 128 codepoints just as you
    > > do today.
    >
    > A number of people on the list have patiently explained why what
    > you are proposing to do fundamentally breaks UTF-8 and its
    > relationship to other Unicode encoding forms.
    >
    > The chances that you will get the standard extended to incorporate
    > these 128 code points and define their mapping to invalid byte
    > values in UTF-8 is somewhere between zilch, nada, and nil.
    >
    > --Ken



    This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 16:12:09 CST