RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 09:52:37 CST

    Marcin 'Qrczak' Kowalczyk wrote:
    > Lars Kristan <lars.kristan@hermes.si> writes:
    >
    > > The other name for this is roundtripping. Currently, Unicode allows
    > > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
    > > several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
    > > valuable, even if it means that the other roundtrip is no longer
    > > guaranteed:
    >
    > It's essential that any UTF-n can be translated to any other without
    > loss of data. Because it allows to use an implementation of the given
    > functionality which represents data in any form, not necessarily the
    > form we have at hand, as long as correctness is concerned. Avoiding
    > conversion should matter only for efficiency, not for correctness.
    When I talk about roundtripping, I mean arbitrary data, not just valid
    data. Roundtripping of valid data is of course essential and must be
    preserved.
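
    To make concrete what a roundtrip for arbitrary data looks like, here is
    a minimal sketch using Python 3's "surrogateescape" error handler. It is
    not the 128-codepoint proposal itself, but it uses the same kind of
    mechanism: each stray byte is mapped to a reserved code point, so
    UTF-8 => string => UTF-8 is lossless even for invalid input.

        # Sketch only: surrogateescape maps each invalid byte 0x80-0xFF to
        # a reserved code point (U+DC80-U+DCFF), so arbitrary bytes survive
        # the decode/encode cycle unchanged.
        raw = b"valid \xc3\xa9 then invalid \xff\xfe bytes"

        text = raw.decode("utf-8", errors="surrogateescape")   # never fails
        back = text.encode("utf-8", errors="surrogateescape")  # same bytes
        assert back == raw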

    >
    > > Let me go a bit further. A UTF-16=>UTF-8=>UTF-16 roundtrip is only
    > > required for valid codepoints other than the surrogates. But it also
    > > works for surrogates unless you explicitly and intentionally break it.
    >
    > Unpaired surrogates are not valid UTF-16, and there are no surrogates
    > in UTF-8 at all, so there is no point in trying to preserve UTF-16
    > which is not really UTF-16.
    Actually, there is a point; it is just that you fail to see it. But then,
    you needn't worry about it, since it is outside your area of interest. So,
    as far as you are concerned, I can do anything I like with surrogates,
    right? If the UTC took 128 unassigned codepoints and declared them to be a
    new set of surrogates, you wouldn't need to worry either (your valid data
    would still convert to any UTF). Unless you have a strict validator that
    already rejects unpaired surrogates. But you don't. I am pretty sure of
    that.
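
    For illustration, here is the validation point in runnable form (a Python
    3 sketch; the lone surrogate is made-up data): a strict encoder rejects an
    unpaired surrogate, while a lenient path carries it through a UTF-8
    roundtrip untouched.

        lone = "\ud800"                  # an unpaired high surrogate

        try:
            lone.encode("utf-8")         # a strict validator rejects it
        except UnicodeEncodeError:
            pass

        # A lenient path ("surrogatepass") keeps the roundtrip working.
        blob = lone.encode("utf-8", errors="surrogatepass")
        assert blob.decode("utf-8", errors="surrogatepass") == lone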

    >
    > > I would opt for the latter (i.e. keep it working), according to my
    > > statement (in the thread "When to validate") that validation should
    > > be separated from other processing, where possible.
    >
    > Surely it should be separated: validation is only necessary when data
    > are passed from the external world to our system. Internal operations
    > should not produce invalid data from valid data. You don't have to
    > check at each point whether data is valid. You can assume that it is
    > always valid, as long as the combination of the programming language,
    > libraries and the program is not broken.
    >
    > Some languages make it easier to ensure that strings are valid, to the
    > point that they guarantee it (they don't offer any way to construct
    > an invalid string). Unfortunately many languages don't: they say that
    > they represent strings in UTF-8 or UTF-16, but they are unsafe, they
    > do nothing to prevent constructing an array of words which is not
    > valid UTF-8 or UTF-16 and passing it to functions which assume that
    > it is. Blame these languages, not the definitions of UTF-n.
    Blaming solves nothing; in this case it is just a philosophical exercise.
    If a user encounters corrupt data and cannot process it with your program,
    she ("she" is politically correct, though here it may read as sexism) will
    blame the program, not the data. The fact that your program conforms to
    the Unicode standard doesn't help you. Another program that doesn't
    conform might work. If the user chooses that other program instead of
    yours, who will you blame?

    >
    > > All this is known and presents no problems, or - only problems that
    > > can be kept under control. So, by introducing another set of 128
    > > 'surrogates', we don't get a new type of a problem, just another
    > > instance of a well known one.
    >
    > Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
    > like to break this. No way.
    Not in a way you would need to worry about. Did UTF-16 break UCS-2? No,
    because the codepoints that were assigned to surrogates had not been used
    before. Same thing here.
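
    The analogy is visible in the encoding forms themselves (a Python 3
    sketch): a supplementary character is carried in UTF-16 by a pair of code
    units taken from the D800-DFFF block, which was unassigned in UCS-2, so
    old valid data was never affected.

        # U+10000 becomes the surrogate pair D800 DC00 in UTF-16.
        assert "\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"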

    > > On top of it, I repeatedly stressed that it is UTF-8 data that has
    > > the highest probability of any of the following:
    > > * contains portions that are not UTF-8
    > > * is not really UTF-8, but user has UTF-8 set as default encoding
    > > * is not really UTF-8, but was marked as such
    > > * a transmission error not only changes data but also creates
    > > invalid sequences
    >
    > In this cases the data is broken and the damage should be signalled as
    > soon as possible, so the submitter can know this and correct it.
    This was discussed several mails back. UNIX filenames are already
    'submitted'. Once you set your locale to UTF-8, you have labelled them all
    as UTF-8. Suggestions?
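
    To make the filename case concrete, here is a sketch (POSIX assumed,
    Python 3, escape-style decoding as above): a directory listing taken as
    bytes can be decoded for display without ever failing, and the exact
    original bytes are recovered when the file is actually accessed.

        import os

        for raw_name in os.listdir(b"."):   # raw byte names, never rejected
            shown = raw_name.decode("utf-8", errors="surrogateescape")
            # The exact original bytes come back for real file access.
            assert shown.encode("utf-8", errors="surrogateescape") == raw_name
            print(repr(shown))               # repr keeps escapes printable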

    Lars


