RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 08 2004 - 20:45:59 CST


    Lars responded:

    > > ... Whatever the solutions
    > > for representation of corrupt data bytes or uninterpreted data
    > > bytes on conversion to Unicode may be, that is irrelevant to the
    > > concerns on whether an application is using UTF-8 or UTF-16
    > > or UTF-32.

    > The important fact is that if you have an 8-bit based program, and you
    > provide a locale to support UTF-8, you can keep things working (unless you
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^

    You can keep *some* things *sorta* working.

    If you don't make the effort to actually upgrade software to
    use the standard *conformantly*, then it is no real surprise when
    data corruptions creep in, characters get mislaid, and some things
    don't work the way they should.
                                        
    > prescribe validation). But you cannot achieve the same if you try to base
    > your program on 16 or 32 bit strings.

    Of course you can. You just have to rewrite the program to handle
    16-bit or 32-bit strings correctly. You can't pump them through
    8-bit pipes or char* APIs, but it's just silly to try that, because
    they are different animals to begin with.
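
    Just to make "handling 16-bit strings correctly" concrete: at any
    genuinely 8-bit boundary you convert, you don't pump raw code units
    through. Here is a minimal sketch of such a conversion (illustrative
    only; the function name and the error handling are made up for this
    example):

        /* Sketch: convert a UTF-16 string (with surrogate pairs) to UTF-8.
         * Illustrative only; errors are reduced to returning -1 on a lone
         * surrogate or a too-small output buffer.
         */
        #include <stddef.h>
        #include <stdint.h>

        ptrdiff_t utf16_to_utf8(const uint16_t *src, size_t srclen,
                                char *dst, size_t dstlen)
        {
            size_t o = 0;
            for (size_t i = 0; i < srclen; i++) {
                uint32_t c = src[i];
                if (c >= 0xD800 && c <= 0xDBFF) {      /* high surrogate */
                    if (i + 1 >= srclen ||
                        src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF)
                        return -1;                     /* lone surrogate */
                    c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                    i++;
                } else if (c >= 0xDC00 && c <= 0xDFFF) {
                    return -1;                         /* unpaired low surrogate */
                }
                if (c < 0x80) {                        /* 1-byte form */
                    if (o + 1 > dstlen) return -1;
                    dst[o++] = (char)c;
                } else if (c < 0x800) {                /* 2-byte form */
                    if (o + 2 > dstlen) return -1;
                    dst[o++] = (char)(0xC0 | (c >> 6));
                    dst[o++] = (char)(0x80 | (c & 0x3F));
                } else if (c < 0x10000) {              /* 3-byte form */
                    if (o + 3 > dstlen) return -1;
                    dst[o++] = (char)(0xE0 | (c >> 12));
                    dst[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
                    dst[o++] = (char)(0x80 | (c & 0x3F));
                } else {                               /* 4-byte form */
                    if (o + 4 > dstlen) return -1;
                    dst[o++] = (char)(0xF0 | (c >> 18));
                    dst[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
                    dst[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
                    dst[o++] = (char)(0x80 | (c & 0x3F));
                }
            }
            return (ptrdiff_t)o;
        }

    The surrogate logic and the byte layout are both completely
    specified, so there is nothing to guess at the boundary.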

    By the way, I participated as an engineer in a multi-year project
    that shifted an advanced, distributed data analysis system
    from an 8-bit character set to 16-bit Unicode. *All* user-visible string
    processing was converted over -- and that included proprietary
    file servers, comm servers, database gateways, networking code,
    a proprietary 32-bit workstation GUI implementation, and a suite
    of object-oriented application tools, including a spreadsheet,
    plotting tool, query and database reporting tools, and much more.
    It worked cross-platform, too.

    It was completed, running, and *delivered* to customers in 1994,
    a decade ago.

    You can't bamboozle me with any of this "it can't be done with
    16-bit strings" BS.

    > Or, again, you really cannot with 16
    > bit (UTF-16),

    Yes you can.

    > and you sort of can with 32 bit (UTF-32), but must resort to
    > values above 21 bits.

    No, you need not -- that is non-conformant, besides.
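
    For the record, a conformant UTF-32 code unit never needs more than
    21 bits; the whole range check fits in one line (sketch only):

        /* Sketch: the conformant range check for a UTF-32 code unit.
         * Nothing above 0x10FFFF, and no surrogate values, is ever
         * needed or allowed.
         */
        #include <stdint.h>

        int is_valid_utf32(uint32_t c)
        {
            return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
        }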

    > Again, nothing standardized there, nothing defined for
    > how functions like isspace should react and so on.

    That is wrong, too. The standard information that people seek
    is in the Unicode Character Database:

    http://www.unicode.org/Public/UNIDATA/

    And there are standard(*) libraries such as ICU that provide public
    APIs for programs to use to get the kind of behavior they need.

    (*) Just because a library isn't an International Standard does
    not mean that it is not a de facto standard that people can
    and do rely upon for such program behavior.
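
    To give one concrete example, here is a minimal sketch using the
    character property functions in ICU4C's uchar.h, which are driven by
    the Unicode Character Database (it assumes ICU is installed and the
    program is linked against the common library, e.g. -licuuc; the
    sample code point is arbitrary):

        /* Sketch: asking ICU4C for Unicode character properties instead
         * of relying on the C library's locale-dependent isspace().
         */
        #include <stdio.h>
        #include <unicode/uchar.h>

        int main(void)
        {
            UChar32 c = 0x3000;   /* IDEOGRAPHIC SPACE, outside Latin-1 */

            printf("U+%04X White_Space=%d general_category=%d\n",
                   c,
                   u_isUWhiteSpace(c),   /* Unicode White_Space property */
                   (int)u_charType(c));  /* General_Category, from the UCD */
            return 0;
        }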

    You can't expect to just rely upon the C or C++ standards
    and POSIX to solve all your application problems, but there
    are perfectly good solutions working out there, in UTF-8,
    in UTF-16, and in UTF-32. (Or in combinations of those.)

    > And it's about the fact that it is far more likely that this
    > happens to UTF-8 data (or that some legacy data is mistakenly labelled or
    > assumed to be UTF-8).
    > UTF-16 data is far cleaner than 8-bit data. Basically because you had to
    > know the encoding in order to store the data in UTF-16.

    Actually, I think this should be characterized as follows: software
    engineers writing software for UTF-16 are likely to do a better job
    of handling characters, because they have to, whereas a lot of
    stuff using UTF-8 just slides by, because people think they can
    ignore character set issues long enough that when the problem
    finally occurs, it can no longer be traced to mistakes they made or
    are still held responsible for. ;-)

    > UTF-8 is what solved the problems on UNIX. It allowed UNIX to process
    > Windows data. Alongside its own.
    > It is Windows that has problems now. And I think roundtripping is the
    > solution that will allow Windows to process UNIX data. Without dropping data
    > or raising exceptions. Alongside its own.

    I just don't understand these assertions at all.

    First of all, it isn't "UNIX data" or "Windows data" -- it is
    end users' data, which happens to be processed in software
    systems which in turn are running on a UNIX or Windows OS.

    I work for a company that *routinely* runs applications that
    cross the platform barriers in all sorts of ways. It works
    because character sets are handled conformantly, and conversions
    are done carefully at platform boundaries -- not because some
    hack has been added to UTF-8 to preserve data corruptions.

    > > There's more to it, of course, but this is, I believe, at the
    > > bottom of the reason why, for 12 years now, people have been
    > > fundamentally misunderstanding each other about UTF-8.
    > Is it 12? Thought it was far less.

    Yes. The precursor of UTF-8 was dreamed up around 1992.

    > Off topic, when was UTF-8 added to
    > Unicode standard?

    In Unicode 1.1, Appendix F, then known as "FSS-UTF", in 1993.

    > Quite close. Except for the fact that:
    > * U+EE93 is represented in UTF-32 as 0x0000EE93
    > * U+EE93 is represented in UTF-16 as 0xEE93
    > * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)

    Utterly non-conformant.
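
    For the record, the conformant representations fall straight out of
    the definitions; a small sketch (the code point is taken from the
    example above, nothing else about it is special):

        /* Sketch: the conformant UTF-8 encoding of U+EE93 is derived
         * mechanically from the code point's bits -- there is no
         * latitude to emit a bare 0x93 byte.
         */
        #include <stdio.h>

        int main(void)
        {
            unsigned int cp = 0xEE93;            /* a BMP private-use code point */
            unsigned char utf8[3];

            utf8[0] = 0xE0 | (cp >> 12);         /* 1110xxxx -> 0xEE */
            utf8[1] = 0x80 | ((cp >> 6) & 0x3F); /* 10xxxxxx -> 0xBA */
            utf8[2] = 0x80 | (cp & 0x3F);        /* 10xxxxxx -> 0x93 */

            printf("UTF-32: 0x%08X  UTF-16: 0x%04X  UTF-8: %02X %02X %02X\n",
                   cp, cp, utf8[0], utf8[1], utf8[2]);
            return 0;
        }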

    >
    > Which could be understood as "a proposal to amend UTF-8 to allow invalid
    > sequences".

    O.k., and as pointed out already, that simply won't fly. *Nobody*
    in the UTC or WG2 is going to go for that. It would destroy
    UTF-8, not fix it.

    > Suppose unpaired surrogates are in fact legalized for this purpose.

    Also utterly non-conformant.

    > From the perspective of plain text, yes, roundtrip for invalid sequences in
    > UTF-8 has nothing to do with it. It would be great if there was no need for
    > it. Storing arbitrary binary data would then be just a proposal for a thing
    > that doesn't belong in Unicode. The proposed codepoints for roundtripping can
    > indeed be misused for (or misinterpreted as) storing binary data. But this
    > fact does not constitute an argument against them.

    Well, I think it does, actually. It *is* a storing of binary data.
    Not *arbitrary* binary data -- you wouldn't use this to store
    pictures in text. But it is binary data -- byte values representing
    uninterpreted values in a byte stream.

    > If the purpose of Unicode is to define bricks for plain text, then what
    > the hell are the surrogates doing in there?

    This seems to represent a complete misunderstanding of the Unicode
    character encoding forms.

    This is completely equivalent to examining all the UTF-8 bytes
    and then asking "what the hell are 0x80..0xF4 doing in there?"
    And if you don't understand the analogy, then I submit that
    you don't understand Unicode character encoding forms. Sorry.
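
    To spell the analogy out, here is a tiny sketch encoding one
    supplementary-plane code point (U+1D11E, chosen arbitrarily) in both
    forms. The surrogate code units play exactly the structural role in
    UTF-16 that lead and trail bytes play in UTF-8:

        /* Sketch: U+1D11E (MUSICAL SYMBOL G CLEF) in UTF-16 and UTF-8.
         * Surrogates, like UTF-8 lead/trail bytes, are code units that
         * never stand for code points on their own.
         */
        #include <stdio.h>

        int main(void)
        {
            unsigned int cp = 0x1D11E;

            /* UTF-16: split the offset above 0x10000 into two surrogates */
            unsigned int u = cp - 0x10000;
            unsigned short hi = 0xD800 | (u >> 10);        /* 0xD834 */
            unsigned short lo = 0xDC00 | (u & 0x3FF);      /* 0xDD1E */

            /* UTF-8: one lead byte and three trail bytes */
            unsigned char b0 = 0xF0 | (cp >> 18);          /* 0xF0 */
            unsigned char b1 = 0x80 | ((cp >> 12) & 0x3F); /* 0x9D */
            unsigned char b2 = 0x80 | ((cp >> 6) & 0x3F);  /* 0x84 */
            unsigned char b3 = 0x80 | (cp & 0x3F);         /* 0x9E */

            printf("UTF-16: %04X %04X   UTF-8: %02X %02X %02X %02X\n",
                   hi, lo, b0, b1, b2, b3);
            return 0;
        }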

    > > Storage of UNIX filenames on Windows databases, for example,
    > > can be done with BINARY fields, which correctly capture the
    > > identity of them as what they are: an unconvertible array of
    > > byte values, not a convertible string in some particular
    > > code page.
    > Sigh. Storing is just a start. Windows filenames are also stored in the same
    > database. And eventually, you need to have data from both of them in the
    > same output.

    Then you need an application architecture that is sophisticated
    enough to maintain character set state and deal with it correctly.
    You can't just use 8-bit pipes, wave your hands, and assume that
    it will all work out in the end.

    > Or, for example, one might want to compare filenames from one
    > platform with the filenames from the other. All this is impossible in
    > UTF-16.

    Nonsense.
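
    A sketch of why it is nonsense: decode both filenames to code points
    and compare those. The helper names here are made up, and real code
    would also validate continuation bytes and worry about normalization:

        #include <stddef.h>
        #include <stdint.h>

        /* Decode one code point from UTF-8; returns bytes consumed, 0 on error. */
        static size_t next_utf8(const unsigned char *s, size_t len, uint32_t *out)
        {
            if (len == 0) return 0;
            if (s[0] < 0x80)        { *out = s[0]; return 1; }
            if ((s[0] & 0xE0) == 0xC0 && len >= 2) {
                *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F); return 2;
            }
            if ((s[0] & 0xF0) == 0xE0 && len >= 3) {
                *out = ((uint32_t)(s[0] & 0x0F) << 12) |
                       ((uint32_t)(s[1] & 0x3F) << 6)  | (s[2] & 0x3F); return 3;
            }
            if ((s[0] & 0xF8) == 0xF0 && len >= 4) {
                *out = ((uint32_t)(s[0] & 0x07) << 18) |
                       ((uint32_t)(s[1] & 0x3F) << 12) |
                       ((uint32_t)(s[2] & 0x3F) << 6)  | (s[3] & 0x3F); return 4;
            }
            return 0;   /* ill-formed: caller decides how to report it */
        }

        /* Decode one code point from UTF-16; returns units consumed, 0 on error. */
        static size_t next_utf16(const uint16_t *s, size_t len, uint32_t *out)
        {
            if (len == 0) return 0;
            if (s[0] < 0xD800 || s[0] > 0xDFFF) { *out = s[0]; return 1; }
            if (s[0] <= 0xDBFF && len >= 2 &&
                s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
                *out = 0x10000 +
                       (((uint32_t)(s[0] - 0xD800) << 10) | (s[1] - 0xDC00));
                return 2;
            }
            return 0;   /* unpaired surrogate: ill-formed */
        }

        /* 1 if equal as code point sequences, 0 if not, -1 on ill-formed input. */
        int same_filename(const unsigned char *u8, size_t n8,
                          const uint16_t *u16, size_t n16)
        {
            while (n8 > 0 && n16 > 0) {
                uint32_t a, b;
                size_t da = next_utf8(u8, n8, &a);
                size_t db = next_utf16(u16, n16, &b);
                if (da == 0 || db == 0) return -1;
                if (a != b) return 0;
                u8  += da; n8  -= da;
                u16 += db; n16 -= db;
            }
            return (n8 == 0 && n16 == 0) ? 1 : 0;
        }

    Ill-formed input is reported, not silently smuggled through -- which
    is exactly the difference between careful conversion at a boundary
    and a roundtripping hack.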

    > > with whatever escape you need in place to deal with your escape
    > > convention itself. In either case, the essential problem is
    > First, I am glad you are not proposing this approach for my problem. There
    > is a concern with size there, which is why I used the PUA from BMP and not
    > the other one (although it would be safer, perhaps). And why I am speaking
    > of defining these codepoints in BMP.
    >
    > OK, let's take a look at escaping. It works fine if there are few errors and
    > if the intent is to read a document. A self descriptive escape would then be
    > suitable.

    O.k. we can agree on that much. :-)

    > It wouldn't work well if there are many errors. The text would lose its
    > original form.

    That's the point at which you either need to get more sophisticated
    (or at least *correct*) software, or you start hitting Delete, because
    you are getting trash.

    > > getting applications to universally support the convention
    > > for maintaining and interpreting the corrupt bytes. Simply
    > > encoding 128 characters in the Unicode Standard ostensibly to
    > > serve this purpose is no guarantee whatsoever that anyone would
    > > actually implement and support them in the universal way you
    > > envision, any more than they might a "=93", "=94" convention.
    > Are you really saying that whatever is standardized has no better chance of
    > being used than anything else?

    Yep.

    > Can this really be used as a
    > counter-argument?

    Yep.

    It's as follows: why bother to standardize some speculative
    solution to a problem that has other possible approaches that
    are theoretically more sound, particularly when there is
    no guarantee whatsoever that the speculative solution would
    be used in practice or solve the problem?

    --Ken


