Re: UTF-8, U+0000 and Software Development (was: Re: New UTF-8 decoder stress test file)

From: Karl Pentzlin (karl-pentzlin@acssoft.de)
Date: Sun Sep 26 1999 - 17:10:20 EDT


-----Original Message-----
From: Paul Dempsey (Exchange) <paulde@Exchange.Microsoft.com>
To: 'Karl Pentzlin' <karl-pentzlin@acssoft.de>; Unicode List
<unicode@unicode.org>
Sent: Sunday, 26 September 1999 21:11
Subject: RE: UTF-8, U+0000 and Software Development (was: Re: New UTF-8
decoder stress test file)

> Using UTF-8 to represent a 0 byte without 0-valued bytes is misusing UTF-8
> (at least for text interchange).
>
> ...
>
> I've written quite a lot of text-processing code in C/C++ that handles
> embedded NUL characters. There's nothing intrinsic to the language that
> makes it especially difficult. I just don't use much of the standard ISO C
> library.
That is (to exaggerate somewhat): in order to conform to one standard
(UTF-8, which encodes U+0000 strictly as a 0 byte), you decide against
another standard (the ISO standard C library) - a standard which was
also made for interchange, namely for program source interchange
between different operating systems.

The other point is: C/C++/Delphi/Java programmers *will* use the
0xC0 0x80 encoding of U+0000 regardless of whether it strictly conforms
to the standard or not, as it makes life much easier for them (and
their work cheaper for their bosses). Therefore, you *will* find
0xC0 0x80 in text interchange files, whether the standard allows it or
not, and therefore real-world applications *will* treat this encoding
correctly (especially as the deviation from the written standard is so
small, is transparent to all users, and U+0000 is not especially
frequent in real text anyway - thus the cost/benefit ratio will in no
case justify strict standard conformance economically). So, maybe in
about 10 or 30 years, the standard will bow to the "normative Kraft
des Faktischen" (~ normative power of the facts) and declare 0xC0 0x80
as correct. That's the way it goes in the real outside world. Maybe it
is better to consider that in advance, even if it requires sacrificing
purity.

Regards
Karl Pentzlin
AC&S Analysis Consulting & Software GmbH
Ganghoferstraße 128
D-81373 München, Germany

>
> Thanks,
> --- Paul Chase Dempsey
> Microsoft Visual Studio Text Editor Development
>
> > -----Original Message-----
> > From: Karl Pentzlin [mailto:karl-pentzlin@acssoft.de]
> > Sent: Sunday, September 26, 1999 11:21 AM
> > To: Unicode List
> > Subject: UTF-8, U+0000 and Software Development (was: Re: New UTF-8
> > decoder stress test file)
> >
> >
> > Software developers (especially those using the languages C,
> > C++, Delphi et
> > al.) have to deal with byte sequences which must not contain
> > any byte with
> > value 0, because 0 denotes the end of the byte sequence.
> > While there was no possibility for a string in any 8-bit code
> > (where all characters are "encoded" by their byte values
> > themselves) to contain a character of value 0 (as long as you
> > confine yourself to the standard library functions for strings),
> > this may change when you encode your character sequences using
> > UTF-8 - as long as you are allowed to encode U+0000 as 0xC0 0x80
> > (i.e. 11000000 10000000). If UTF-8 (for good reasons outside of
> > software development concerns) disallows 11000000 10000000,
> > programmers are again confronted with the problem of being able to
> > encode any value but U+0000 within strings, although UTF-8 could
> > solve this problem.
> >
> > There are two possible solutions:
> >
> > 1. To allow two "conformance levels" of UTF-8:
> > a. "strict": U+0000 has to be encoded as 00000000
> > b. "special" (or named whatever): U+0000 may (or even may
> > only) be encoded
> > as 11000000 10000000
> >
> > 2. To regard UTF-8(-like) sequences which may contain
> > 11000000 10000000 but not 00000000 as a "meta-encoding", i.e. the
> > UTF-8 sequence is encoded by itself: 00000000 is encoded as
> > 11000000 10000000, which can be decoded unambiguously back to
> > 00000000, as 11000000 10000000 is not a valid UTF-8 sequence and
> > therefore cannot stand for itself, as all other bytes (and byte
> > sequences) do.
> >
> > In my opinion, this discussion should be continued together with the
> > standard bodies concerned with the programming languages.
> >
> > Regards
> > Karl Pentzlin
> > AC&S Analysis Consulting & Software GmbH
> > Ganghoferstraße 128
> > D-81373 München, Germany
> >
> > -----Original Message-----
> > From: Valeriy E. Ushakov <uwe@ptc.spbu.ru>
> > To: Unicode List <unicode@unicode.org>
> > Cc: Unicode List <unicode@unicode.org>; <linux-utf8@humbolt.geo.uu.nl>
> > Sent: Sunday, 26 September 1999 19:11
> > Subject: Re: New UTF-8 decoder stress test file
> >
> >
> > > On Sun, Sep 26, 1999 at 09:22:26AM -0700, Markus Kuhn wrote:
> > >
> > > > 4.3 Overlong representation of the NUL character
> > > >
> > > > The following five sequences should also be rejected like
> > malformed
> > > > UTF-8 sequences and should not be treated like the ASCII NUL
> > > > character.
> > > >
> > > > 4.3.1 U+0000 = c0 80 = "?"
> > >
> > > I believe that's exactly what the JDK uses to encode U+0000 in
> > > UTF-8 encoded NUL-terminated C strings, to distinguish a U+0000
> > > which is part of a string from the terminating NUL. I can't find
> > > the reference, though.
> > >
> > > SY, Uwe
> >



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT