From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 27 2003 - 18:04:06 EST
Tex Texin asked:
> Hmm, is that true?
Yes, it is true. All the standard *mandates* is what I quoted
previously in this thread:
"C12a When a process interprets a code unit sequence which purports
to be in a Unicode character encoding form, it shall treat
ill-formed code unit sequences as an error condition, and
shall not interpret such sequences as characters."
> Is it ok then, if I detect an unpaired surrogate, mutter
> "oops I have an error" and then drop that surrogate and continue processing
> the file, resulting in a valid utf-8 file?
Hmm, I think you may be mixing the UTF-16 case with the UTF-8
case, but...
If that is what you tell your customers, clients, or calling APIs
you are explicitly doing to corrupted, ill-formed UTF-8 data, and
if they think that is o.k., then you've got two happy users of
the standard.*
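(For the UTF-16 half of that question, the unpaired surrogate part,
the detection step might look something like the sketch below, in
Python for concreteness. This is purely illustrative; the function
name and the "report the indices" policy are mine, not anything the
standard requires.)

    # Illustrative only: scan UTF-16 code units and *report* unpaired
    # surrogates rather than interpreting them as characters.
    def find_unpaired_surrogates(code_units):
        i, n = 0, len(code_units)
        while i < n:
            u = code_units[i]
            if 0xD800 <= u <= 0xDBFF:          # high (lead) surrogate
                if i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                    i += 2                     # well-formed surrogate pair
                    continue
                yield i                        # unpaired high surrogate
            elif 0xDC00 <= u <= 0xDFFF:        # isolated low (trail) surrogate
                yield i
            i += 1

What the caller then does with those positions, raise an error,
substitute U+FFFD, log and refuse the file, is exactly the design
decision that should be spelled out for your users.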
The problem, of course, is that if you are implementing a public
API or service, just dropping corrupted bytes in a sequence can
create security problems or other difficulties, and people would
be well-advised to avoid software that claims to
"auto-fix corrupted data", at least in such a crude way.
> I thought for some reason this was prohibited, but if the standard does not
> prescribe error handling, then this seems legit.
The basic constraint is that "conformant processes cannot interpret
ill-formed code unit sequences." Beyond that, the UTC has, from time
to time, tried to provide some guidance regarding what is or is
not reasonable for a process to do when confronted with bad data
of this type, but spelling out in any kind of detail what a process
should do with bad data is essentially out of scope for the standard.
Think of it this way. Does anyone expect the ASCII standard to tell,
in detail, what a process should or should not do if it receives
data which purports to be ASCII, but which contains an 0x80 byte
in it? All the ASCII standard can really do is tell you that
0x80 is not defined in ASCII, and a conformant process shall not
interpret 0x80 as an ASCII character. Beyond that, it is up to
the software engineers to figure out who goofed up in mislabelling
or corrupting the data, and what the process receiving the bad data
should do about it.
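In code, the most the receiving process can do on its own is detect
the problem; deciding what happens next is the engineering part. A
trivial sketch (the helper name is mine):

    # Illustrative only: ASCII defines nothing above 0x7F, so a receiver
    # can at most locate the first byte it is not allowed to interpret.
    def first_non_ascii_byte(data: bytes):
        for i, b in enumerate(data):
            if b > 0x7F:
                return i     # position of the offending byte
        return None          # the data really is well-formed ASCII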
--Ken
*Example: You have dedicated signal-processing software dealing
with a data link messaging astronauts orbiting Titan. That data
link is using UTF-8, uncompressed, for some reason, and you are
having trouble with data dropouts. Your solution is to transmit
every message 3 times, drop any corrupted sections, and then
use a best-match algorithm of some sort to compare the 3
copies and fill in any missing sections from the versions that
are not corrupted, thus reconstructing the complete message. Of course,
there are much better approaches to self-correcting data
transmission, but you get the idea. This would be a perfectly
valid and conformant way to use UTF-8 data.
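A much cruder cousin of that idea, in Python, just to show the
shape of it: assume each of the 3 transmissions arrives as an
aligned list of chunks, with None standing in for a chunk whose
UTF-8 did not validate. (The chunking and the None markers are
assumptions of the sketch, not anything the standard or the
scenario dictates.)

    # Illustrative only: rebuild one message from three transmissions,
    # using only chunks that survived validation.
    def reconstruct(copy_a, copy_b, copy_c):
        rebuilt = []
        for chunks in zip(copy_a, copy_b, copy_c):
            good = [c for c in chunks if c is not None]
            if not good:
                raise ValueError("all three copies corrupted here")
            # Prefer a chunk at least two copies agree on; otherwise
            # take whichever chunk survived.
            rebuilt.append(next((c for c in good if good.count(c) >= 2),
                                good[0]))
        return b"".join(rebuilt)

Only the surviving, well-formed pieces ever get interpreted as
characters, which is what keeps the whole scheme conformant.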
>
> tex
>
>
> Kenneth Whistler wrote:
> > Absolutely. Error handling is a matter of software design, and not
> > something mandated in detail by the Unicode Standard.