RE: Missing values in mapping-tables?

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 18 2002 - 19:44:19 EST


Lars Kristan suggested:

> OK, another way of looking at all this. I believe you would accept three
> options:
> A - Reject the stream.
> B - Drop the invalid data.

If you were defining an application concerned with security, and if
you had a clearly defined conversion you were performing, then yes,
these would be valid options: if your conversion table is correctly
defined and the data doesn't fit it, you are being fed garbage.

> C - Replace the invalid characters with U+FFFD (the replacement character).

This, however, is the more graceful and robust way to handle conversions
that are undefined in your conversion table -- and is the way recommended
by the Unicode Standard.
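
For concreteness, a minimal sketch of the three options, using
Python's built-in codecs (the sample bytes here are arbitrary, chosen
purely for illustration):

    data = b"\xa4\xb3\xa4\xf3\xff\xa4\xcb"   # EUC-JP text with a stray 0xFF byte

    try:
        data.decode("euc_jp", errors="strict")             # A: reject the stream
    except UnicodeDecodeError as err:
        print("rejected:", err)

    print(repr(data.decode("euc_jp", errors="ignore")))    # B: drop the invalid data
    print(repr(data.decode("euc_jp", errors="replace")))   # C: substitute U+FFFD

The "replace" form keeps processing while leaving a visible marker at
every point where the damage occurred.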

Your concern about old software behaving gracefully when handed an
updated version of a data stream is a valid one, and one we know we
will keep running into -- the addition of the euro sign to many code
pages is a recent case in point. But if software designers follow the
fallback guidelines (U+FFFD for an unavailable conversion, a missing
glyph for display, and so on), then older software shouldn't choke
when it encounters previously unencoded characters in newer data
streams.
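
As a sketch of what that table-driven fallback looks like (the table
fragment below is hypothetical -- a stand-in for a pre-euro
single-byte code page, not any particular vendor's table):

    # An "old" mapping table written before the euro was assigned at 0x80
    # in the source code page; 0x80 is therefore absent from the table.
    OLD_TABLE = {0x41: "A", 0x42: "B"}        # ...rest of the table elided

    def old_convert(raw):
        # Fallback guideline: anything the table cannot map becomes U+FFFD,
        # so newer data streams do not break the old converter.
        return "".join(OLD_TABLE.get(b, "\uFFFD") for b in raw)

    print(old_convert(b"\x41\x80\x42"))       # -> 'A', U+FFFD, 'B'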

>
> Then my proposal could be viewed as an addition to option C, with one
> difference. Instead of one replacement character, I propose to have 256
> (though in most cases only 128 would be used). Now, what does that violate?

Parsimony and good sense.

And it seems to overlook the fact that, for multi-byte character
encodings, not every byte combination has a defined conversion to
Unicode. What if you were converting EUC-JP to Unicode? Of the 65,536
two-byte combinations, 40,253 are illegal; 7,359 involve at least one
control code, and might be questionable to convert, depending on
context; and of the 8,836 structurally legal A1..FE/A1..FE
combinations, many are not actually defined for JIS X 0208. And then
there are the three-byte combinations introduced by the 0x8F single
shift, most of which are also illegal or
undefined. Are you proposing that we use 256 "GARBAGE CONVERSION BYTE-00"..
"GARBAGE CONVERSION BYTE-FF" characters in arbitrary sequences to replicate
all these illegal values into a Unicode stream if garbage purporting to be
EUC-JP gets pumped at a convertor, just so you can maintain round-trippability
of the garbage? I don't think this is any more useful than throwing an
exception (to the error handler, by the way, not to the secretary on
the third floor), and dumping the input into a sanitary can labelled
"invalid data which was labelled 'EUC-JP' on input".

By the way, just to turn the screw here a little bit, how is legacy
software that correctly uses U+FFFD for unavailable conversions
supposed to react when it comes across new GARBAGE CONVERSION BYTE
characters that were undefined when it was written? How do you expect
unaware conversion implementations to deal with your mechanism for
maintaining convertibility for older software that cannot deal with
new data streams? Right -- they won't handle it correctly, your
garbage-convertibility hints will be garbaged away, and you still
won't get your round-trip garbage.

--Ken


