On Mon, 18 Mar 2002 16:44:19 -0800 (PST), Kenneth Whistler wrote:
> Your concern about old software behaving gracefully when dealing with
> an updated version of a data stream is a valid one that we know we will
> run into -- the additions for the euro sign in many code pages were a
> recent case in point. But if software designers follow the fallback
> guidelines (U+FFFD for unavailable conversion, missing glyph for
> display, and so on) then older software shouldn't choke when
> encountering previously unencoded characters in newer data streams.
Shouldn't choke. Hmmmm. Depends on your definition of choking. Displaying a
replacement character seems good enough if you know which character to
expect. But a series of such characters would become a complete mystery.
Also, I would expect to have a choice between selecting another codeset and
using the (irreversible!) U+FFFD fallback. If so, then I could choke to
death clicking OK more and more frequently. Or choke someone else... Oh yes,
or upgrade. Which is not always free...
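To make the irreversibility concrete, here is a minimal sketch (Python,
with its built-in 'replace' error handling standing in for the U+FFFD
fallback; the byte strings are invented for the example). Two different
mislabeled inputs collapse into the same replacement text, so there is no
way back:

    name_a = b"Gr\xfc\xdfe"   # "Grüße" in ISO 8859-1, fed to a UTF-8 decoder
    name_b = b"Gr\xe4\xdfe"   # a different 8859-1 string

    a = name_a.decode("utf-8", errors="replace")   # 'Gr\ufffd\ufffde'
    b = name_b.decode("utf-8", errors="replace")   # 'Gr\ufffd\ufffde'

    assert a == b   # the original bytes can no longer be recovered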
Then there is the problem of sending the data back. I say 'sending' because
I am thinking of things like unsubscribing, replying and so on. You can say
that it takes a twisted mind to use the euro sign in a user name. Perhaps.
Perhaps the twisted mind of a hacker... Now, can you see roundtripping (even
if it really *is* garbage) as a good thing in *this* context?
>
> > Then my proposal could be viewed as an addition to option C, with one
> > difference. Instead of one replacement character, I propose to have
> > 256 (though in most cases only 128 would be used). Now, what does
> > that violate?
>
> Parsimony and good sense.
>
> And it seems to have overlooked the fact that not all conversions
> are defined on single-byte character encodings to Unicode. What if you
> were converting EUC-JP to Unicode?
I had UTF-8 in mind all along. And yes, any invalid sequence would be
'preserved' byte by byte, just as UTF-8B does. The fact that the number of
replacement characters does not always match the number of unrecognised
(potential) characters is not of paramount importance.
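As a rough sketch of what I have in mind (Python; the codepoint range used
below is a placeholder of my own, not a proposed assignment - and note that
only bytes 0x80..0xFF can ever be invalid in UTF-8, hence 128 values):

    BYTE_SUB_BASE = 0xEE80   # placeholder: byte 0x80 -> U+EE80, ..., 0xFF -> U+EEFF

    def decode_preserving(data: bytes) -> str:
        out = []
        i = 0
        while i < len(data):
            # Try the longest well-formed UTF-8 sequence starting at position i.
            for length in (4, 3, 2, 1):
                chunk = data[i:i + length]
                try:
                    out.append(chunk.decode("utf-8"))
                    i += length
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # Nothing well-formed starts here; the byte is >= 0x80,
                # so preserve it as a substitution codepoint.
                out.append(chr(BYTE_SUB_BASE + (data[i] - 0x80)))
                i += 1
        return "".join(out)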
With MBCS, there could be an implementation problem if a converter (or a
function in it) assumes that a multi-byte combination will always map to a
single codepoint. This is an interesting question on its own, but I will
delay it for now.
To make it clear, my primary interest in replacement characters stems from
UTF-8 -> UTF-16 -> UTF-8 roundtrips. Any possible uses in the SBCS or MBCS
area would be a plus, but are secondary.
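Continuing the placeholder sketch above, a matching encoder plus a
roundtrip check, with the Python string standing in for the UTF-16 side:

    def encode_preserving(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if BYTE_SUB_BASE <= cp <= BYTE_SUB_BASE + 0x7F:
                out.append(0x80 + (cp - BYTE_SUB_BASE))   # restore the preserved byte
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    raw = b"valid \xc3\xa9 / invalid \xfc\xdf"
    assert encode_preserving(decode_preserving(raw)) == raw   # bytes survive the trip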
Later on, Kenneth Whistler wrote:
> purporting to be EUC-JP gets pumped at a convertor, just so you can
> maintain round-trippability of the garbage? I don't think this is any
> more useful than throwing an exception (to the error handler, by the
> way, not to the secretary on the third floor), and dumping the input
> into a sanitary can labelled "invalid data which was labelled 'EUC-JP'
> on input".
The secretary here was an example of an error handler (a reminder that
humans make the decisions, not some virtual error handlers): an error
handler that may be unable to select a different codeset, is not confident
enough to accept data loss, and would like a 'reasonably safe' third
alternative while the IT guy is having a lunch break. And BTW, when
speaking of throwing an exception in this context, I cannot resist the
"throwing out the window" comparison when I know that data will be lost
once the exception is handled. Throwing an exception will be fine with me
when one of the possible actions is to preserve the offending data in a
consistent and officially supported manner.
OK, I have another scenario for you. A UNIX machine and an NT machine,
connected via NFS or whatever. A user on the NT machine (client) stores
files on the UNIX server. Programs on NT use Unicode (UTF-16), while the
UNIX filesystem is - well - raw text. Notice that nothing is 'labeled'.
Conversion will typically be based on the user's locale settings, and note
that setting the codeset on the server side (even if per directory) would
not help at all. Imagine:
1 - A user needs a file, saved by another user who uses different language
settings. I am worried about the filename here, not the content. If the
name is in 8859-1, but the user uses EUC or UTF-8, then the user will
sometimes not be able to open the file (see the sketch after this list).
2 - The user won't even be able to delete such files. Enter the admin,
spending an hour a day dealing with things he "didn't need to deal with
before UTF-8 started to spread".
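A small illustration of scenario 1 (Python, assuming a UNIX-style
filesystem that stores names as raw bytes; the filename is invented for
the example):

    import os, tempfile

    d = tempfile.mkdtemp().encode()                  # work with raw byte paths
    latin1_name = "café".encode("iso-8859-1")        # b'caf\xe9', as another user stored it
    open(os.path.join(d, latin1_name), "w").close()

    for entry in os.listdir(d):                      # the UTF-8 user's view
        try:
            entry.decode("utf-8")
        except UnicodeDecodeError:
            print("filename cannot even be shown:", entry)   # let alone opened or deleted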
Some of this may actually be happening already. Unfortunately, 8859-1 does
not have this problem, because it is 'full', i.e. all 256 values are already
mapped. Unfortunately? Yes, because this makes many people unaware of this
problem. So, as long as you are using 8859-1, you can process all the
garbage you want. And we have been doing it, all this time.
So, in this case, the converter was fed data that was not labeled at all,
but a certain mapping table was suitable enough for most of the data that
someone decided to use it. And a filename that happens to be in another
codeset may be garbage from the converter's point of view, but from the
user's perspective, the converter itself (or the software using such a
converter) is a *piece of garbage*, since it prevents the user from opening
a file that probably contains properly labeled data (and probably no
garbage).
And what if the user sees the problem as being caused by UTF-8? What will
the user do? Stop using UTF-8 and tell everybody about his bad experience
with it?
Failing to process *apparent* garbage may seem like a good approach, but it
only works well in a perfect world. Ignoring existing data that cannot be
labeled will lead to two things:
A - It will slow down the transition to Unicode wherever such data is not
processed as gracefully as possible.
B - It will force developers to invent ways around the problem in order to
prevent A from happening. The many approaches will have all the problems (2)
that we are already aware of (and can deal with), plus some problems of
their own (3), plus the problem (4) of diversity when it becomes necessary
to deal with all of the problems (2) and (3).
As for parsimony, I would call it parsimony to deal with only one problem at
a time. And the priority at this point should be to help people make a
smooth transition to UTF-8. Making sure that all the data is correctly
converted should of course be considered, but only as long as it does not
obstruct the first goal.
>
> By the way, just to turn the screw here a little bit, how would legacy
> software that uses U+FFFD correctly for dealing with unavailable
> conversions be supposed to react when it comes across new GARBAGE
> CONVERSION BYTE characters that were undefined when it was written?
> How do you expect unaware conversion implementations to deal with your
> mechanism for maintaining convertibility for older software unable to
> deal with new data streams? Right -- it won't handle it correctly, and
> your garbage convertibility hints will be garbaged away, and you still
> can't get your roundtrip garbage.
I am not sure that I understand your point here. In order to get a
roundtrip (where and if I want it), I need 128 *legal* codepoints (as
opposed to UTF-8B, which uses illegal codepoints, a.k.a. unpaired
surrogates). Any software that chooses to (or happens to) treat these
codepoints as regular characters may do so. I am not saying that this data
will roundtrip safely in every situation. If an old converter is used on
input or output, it will break that. So will a new converter that is
configured (well, given a parameter) to behave in the old way. So will an
application that itself filters the Unicode stream to reject (certain or
all) sequences of BYTE SUBSTITUTION CHARACTERS for security reasons.
Note, however, that I am primarily interested in roundtrips where the data
takes the same path there and back. So, the following:
8 -new-> 16 -old-> 8 -old-> 16 -new-> 8
DOES roundtrip.
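Using the placeholder decode_preserving/encode_preserving sketch from
above, the same-path case can be simulated like this; the 'old' legs in the
middle are plain converters, which simply pass the (legal) substitution
codepoints through untouched:

    raw  = b"name \xfc\xdf end"               # contains two invalid bytes
    u16  = decode_preserving(raw)             # 8 -new-> 16
    u8   = u16.encode("utf-8")                # 16 -old-> 8  (plain, unaware converter)
    u16b = u8.decode("utf-8")                 # 8 -old-> 16  (plain, unaware converter)
    assert encode_preserving(u16b) == raw     # 16 -new-> 8  restores the original bytes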
And, just to turn the screw here a little bit ;), I think that everybody
agrees that UTF-8B *can* be used, as long as the data is kept 'internal'.
That's fine if you are thinking about editors. But what if I use a database
to store the data? My internal data is all of a sudden external data to the
database, which is free to reject it or corrupt it. BYTE SUBSTITUTION
CHARACTERS, however, would be handled correctly. If the database is
exported to a UTF-8 text file, I don't really care whether its converter
uses the old or the new conversion approach. In either case, I will get my
BYTE SUBSTITUTION CHARACTERS back if the same conversion approach is used
on import. And I can get my original byte stream back when I use the new
conversion approach to feed the data back to the UNIX system.
Lars Kristan