RE: Missing values in mapping-tables?

From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Mar 18 2002 - 05:22:42 EST


Again - 'invalid data' and 'garbage'. That's because you're thinking of old
data with the old definition. How about new data and old software?

Your approach means that if a new character is defined in, say, ISO 8859-8,
then all old software should report it as an error. And all users must
upgrade. When (and if!) an update is available.
My approach would mean that old software would not properly display (nor
collate) the new character, but it would not reject the data. Recognizing
what the character actually was is not that hard and is something that many
of us did for years. And if the data is eventually converted back to the
same codeset, using the same (old) mapping table, the original data is
preserved.

There are two approaches: A - detecting errors as early as possible, and
B - gracefully handling the data as long as possible. Both have their
benefits. I am very much in favor of the first, but sometimes it is simply
not possible to use that approach. Once you admit that people are entitled
to choose the second approach (depending on their needs), then it is useful
to have the behavior defined for it.

OK, another way of looking at all this. I believe you would accept three
options:
A - Reject the stream.
B - Drop the invalid data.
C - Replace the invalid characters with U+FFFD (the replacement character).
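For concreteness, these three options correspond directly to Python's
built-in codec error handlers (just a sketch; 0xBF is one of the undefined
bytes in ISO 8859-8):

```python
# The three options expressed with Python's standard error handlers.
raw = b'ab\xbfcd'   # 0xBF is undefined in ISO 8859-8

# A - reject the stream:
try:
    raw.decode('iso-8859-8', errors='strict')
except UnicodeDecodeError as e:
    print('rejected:', e.reason)

# B - drop the invalid data:
print(raw.decode('iso-8859-8', errors='ignore'))   # -> 'abcd'

# C - replace the invalid characters with U+FFFD:
print(raw.decode('iso-8859-8', errors='replace'))  # -> 'ab\ufffdcd'
```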

Then my proposal could be viewed as an addition to option C, with one
difference. Instead of one replacement character, I propose to have 256
(though in most cases only 128 would be used). Now, what does that violate?

Lars Kristan

> -----Original Message-----
> From: Doug Ewell [mailto:dewell@adelphia.net]
> Sent: Saturday, March 16, 2002 06:59
> To: unicode@unicode.org
> Cc: Lars Kristan
> Subject: Re: Missing values in mapping-tables?
>
>
> Lars Kristan <lars.kristan@hermes.si> wrote:
>
> > Suppose ISO 8859-8 is ever upgraded (even if not likely, but - for the
> > sake of argument). One might say that it would be bad to change an
> > existing definition in the table, e.g. for 0xBF from 0x2DBF to 0x20AC.
> > But how is that worse than changing it from <undefined> to 0x20AC?
> > I think it is actually better, since you can never guess what will be
> > implemented for <undefined>. "Throw an exception" is what I keep seeing
> > in these discussions. Who will catch it? The secretary on the third
> > floor?
>
> "Defining" undefined code points to be something they aren't is not a
> Good Thing. Even if ISO 8859-8 were updated at some time in the future,
> with new code points being added, the old data that was created with the
> old 8859-8 would still contain invalid data.
>
> > If the mapping for undefined values were 0xhh -> 0x2Dhh, then there
> > would be a consistent definition of what to do if somebody wants to do
> > something other than throwing things out the window. Consequently,
> > there would be a better chance of being able to repair inadvertently
> > processed data at some later time.
>
> It's not repairable, because it contained garbage.
>
> -Doug Ewell
> Fullerton, California



This archive was generated by hypermail 2.1.2 : Mon Mar 18 2002 - 04:54:18 EST