RE: Missing values in mapping-tables?

From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Mar 15 2002 - 15:01:14 EST


>
> In fact, there are no characters defined in ISO 8859-8 for those code
> points. If you encounter 0xBF in text that purports to be ISO 8859-8,
> it is an error.
>
Another example showing that it would be very useful to have 128 (possibly
256) codepoints that would be reserved for such purposes.

Suppose ISO 8859-8 is ever upgraded (not likely, but suppose it for the sake
of argument). One might say that it would be bad to change an existing
definition in the table, e.g. for 0xBF from 0x2DBF to 0x20AC. But how is that
worse than changing it from <undefined> to 0x20AC?
I think it is actually better, since you can never guess what will be
implemented for <undefined>. "Throw an exception" is what I keep seeing in
these discussions. Who will catch it? The secretary on the third floor?

If the mapping for undefined values were 0xhh -> 0x2Dhh, there would be a
consistent definition of what to do for anyone who wants to do something
other than throw things out the window. Consequently, there would be a better
chance of being able to repair inadvertently processed data at some later
time.
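To make the idea concrete, here is a minimal sketch of such a decoder in Python. It assumes the post's hypothetical reserved range at U+2Dhh (this is an illustration of the proposal, not an actual Unicode assignment), and the function name is mine:

```python
def decode_with_fallback(data: bytes) -> str:
    """Decode ISO 8859-8, mapping undefined bytes 0xhh to the
    hypothetical reserved codepoint U+2Dhh instead of raising."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("iso8859-8"))
        except UnicodeDecodeError:
            # undefined 0xhh -> proposed U+2Dhh (author's reserved block)
            out.append(chr(0x2D00 + b))
    return "".join(out)
```

With such a mapping, 0xBF (undefined in ISO 8859-8) would decode to U+2DBF rather than aborting, and a later tool could still recognize and repair those codepoints.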

Yes, I am talking about the roundtrip issue again. Thanks to David Hopwood's
reply (see http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0362.html),
I am now convinced that unpaired surrogates (UTF-8B) are not a good
approach. However, the %hh mechanism has many drawbacks (like potentially 3
times longer strings after conversion). Which will lead to use of other
mechanisms, and all of them will have the same multiple representation
problem. My point is - not providing a suitable mechanism (or at least means
for it) within Unicode will not make the multiple representation problem go
away. IMHO, it would be better to accept the multiple representation problem
for a fact and try to deal with it.
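The length drawback mentioned above is easy to see in a sketch: each undecodable byte becomes a three-character escape, so a run of such bytes triples in length. The function name and the exact escape syntax are illustrative assumptions:

```python
def escape_undecodable(data: bytes, encoding: str = "iso8859-8") -> str:
    """Decode, replacing each undecodable byte with a %hh escape.
    One problem byte expands to three characters."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            out.append("%%%02X" % b)  # one byte -> three characters
    return "".join(out)
```

Note that this scheme also exhibits the multiple-representation problem the post describes: input that happens to contain a literal "%BF" is indistinguishable from an escaped 0xBF after conversion.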

I think that 128 codepoints is not such a high price to pay for what we
could do with them. True, all the things people could do with these
codepoints might raise new issues (yes, security too), but then again,
isn't it better that everybody does the dirty things in a consistent and
well-known manner? At least then you have a chance of having a single
validator find potentially problematic codepoints or sequences.

Of course, I am not saying that the definition of UTF-8 would be changed.
These reserved codepoints would merely allow a UTF-8D algorithm, which would
be somewhat simpler than UTF-8B and would produce valid UTF-16 data, thus
significantly expanding the possibilities for its use.
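A "UTF-8D"-style decoder along these lines might be sketched as follows. Well-formed UTF-8 sequences decode normally; each invalid byte is mapped to a reserved codepoint in the post's hypothetical 0xhh -> 0x2Dhh scheme, so the result contains only valid scalar values representable in UTF-16. The name and the reserved range are assumptions taken from this post:

```python
def utf8d_decode(data: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte to a hypothetical
    reserved codepoint U+2D00 + byte instead of raising."""
    out = []
    i = 0
    while i < len(data):
        # Try progressively shorter chunks starting at position i;
        # a chunk either decodes as a whole or raises.
        for n in (4, 3, 2, 1):
            chunk = data[i:i + n]
            try:
                out.append(chunk.decode("utf-8"))
                i += len(chunk)
                break
            except UnicodeDecodeError:
                continue
        else:
            # No prefix decodes: treat data[i] as an invalid byte
            # and map it into the proposed reserved range.
            out.append(chr(0x2D00 + data[i]))
            i += 1
    return "".join(out)
```

Unlike UTF-8B's unpaired surrogates, every codepoint this sketch emits is a valid scalar, so the output can pass through any conforming UTF-16 pipeline.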

Lars Kristan
Storage & Data Management Lab
HERMES SoftLab



This archive was generated by hypermail 2.1.2 : Fri Mar 15 2002 - 14:22:29 EST