Re: UTF-8 Corrigendum, new Glossary

From: Kenneth Whistler
Date: Thu Nov 30 2000 - 20:37:19 EST

Adam said:

> On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
> >you are free to write and use a non-conformant implementation. just be aware of what that means... :-)
> >markus
> I guess it means I'm a non-conformist. :)
> I am currently working on software that translates mark-up made in one
> mark-up language (Ister) into another (HTML). It
> uses UTF-8, and works as CGI, i.e., generates HTML dynamically on a web
> server (see for unfinished docs).
> If the source (in Ister) uses illegal but decipherable UTF-8, my
> software accepts it. Naturally, before it sends it out it transforms
> it to perfectly legal UTF-8. The idea I should reject it is silly
> (and, no, the "internal data" clause does not apply here: my software
> accepts data from an external source).

Basically, you've already answered your own question. If you recognize
that your source data is "illegal UTF-8", and if you know that you
are passing it into a controlled environment where it does not
pose a security risk, then effectively you can have one layer that
unmangles the "decipherable" but illegal UTF-8, and passes it to the
layer that interprets legal UTF-8.
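
To make the two-layer idea concrete, here is a rough sketch in Python. It is
only an illustration under my own assumptions, not anything taken from Adam's
software: the names decode_lenient and repair_utf8 are hypothetical, and the
sketch handles sequences of up to four bytes only. The first layer explicitly
accepts non-shortest forms and the second step re-encodes to legal UTF-8.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, additionally accepting overlong (non-shortest) forms.

    Hypothetical helper: raises ValueError for anything not decipherable.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick the sequence length and the payload bits from the lead byte.
        if b0 < 0x80:
            cp, length = b0, 1
        elif 0xC0 <= b0 <= 0xDF:
            cp, length = b0 & 0x1F, 2
        elif 0xE0 <= b0 <= 0xEF:
            cp, length = b0 & 0x0F, 3
        elif 0xF0 <= b0 <= 0xF7:
            cp, length = b0 & 0x07, 4
        else:
            raise ValueError(f"undecipherable lead byte 0x{b0:02X} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Fold in the continuation bytes; no shortest-form check on purpose.
        for b in data[i + 1 : i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte in sequence at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: U+{cp:04X}")
        out.append(chr(cp))
        i += length
    return "".join(out)


def repair_utf8(data: bytes) -> bytes:
    """Layer 1 (lenient decode) feeding layer 2 (strict, legal re-encode)."""
    return decode_lenient(data).encode("utf-8")
```

Everything downstream of repair_utf8 then sees only legal UTF-8, which is the
explicit, above-board arrangement described here.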

As long as this is above board and explicit, then you should be o.k.
What is dangerous is a conversion process that silently and
indiscriminately interprets non-shortest-form UTF-8 in an
uncontrolled environment.
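
A small Python illustration of why that is dangerous, offered as a sketch
rather than a description of any real system: the filter naive_is_safe and the
payload are made up for the example. A check that runs on the raw bytes looks
for the "/" byte, while a later, silently lenient conversion would turn the
two-byte form C0 AF into that same character.

```python
def naive_is_safe(raw_path: bytes) -> bool:
    """Hypothetical byte-level filter applied BEFORE decoding."""
    return b"/" not in raw_path


# Non-shortest two-byte form of U+002F '/'.
OVERLONG_SLASH = b"\xc0\xaf"
payload = b".." + OVERLONG_SLASH + b"etc"

# The filter sees no '/' byte, so it lets the payload through.
filter_passed = naive_is_safe(payload)

# What a decoder that ignores the shortest-form rule would compute
# from the two bytes: 5 payload bits from the lead, 6 from the trail.
lead, trail = OVERLONG_SLASH
lenient_code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decoder_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(filter_passed)                    # filter was fooled
print(chr(lenient_code_point))          # the character a lenient decoder yields
print(strict_decoder_rejects(payload))  # a conformant decoder refuses it
```

The filter and the lenient decoder disagree about what the bytes mean, and
that disagreement is the security hole.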

> Rejecting it would mean
> that if the web page designer used some design software that messed
> up the UTF-8 encoding, the web page would suddenly miss a letter here,
> a letter there. Not rejecting it poses no security risk, so, for this
> specific application it is better to accept it (and correct it) than
> to reject it.

As I read it, this would fall under the mangled text note.

The "internal" note is referring to functions that don't *check* for
illegal code unit sequences. Those are not conformant unless being
used on certifiably legal data. But if your function is checking and
catching illegal code unit sequences explicitly, you can fix them
and proceed, as long as you know you are not part of a process pipeline
that could lead to a security problem by doing so.
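
For the "checking and catching explicitly" part, one possible shape in Python
is sketched below. The function name find_illegal_spans is my own invention
for illustration; the only library behaviour relied on is that the strict
UTF-8 decoder raises UnicodeDecodeError carrying start and end offsets.

```python
def find_illegal_spans(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of sequences the strict decoder rejects.

    Hypothetical helper: it only reports problems. Whether to fix or
    reject them is left as an explicit decision for the caller.
    """
    spans = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")  # strict by default
            break  # the rest of the input is legal
        except UnicodeDecodeError as err:
            start = pos + err.start
            # Always advance by at least one byte so the loop terminates.
            end = pos + max(err.end, err.start + 1)
            spans.append((start, end))
            pos = end
    return spans


def is_certifiably_legal(data: bytes) -> bool:
    """True only when no illegal code unit sequence was found."""
    return not find_illegal_spans(data)
```

A function that skips this kind of check is the case the "internal" note is
about: it is only appropriate on data already known to be legal.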

The point of the UTF-8 corrigendum was not to force people to do
unreasonable things with their software, but rather to tighten up
the definition sufficiently so that people could claim secure
implementations of UTF-8.

