Re: UTF-8 Corrigendum, new Glossary

From: Mark Davis (mark@macchiato.com)
Date: Thu Nov 30 2000 - 10:12:37 EST


We know of specific situations that caused problems, as outlined in the
Corrigendum.

1. Process A performs security checks, but does not check for non-shortest
   forms.
2. Process B accepts the byte sequence from Process A, and transforms it
   into UTF-16 while interpreting non-shortest forms.
3. The UTF-16 text may then contain characters that should have been
   filtered out by Process A.
4. Process C interprets the text, and does something bad.

The specific case involved the sequence "..\", which was "hidden" in a
non-shortest form. Process A missed it. After the conversion, Process C
interpreted it, and executed a program in a higher-level directory that the
client should not have had access to. While a correctly written set of
programs would not fall prey to this problem, the UTC decided that given the
real-world situations it would be better to close off that avenue.
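
To make the scenario concrete, here is a minimal sketch in Python. The
function names, the lenient decoder, and the payload are all hypothetical
illustrations, not the actual programs involved in the incident. The sketch
assumes Process A scans the raw bytes for "..\" and that Process B decodes
1- to 3-byte sequences without rejecting non-shortest forms.

```python
def process_a_is_safe(raw: bytes) -> bool:
    """Hypothetical Process A: looks for "..\\" in the raw bytes only."""
    return b"..\\" not in raw


def lenient_decode(data: bytes) -> str:
    """Hypothetical Process B: decodes 1- to 3-byte UTF-8 sequences and
    accepts non-shortest forms (what the corrigendum now forbids)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, length = lead, 1
        elif 0xC0 <= lead <= 0xDF:
            cp, length = lead & 0x1F, 2
        elif 0xE0 <= lead <= 0xEF:
            cp, length = lead & 0x0F, 3
        else:
            raise ValueError(f"unsupported lead byte 0x{lead:02X}")
        trail = data[i + 1:i + length]
        if len(trail) != length - 1 or any((t & 0xC0) != 0x80 for t in trail):
            raise ValueError("missing or malformed trail byte")
        for t in trail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(chr(cp))
        i += length
    return "".join(out)


# "." is U+002E and "\" is U+005C; their non-shortest 2-byte forms are
# C0 AE and C1 9C. The file name "secret" is an arbitrary placeholder.
payload = b"\xC0\xAE\xC0\xAE\xC1\x9C" + b"secret"

passed_check = process_a_is_safe(payload)   # Process A sees nothing wrong
decoded = lenient_decode(payload)           # Process B recovers "..\secret"

# A decoder that follows the corrigendum rejects the payload outright.
try:
    payload.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

print(passed_check, repr(decoded), strict_rejected)
```

The byte-level check passes because the bytes 2E 2E 5C never appear, yet the
decoded text begins with "..\". Rejecting non-shortest forms at decode time
removes the mismatch between what Process A inspects and what Process C
eventually sees.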

So what about interpreting a surrogate pair encoded in UTF-8 as two separate
3-byte sequences? (For example, interpreting the UTF-8 sequence <ED A0 80 ED
B0 80> as UTF-16 <D800 DC00> (equivalently as UTF-32 <00010000>)). It is
still permissible according to the conformance rules to interpret such UTF-8
sequences, although not to generate them. The Unicode Technical Committee
has debated this last issue at length, but has not made a final decision
about how to deal with it. The matter is complicated by the widespread
practice of actually generating those types of sequences in older software.
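
The arithmetic behind that example can be checked with a short Python sketch.
It relies on Python's "surrogatepass" error handler purely as a convenient
way to imitate a decoder that accepts such sequences; the variable names are
illustrative, and nothing here is meant as a statement of what conformant
software must do.

```python
# The six bytes from the example: two 3-byte sequences, one per surrogate.
six_bytes = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# A strict UTF-8 decoder refuses encoded surrogates.
try:
    six_bytes.decode("utf-8")
    strict_accepts = False
except UnicodeDecodeError:
    strict_accepts = True and False  # stays False: the decode was rejected

# "surrogatepass" imitates a lenient interpreter: each 3-byte sequence
# becomes one UTF-16 code unit.
units = [ord(ch) for ch in six_bytes.decode("utf-8", errors="surrogatepass")]
high, low = units  # 0xD800, 0xDC00

# Standard surrogate-pair arithmetic gives the supplementary code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The shortest-form UTF-8 for the same character is a single 4-byte sequence.
four_bytes = chr(code_point).encode("utf-8")

print([f"{u:04X}" for u in units], f"{code_point:08X}", four_bytes.hex(" "))
```

So a lenient interpreter maps <ED A0 80 ED B0 80> to the same character that
a generator is required to emit as <F0 90 80 80>.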

Mark

----- Original Message -----
From: "G. Adam Stanislav" <adam@whizkidtech.net>
To: "Unicode List" <unicode@unicode.org>
Sent: Wednesday, November 29, 2000 22:42
Subject: Re: UTF-8 Corrigendum, new Glossary

> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified the definition of UTF-8 to
> >forbid conformant implementations from interpreting non-shortest forms for
> >BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I can see
> and agree to. But interpretation? I understand the reasoning when security
> is an issue. But why make it flat illegal? There are many applications
> where such a sequence poses no security danger.
>
> Whatever happened to the ancient "abusus non tollit usum" principle? Looks
> like Big Brother to me...
>
> Adam


