Dear all,

I am trying to solve an optimization problem of the type:

\min\limits_{a} J(a,u_1, u_2, \ldots, u_N) = \sum_{i=1}^{N} J_i(a, u_i)

where a is a field and each J_i comes from solving an independent PDE for u_i.

The ideal approach would be to solve these PDEs in parallel, but I am having trouble making dolfin-adjoint generate the tape correctly.

Here is my attempt at producing a minimal working example:

```
""" minimal working example: figuring out parallelism in dolfin-adjoint """
from fenics import *
from fenics_adjoint import *
comm = MPI.comm_world
mesh = UnitSquareMesh(MPI.comm_self, 64, 64)
V = FunctionSpace(mesh, 'CG', 1)
v = TestFunction(V)
bc = DirichletBC(V, 0.0, "on_boundary")
d = Expression("2*pow(pi, i)*sin(pi*x[i])",i=0, degree=3)
f = interpolate(Expression('x[0]+x[1]', degree=1), V)
def forward(g, i):
u = Function(V)
F = (inner(grad(u), grad(v)) - g * v) * dx
solve(F == 0, u, bc)
d.i=i
J = assemble((0.5*inner(u - d, u - d))*dx + 1e-6*f**2*dx)
return J
if comm.size == 1:
J = forward(f, 0) + forward(f, 1)
else:
if MPI.rank(comm)==0:
Jp = forward(f, 0)
elif MPI.rank(comm)==1:
Jp = forward(f, 1)
data = comm.gather(Jp, root=0)
if MPI.rank(comm)==0:
J = sum(data)
tape = get_working_tape()
tape.visualise()
```

Looking at the tape generated, using 1 processor I get:

Using two processors, however, the main problem seems to be that only what comes from one of the processors is “visible” to the tape:

So how should I proceed in order to run the optimization with many calls to “solve()” in parallel?

Thank you for the consideration.