Parallelization with multiple solves

Dear all,

I am trying to solve an optimization problem of the type:

\min\limits_{a} J(a,u_1, u_2, \ldots, u_N) = \sum_{i=1}^{N} J_i(a, u_i)

where a is a field and each J_i comes from solving an independent PDE for u_i.

The ideal approach would be to solve these PDEs in parallel, but I am having trouble making dolfin-adjoint generate the tape correctly.

Here is my attempt at producing a minimal working example:

""" minimal working example: figuring out parallelism in dolfin-adjoint """

from fenics import *
from fenics_adjoint import *

comm = MPI.comm_world

mesh = UnitSquareMesh(MPI.comm_self, 64, 64)
V = FunctionSpace(mesh, 'CG', 1)
v = TestFunction(V)
bc = DirichletBC(V, 0.0, "on_boundary")
d = Expression("2*pow(pi, i)*sin(pi*x[i])",i=0, degree=3)
f = interpolate(Expression('x[0]+x[1]', degree=1), V)

def forward(g, i):
    u = Function(V)
    F = (inner(grad(u), grad(v)) - g * v) * dx
    solve(F == 0, u, bc)
    J = assemble((0.5*inner(u - d, u - d))*dx + 1e-6*f**2*dx)
    return J

if comm.size == 1:
    J = forward(f, 0) + forward(f, 1)

    if MPI.rank(comm)==0:
        Jp = forward(f, 0)
    elif MPI.rank(comm)==1:
        Jp = forward(f, 1)

    data = comm.gather(Jp, root=0)
    if MPI.rank(comm)==0:
        J = sum(data)

tape = get_working_tape()

Looking at the tape generated, using 1 processor I get:

Using two processors, however, the main problem seems to be that only what comes from one of the processors is “visible” to the tape:

So how should I proceed in order to run the optimization with many calls to “solve()” in parallel?

Thank you for the consideration.