Run assemble on each MPI process individually and broadcast vectors

Hi everybody,

The problem I am dealing with involves 400+ equations and is therefore quite a challenge to optimize. Part of the optimization is to pre-assemble the parts of the equations that are constant and can thus be reused in subsequent iterations. Even though this improved the solution time significantly, I still feel that these precomputations could be done more efficiently.
FEniCS does the assembly in parallel, but since I have a lot of equations and a relatively small number of dofs, communication seems to dominate the process and causes poor scaling with additional MPI processes.
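
To make the pre-assembly more concrete, this is roughly the pattern I use (a minimal Poisson-style sketch, not my actual 400+ equation system):

from dolfin import *

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "P", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Function(V)

# constant part: assembled once, outside the iteration loop
K = assemble(inner(grad(u), grad(v))*dx)

for n in range(100):
    # only the changing part is reassembled in every iteration
    b = assemble(f*v*dx)
    # ... apply BCs, solve K*x = b, update f ...
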
I wondered if there is a way to do parts of the assembly on the entire domain and later distribute the result back to each process, like in this dummy code I wrote with numpy arrays (which I could not manage to get working with PETScVectors):

from dolfin import *
from mpi4py import MPI as pyMPI
import numpy as np

# MPI communicator, number of processes and rank of this process
comm = MPI.comm_world
size = comm.size
rank = comm.rank

#arbitrary array size
p=10

# chunk size per MPI process
chunk = int(np.ceil(p/size))

#create array on root 
array = None
if rank == 0:
    array = np.ones(size*chunk, dtype='d')*-1

# split 'tasks' roughly equally among MPI processes
foo = np.array_split(range(p),size)[rank]

#init buffer
bar = np.ones(chunk, dtype='d')*-1

# do stuff locally on each process
for i,v in enumerate(foo):
    bar[i] = rank  #just to see some result

# gather on root node
comm.Gather(bar, array, root=0)

#drop uninitialized data from each chunk
if rank == 0: array = np.asarray([v for v in array if v != -1])
else:         array = np.empty(p, dtype='d')

#broadcast trimmed array back to each thread
comm.Bcast(array, root=0)

print(rank, array.T)

I am fully aware that the whole purpose of MPI here is to distribute the mesh across the processes, but I just wondered if there is any way to get this pattern to work with PETScVectors.
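
In dolfin terms, what I imagine is roughly the following (an untested sketch; I am assuming that get_local() returns only the locally owned entries and that the ownership ranges are contiguous and ordered by rank):

from dolfin import *
import numpy as np

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "P", 1)
v = TestFunction(V)

b = assemble(Constant(1.0)*v*dx)   # distributed PETScVector
comm = mesh.mpi_comm()             # mpi4py communicator

# collect the locally owned entries of every rank on every rank;
# concatenating in rank order should give the full vector in the
# parallel global dof numbering
full = np.concatenate(comm.allgather(b.get_local()))

There also seems to be b.gather_on_zero() for collecting everything on process 0, but I did not get any further than that.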

If anybody has an idea, I would be happy to read it. :slight_smile:

Greetings,
slydex

Why not just assemble what you don’t want to distribute on COMM_SELF and broadcast it whenever you need it thereafter? Most dolfin classes take an MPI communicator as an argument.
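
Something along these lines (an untested sketch with the 2018+ dolfin API; the mesh, space and form are just placeholders):

from dolfin import *

# build the mesh on MPI.comm_self, so every rank holds the full domain
# and assembles the full vector itself, without any communication
mesh = UnitSquareMesh(MPI.comm_self, 32, 32)
V = FunctionSpace(mesh, "P", 1)
v = TestFunction(V)

b = assemble(Constant(1.0)*v*dx)   # serial assembly on each rank
values = b.get_local()             # full vector available everywhere

If the assembly itself is expensive, you could instead do it only on rank 0 and broadcast b.get_local() over comm_world. Keep in mind that the dof ordering of the comm_self space is the serial one, so you may have to map it to the ordering of your distributed problem.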