Run assemble on each MPI process individually and broadcast vectors

Hi everybody,

The problem I am dealing with involves 400+ equations and is therefore quite a challenge to optimize. Part of the optimization is to pre-assemble the parts of the equations that are constant and can thus be reused in subsequent iterations. Even though this improved the solution time significantly, I still feel that these precomputations could be done more efficiently.
FEniCS does the assembly in parallel, but since I have a lot of equations and a relatively small number of dofs, communication seems to dominate the process and causes poor scaling with additional MPI processes.
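
To make the pre-assembly more concrete, this is roughly the pattern I use (a minimal Poisson-style sketch, not my actual 400+ equation system):

from dolfin import *

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "P", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Function(V)

# constant part: assembled once, outside the iteration loop
K = assemble(inner(grad(u), grad(v))*dx)

for n in range(100):
    # only the changing part is reassembled in every iteration
    b = assemble(f*v*dx)
    # ... apply BCs, solve K*x = b, update f ...
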
I wondered if there is a way to do parts of the assembly on the entire domain and later distribute the result back to each process, like in this dummy code I wrote with numpy arrays (which I could not manage to get working with PETScVectors):

from dolfin import *
from mpi4py import MPI as pyMPI
import numpy as np

# MPI communicator, number of processes and rank of this process
comm = MPI.comm_world
size = comm.size
rank = comm.rank

#arbitrary array size
p=10

# chunk size per MPI process
chunk = int(np.ceil(p/size))

#create array on root 
array = None
if rank == 0:
    array = np.ones(size*chunk, dtype='d')*-1

# split 'tasks' roughly equally among MPI processes
foo = np.array_split(range(p),size)[rank]

#init buffer
bar = np.ones(chunk, dtype='d')*-1

# do stuff locally on each process
for i,v in enumerate(foo):
    bar[i] = rank  #just to see some result

# gather on root node
comm.Gather(bar, array, root=0)

#drop uninitialized data from each chunk
if rank == 0: array = np.asarray([v for v in array if v != -1])
else:         array = np.empty(p, dtype='d')

#broadcast trimmed array back to each thread
comm.Bcast(array, root=0)

print(rank, array.T)

I am fully aware that the whole purpose of MPI here is to distribute the mesh across the processes, but I just wondered if there is any way to get this pattern to work with PETScVectors.
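
In dolfin terms, what I imagine is roughly the following (an untested sketch; I am assuming that get_local() returns only the locally owned entries and that the ownership ranges are contiguous and ordered by rank):

from dolfin import *
import numpy as np

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "P", 1)
v = TestFunction(V)

b = assemble(Constant(1.0)*v*dx)   # distributed PETScVector
comm = mesh.mpi_comm()             # mpi4py communicator

# collect the locally owned entries of every rank on every rank;
# concatenating in rank order should give the full vector in the
# parallel global dof numbering
full = np.concatenate(comm.allgather(b.get_local()))

There also seems to be b.gather_on_zero() for collecting everything on process 0, but I did not get any further than that.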

If anybody has an idea, I would be happy to read it. :slight_smile:

Greetings,
slydex

Why not just assemble what you don’t want to distribute on COMM_SELF and broadcast it whenever you need it thereafter? Most dolfin classes take an MPI communicator as an argument.
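
Something along these lines (an untested sketch with the 2018+ dolfin API; the mesh, space and form are just placeholders):

from dolfin import *

# build the mesh on MPI.comm_self, so every rank holds the full domain
# and assembles the full vector itself, without any communication
mesh = UnitSquareMesh(MPI.comm_self, 32, 32)
V = FunctionSpace(mesh, "P", 1)
v = TestFunction(V)

b = assemble(Constant(1.0)*v*dx)   # serial assembly on each rank
values = b.get_local()             # full vector available everywhere

If the assembly itself is expensive, you could instead do it only on rank 0 and broadcast b.get_local() over comm_world. Keep in mind that the dof ordering of the comm_self space is the serial one, so you may have to map it to the ordering of your distributed problem.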