Hi everybody,
The problem I am dealing with involves 400+ equations and is therefore quite a challenge to optimize. Part of the current optimization is to pre-assemble those parts of the equations that are constant, so they can be reused in subsequent iterations. Even though this improved the solution time significantly, I still feel that these precomputations could be done more efficiently.
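For context, this is roughly what I mean by pre-assembling the constant part; the form below is just a simple placeholder, not one of my actual equations:

from dolfin import *

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Function(V)                      # data that changes every iteration

a_const = (inner(grad(u), grad(v)) + u * v) * dx   # constant part of the equation
L_var = f * v * dx                                 # part that changes each iteration

A = assemble(a_const)                # assembled once, before the iteration loop
uh = Function(V)
for it in range(3):
    f.vector()[:] = float(it + 1)    # dummy update of the changing data
    b = assemble(L_var)              # only the changing part is reassembled
    solve(A, uh.vector(), b)         # reuse the precomputed matrix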
FEniCS does the assembly in parallel, but since I have a lot of equations and a relatively small number of DOFs, the communication seems to dominate the process and causes poor scaling with additional MPI processes.
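This is roughly how I look at the scaling, again with a placeholder form; I simply time the assembly and compare runs with different numbers of ranks:

import time
from dolfin import *

comm = MPI.comm_world
mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
a_const = (inner(grad(u), grad(v)) + u * v) * dx

comm.Barrier()                       # line all ranks up before timing
t0 = time.perf_counter()
A = assemble(a_const)
comm.Barrier()
t1 = time.perf_counter()
if comm.rank == 0:
    print("assembly wall time on", comm.size, "ranks:", t1 - t0)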
I wondered if there is a way to do parts of the assembly on the entire domain and later distribute the result back to each process, like in this dummy code I wrote with NumPy arrays (which I couldn't manage to get working with PETScVectors):
from dolfin import *
from mpi4py import MPI as pyMPI
import numpy as np

# MPI stuff
comm = MPI.comm_world
size = comm.size
rank = comm.rank

# arbitrary array size
p = 10
# chunk size per MPI process
chunk = int(np.ceil(p / size))

# create the receive array on the root only
array = None
if rank == 0:
    array = np.ones(size * chunk, dtype='d') * -1

# split the 'tasks' equally among the MPI processes
foo = np.array_split(range(p), size)[rank]

# init the send buffer, padded with -1
bar = np.ones(chunk, dtype='d') * -1

# do stuff locally on each process
for i, v in enumerate(foo):
    bar[i] = rank  # just to see some result; 'v' would be the task index in a real computation

# gather all chunks on the root
comm.Gather(bar, array, root=0)

# drop the uninitialized -1 padding from each chunk
if rank == 0:
    array = np.asarray([v for v in array if v != -1])
else:
    array = np.empty(p, dtype='d')

# broadcast the trimmed array back to every process
comm.Bcast(array, root=0)
print(rank, array.T)
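I run this with e.g. mpirun -n 4 python3 dummy.py, and after the broadcast every rank prints the full, trimmed array.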
I am fully aware that the whole purpose of MPI here is to distribute the mesh across the processes, but I just wondered if there is any way to get this pattern to work with PETScVectors.
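For what it's worth, the closest I got with an actual dolfin vector looks roughly like the sketch below, using get_local()/set_local() and mpi4py's allgather; whether the apply() call and the ordering of the gathered chunks are right is exactly the part I am unsure about:

from dolfin import *
import numpy as np

comm = MPI.comm_world
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
uh = Function(V)

local = uh.vector().get_local()                        # local part of the vector as a numpy array
uh.vector().set_local(np.full_like(local, comm.rank))  # 'do stuff' locally, like above
uh.vector().apply("insert")

# gather the local chunks of every rank (they can have different sizes) ...
chunks = comm.allgather(uh.vector().get_local())
# ... and glue them together; whether this matches the global dof ordering
# is one of the things I am not sure about
full = np.concatenate(chunks)
print(comm.rank, full.shape)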
If anybody has an idea, I would be happy to read it.
Greetings,
slydex