MPI communicators remain blocked

Hi everybody,
I ran into an issue where I run out of MPI communicators after solving a PDE problem millions of times. I think I have boiled it down to the MWE you find below.
The MWE illustrates that over time more and more communicators are blocked, rather than just the one needed for the function space V.

First, we block 2000 communicators so that we do not need as many runs. Then, for different values of runs, we obtain (on a fresh process each time):
runs = 10, no error
runs = 50, no error
runs = 100, error at j=92: RuntimeError: *** Error: Duplication of MPI communicator failed

Changing the variable comms shifts the value of j at which the error occurs.

I hope you can help me figure out how to free the communicators again so that this error does not occur.

Best regards
Cedric

MWE:

from mpi4py import MPI
import fenics as fn

comms = 2000
runs = 10 #50 #100

# Block some communicators beforehand to see the problem faster
comm = MPI.COMM_WORLD
for i in range(comms):
    print(i, comm.Dup())

# Create a function space, which needs an MPI communicator; the previous one should be overwritten each iteration
for j in range(runs):
    print(j)
    V = fn.FunctionSpace(fn.UnitIntervalMesh(1), 'CG', 1)

Is there a reason you don’t want to solve the problem on COMM_SELF on each of the available processes?

Thanks for your reply! No, there is no particular reason. I am not really familiar with MPI and don’t intend to use it beyond what FEniCS uses internally, which is what leads to the error I am trying to resolve.

When I change the above code to use COMM_SELF:

from mpi4py import MPI
import fenics as fn

comms = 2000
runs = 1000

# Block some communicators beforehand to see the problem faster
comm_self = MPI.COMM_SELF
for i in range(comms):
    print(i, comm_self.Dup())

# Create a function space, which needs an MPI communicator; the previous one should be overwritten each iteration
for j in range(runs):
    print(j)
    V = fn.FunctionSpace(fn.UnitIntervalMesh(1), 'CG', 1)

then I run into the same problem: it terminates at some j with the same error:
RuntimeError: *** Error: Duplication of MPI communicator failed

Perhaps you should consider a tutorial.

Also consider the following in the context of FEniCS:

from mpi4py import MPI
import dolfinx

# Mesh on COMM_WORLD distributed over all processes
mesh_world = dolfinx.mesh.create_unit_interval(MPI.COMM_WORLD, 10)
print(f"proc {mesh_world.comm.rank}: mesh_world number of "
	f"cells on process: {mesh_world.topology.index_map(1).size_local}")

# Mesh on COMM_SELF local to each process
mesh_self = dolfinx.mesh.create_unit_interval(MPI.COMM_SELF, 10)
print(f"proc {mesh_self.comm.rank}: mesh_self number of "
	f"cells on process: {mesh_self.topology.index_map(1).size_local}")

which yields for me using 3 MPI processes, e.g., mpirun -np 3 python3 script.py

proc 0: mesh_world number of cells on process: 3
proc 0: mesh_self number of cells on process: 10
proc 1: mesh_world number of cells on process: 3
proc 0: mesh_self number of cells on process: 10
proc 2: mesh_world number of cells on process: 4
proc 0: mesh_self number of cells on process: 10
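
If you want the same behavior with the legacy fenics/dolfin interface used in your snippet, a minimal sketch would look like the following (assuming your DOLFIN version's mesh constructors accept a communicator as their first argument, as 2019.1.0 does):

from dolfin import MPI
import fenics as fn

# Passing MPI.comm_self makes the mesh (and everything built on it, such as
# the FunctionSpace) local to each process, so every rank can work on an
# independent problem.
mesh_self = fn.UnitIntervalMesh(MPI.comm_self, 10)
V = fn.FunctionSpace(mesh_self, 'CG', 1)
print(MPI.rank(MPI.comm_world), mesh_self.num_cells())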

Thanks for sending me the tutorial and your example.
I will work through it to better my understanding of MPI.

Is there an explanation/reason for why the code snippet above terminates with the indicated error when generating function spaces?

In my understanding, when I “block” 2000 communicators on the process, the second for loop creating the function spaces should be able to run indefinitely, because it only needs one communicator for the function space, which should be reused in every iteration of the loop. What is going wrong there, causing the second for loop to terminate before it reaches the end?

When I understand why this MWE throws the error, then I might be able to fix the problem I have in the first place.

The problem is Python’s garbage collection. It does not necessarily destroy objects (and the underlying communicators) when going out of scope or even when explicitly deleting objects with the del command.
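
A minimal sketch of one possible mitigation, based on the MWE above (the assumption being that the leaked communicators are held by DOLFIN objects the garbage collector has not yet reclaimed, so this may not be sufficient in every situation):

import gc
import fenics as fn

runs = 1000
for j in range(runs):
    mesh = fn.UnitIntervalMesh(1)
    V = fn.FunctionSpace(mesh, 'CG', 1)
    # ... assemble and solve here ...
    del V, mesh   # drop the Python references
    gc.collect()  # force a collection so the underlying communicators can be released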

As Nate explained, it might be favorable to use MPI.COMM_SELF and using multiple processes with mpirun when executing your code.

If you could provide more details as to why you are solving a PDE millions of times on different meshes, or what you are parameterizing over, there might be ways to circumvent the issue.
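
As a rough illustration of that suggestion, here is a sketch only (with hypothetical sample logic, not tested against your code, and assuming the dolfin constructors accept an mpi4py communicator directly): let mpirun supply the parallelism over samples, build each mesh on COMM_SELF so it stays local to its process, and combine the partial sums with an allreduce at the end.

from mpi4py import MPI
import fenics as fn

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

N_total = 1000                           # total number of samples
local_sum = 0.0
for i in range(rank, N_total, size):     # round-robin distribution of samples over ranks
    # Mesh local to this process; each rank solves its samples independently
    mesh = fn.UnitIntervalMesh(MPI.COMM_SELF, 10)
    V = fn.FunctionSpace(mesh, 'CG', 1)
    # ... set up and solve the sample-dependent problem here, then add
    # e.g. fn.norm(sol, 'H1') instead of the placeholder below ...
    local_sum += 0.0                     # placeholder contribution

MC_estimate = comm.allreduce(local_sum, op=MPI.SUM) / N_total
if rank == 0:
    print("Monte Carlo estimate:", MC_estimate)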


Thank you very much for your answer. Good to know that it is a Python garbage-collection issue.

As part of a Multilevel Monte Carlo method, I am running a Monte Carlo simulation to estimate, e.g., the H1 norm of random PDE solutions.

The MWE below solves the Poisson equation on a unit square domain with a discontinuous coefficient given by a vertical line discontinuity at a random location x = random.uniform(0.2, 0.8). Left of the line the coefficient value is 0.1; right of the line the value is 10.

The mesh is adapted to the discontinuity, so for each PDE solve I have a different underlying mesh.

I need a very small error in my simulation and thus have to compute a lot of solution samples (in the millions).

In my code I am parallelizing over the samples on 6 CPUs with multiprocessing (mp.pool.starmap). This is not the best choice, since for large numbers of degrees of freedom it interferes with the parallelization inside fenics.solve, but it still reduces my computation time.

When using mpirun with this MWE, it does not work for the adapted mesh (Error: Unable to extract local mesh data.), and with a non-adapted UnitSquareMesh it simply executes the same code 6 times.
I will have to rewrite my code so that it parallelizes over the samples with mpirun properly.

The only other option I see right now is to split the millions of samples into smaller computable batches, save them to files, and load them later in a separate session to compute the full sum/average/estimate.

import fenics as fn
import numpy as np
import mshr
from dolfin import MPI

def rand_field_line_strict(position_x):
    tol = 10**(-14)
    return fn.Expression("x[0] < position_x + tol ? low : high", position_x=position_x, low=0.1, high=10, tol=tol, degree=0)

def create_adapted_mesh(position_x, resolution):

    polygon_mshr = mshr.Rectangle(fn.Point(0.0,0.0),fn.Point(1.0,1.0))

    polygon_mshr.set_subdomain(1, mshr.Polygon([fn.Point(0.0, 0.0),
                                                fn.Point(position_x, 0.0),
                                                fn.Point(position_x, 1.0),
                                                fn.Point(0.0, 1.0)]))

    return mshr.generate_mesh(polygon_mshr, resolution)

if __name__=="__main__":
    fn.set_log_level(30)

    # Failed at ~2700
    N_block = 1950
    N_compute = 20000
    
    # Block some MPI communicators to run into "RuntimeError" faster
    for i in range(N_block):
        print(i, MPI.comm_self.Dup())
    
    MC_sum = 0
    for i in range(N_compute):
        print(i)
        
        # Draw a uniformly distributed random number for the discontinuity position
        position_x = np.random.uniform(0.2,0.8)
    
        # Generate adapted mesh
        mesh = create_adapted_mesh(position_x, 10) 

        # Set up and solve variational problem
        V = fn.FunctionSpace(mesh, 'CG', 1)
    
        u_D = fn.Expression("0", degree=0)
        bc = fn.DirichletBC(V, u_D, 'on_boundary')
        
        a = rand_field_line_strict(position_x)
        f = fn.Constant(1.0)
        
        u = fn.TrialFunction(V)
        v = fn.TestFunction(V)
        
        lhs = fn.dot(a*fn.grad(u),fn.grad(v))*fn.dx
        rhs = f*v*fn.dx
        
        sol = fn.Function(V)
        fn.solve(lhs==rhs, sol, bc)
        
        # Compute H1 norm
        H1_norm = fn.norm(sol, 'H1')
        
        # Add to MC sum
        MC_sum += H1_norm
        
    # Average
    MC_sum /= N_compute
    
    print("The Monte Carlo estimation yields: ", MC_sum)

Output:

2655
2656
2657
Traceback (most recent call last):
  File "MC_estimation_MPI_test.py", line 59, in <module>
    sol = fn.Function(V)
  File "/home/beschlcc/.conda/envs/fenicsproject/lib/python3.8/site-packages/dolfin/function/function.py", line 216, in __init__
    self._cpp_object = cpp.function.Function(V._cpp_object)
RuntimeError:

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***     fenics-support@googlegroups.com
***
*** Remember to include the error message listed below and, if possible,
*** include a minimal running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to successfully call PETSc function 'VecCreate'.
*** Reason:  PETSc error code is: 1 ((null)).
*** Where:   This error was encountered inside /home/conda/feedstock_root/build_artifacts/fenics-pkgs_1582286059742/work/dolfin/dolfin/la/PETScVector.cpp.
*** Process: 0
***
*** DOLFIN version: 2019.1.0
*** Git changeset:  15b823f2c45d3036b6af931284d0f8e3c77b6845
*** -------------------------------------------------------------------------