MPI communicators remain blocked

Thank you very much for your answer. Good to know that it is a Python garbage-collection problem.

As part of a Multilevel Monte Carlo simulation, I run plain Monte Carlo estimations of, e.g., the H1 norm of random PDE solutions.

The MWE below solves the Poisson equation on the unit square with a discontinuous coefficient: a vertical line discontinuity at a random location x = random.uniform(0.2, 0.8). Left of the line the coefficient is 0.1, right of the line it is 10.

The mesh is adapted to the discontinuity, so for each PDE solve I have a different underlying mesh.

I need a very small error in my simulation and thus have to compute a lot of solution samples (in the millions).

In my code I parallelize over the samples on 6 CPUs with multiprocessing (mp.Pool.starmap). This is not the best choice, since for large numbers of degrees of freedom it interferes with the parallelization inside fenics.solve, but it still reduces my computation time.
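For context, the sample-level multiprocessing follows roughly this pattern (a minimal sketch only; solve_sample is a placeholder name for the per-sample mesh generation and solve in the MWE below):

import multiprocessing as mp
import numpy as np

def solve_sample(position_x, resolution):
    # Placeholder for the per-sample work of the MWE:
    # build the adapted mesh, solve the PDE, return fn.norm(sol, 'H1')
    return 0.0

if __name__ == "__main__":
    N_compute = 20000
    positions = np.random.uniform(0.2, 0.8, N_compute)
    with mp.Pool(processes=6) as pool:
        H1_norms = pool.starmap(solve_sample, [(x, 10) for x in positions])
    print("The Monte Carlo estimation yields:", sum(H1_norms) / N_compute)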

When I run this MWE with mpirun, it does not work for the adapted mesh (Error: Unable to extract local mesh data.), and with a non-adapted UnitSquareMesh each process simply executes the same code, so the same samples are computed 6 times.
I will have to rewrite my code so that it properly parallelizes over the samples with mpirun.
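To make that rewrite concrete, here is a rough, untested sketch of what I have in mind: distribute the samples over the MPI ranks and build each mesh on MPI.comm_self so that nothing is distributed across processes. It uses a non-adapted UnitSquareMesh, since I do not know yet whether the adapted mshr mesh can be generated on a single-rank communicator.

import fenics as fn
import numpy as np
from mpi4py import MPI as pyMPI

comm = pyMPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_compute = 20000
n_local = N_compute // size            # assumes N_compute is divisible by size
np.random.seed(1234 + rank)            # naive per-rank seeding

local_sum = 0.0
for i in range(n_local):
    position_x = np.random.uniform(0.2, 0.8)
    # comm_self keeps the mesh (and all linear algebra) local to this rank;
    # a comm_self-aware replacement for the adapted mshr mesh would be needed here
    mesh = fn.UnitSquareMesh(fn.MPI.comm_self, 32, 32)
    V = fn.FunctionSpace(mesh, 'CG', 1)
    u, v = fn.TrialFunction(V), fn.TestFunction(V)
    a = fn.Expression("x[0] < position_x ? low : high",
                      position_x=position_x, low=0.1, high=10, degree=0)
    bc = fn.DirichletBC(V, fn.Constant(0.0), 'on_boundary')
    sol = fn.Function(V)
    fn.solve(fn.dot(a*fn.grad(u), fn.grad(v))*fn.dx == fn.Constant(1.0)*v*fn.dx, sol, bc)
    local_sum += fn.norm(sol, 'H1')

# Combine the partial sums from all ranks and average
MC_sum = comm.allreduce(local_sum, op=pyMPI.SUM) / N_compute
if rank == 0:
    print("The Monte Carlo estimation yields:", MC_sum)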

The only other option I see right now is to split the millions of samples into smaller, computable chunks, save the partial results to files, and combine them later in a separate session to obtain the overall sum/average/estimate.
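That chunked workaround could look roughly like this (a sketch only; the file names, chunk size, and the run_chunk/combine helpers are made up, and each chunk would run in its own session so the leaked communicators are released when the process exits):

import numpy as np

def run_chunk(chunk_id, chunk_size=2000):
    partial_sum = 0.0
    for _ in range(chunk_size):
        position_x = np.random.uniform(0.2, 0.8)
        # ... generate the adapted mesh, solve, compute fn.norm(sol, 'H1') as in the MWE ...
        H1_norm = 0.0  # placeholder for the sample's H1 norm
        partial_sum += H1_norm
    # Store the partial sum together with the sample count for this chunk
    np.save("partial_sum_%04d.npy" % chunk_id, np.array([partial_sum, chunk_size]))

def combine(n_chunks):
    # Load all partial results and return the overall average
    data = np.array([np.load("partial_sum_%04d.npy" % i) for i in range(n_chunks)])
    return data[:, 0].sum() / data[:, 1].sum()

Here is the MWE: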

import fenics as fn
import numpy as np
import mshr
from dolfin import MPI

def rand_field_line_strict(position_x):
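    # Piecewise-constant coefficient: value `low` left of the line x[0] = position_x
    # (up to the tolerance), value `high` to the right of it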
    tol = 10**(-14)
    return fn.Expression("x[0] < position_x + tol ? low : high", position_x=position_x, low=0.1, high=10, tol=tol, degree=0)

def create_adapted_mesh(position_x, resolution):
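    # Unit square with a marked subdomain left of x = position_x, so that the
    # generated mesh has facets aligned with the coefficient discontinuity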

    polygon_mshr = mshr.Rectangle(fn.Point(0.0,0.0),fn.Point(1.0,1.0))

    polygon_mshr.set_subdomain(1, mshr.Polygon([fn.Point(0.0, 0.0),
                                                fn.Point(position_x, 0.0),
                                                fn.Point(position_x, 1.0),
                                                fn.Point(0.0, 1.0)]))
        
    return mshr.generate_mesh(polygon_mshr, resolution)

if __name__=="__main__":
    fn.set_log_level(30)

    # Failed at ~2700
    N_block = 1950
    N_compute = 20000
    
    # Block some MPI communicators to run into "RuntimeError" faster
    for i in range(N_block):
        print(i, MPI.comm_self.Dup())
    
    MC_sum = 0
    for i in range(N_compute):
        print(i)
        
        # Draw a uniformly distributed random number for the discontinuity position
        position_x = np.random.uniform(0.2,0.8)
    
        # Generate adapted mesh
        mesh = create_adapted_mesh(position_x, 10) 

        # Set up and solve variational problem
        V = fn.FunctionSpace(mesh, 'CG', 1)
    
        u_D = fn.Expression("0", degree=0)
        bc = fn.DirichletBC(V, u_D, 'on_boundary')
        
        a = rand_field_line_strict(position_x)
        f = fn.Constant(1.0)
        
        u = fn.TrialFunction(V)
        v = fn.TestFunction(V)
        
        lhs = fn.dot(a*fn.grad(u),fn.grad(v))*fn.dx
        rhs = f*v*fn.dx
        
        sol = fn.Function(V)
        fn.solve(lhs==rhs, sol, bc)
        
        # Compute H1 norm
        H1_norm = fn.norm(sol, 'H1')
        
        # Add to MC sum
        MC_sum += H1_norm
        
    # Average
    MC_sum /= N_compute
    
    print("The Monte Carlo estimation yields: ", MC_sum)

Output:

2655
2656
2657
Traceback (most recent call last):
  File "MC_estimation_MPI_test.py", line 59, in <module>
    sol = fn.Function(V)
  File "/home/beschlcc/.conda/envs/fenicsproject/lib/python3.8/site-packages/dolfin/function/function.py", line 216, in __init__
    self._cpp_object = cpp.function.Function(V._cpp_object)
RuntimeError:

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***     fenics-support@googlegroups.com
***
*** Remember to include the error message listed below and, if possible,
*** include a minimal running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to successfully call PETSc function 'VecCreate'.
*** Reason:  PETSc error code is: 1 ((null)).
*** Where:   This error was encountered inside /home/conda/feedstock_root/build_artifacts/fenics-pkgs_1582286059742/work/dolfin/dolfin/la/PETScVector.cpp.
*** Process: 0
***
*** DOLFIN version: 2019.1.0
*** Git changeset:  15b823f2c45d3036b6af931284d0f8e3c77b6845
*** -------------------------------------------------------------------------