Error with MPI in transient simulation

Hi,

I am running a transient simulation, and somewhere between steps 150 and 160 I always run into an error. Most of the time, I get this message:

Other MPI error, error stack:
internal_Dist_graph_create_adjacent(125).: MPI_Dist_graph_create_adjacent(comm=0xc40013ec, indegree=0, sources=(nil), sourceweights=0x7f54d3e9bcf4, outdegree=0, destinations=(nil), destweights=0x7f54d3e9bcf4, MPI_INFO_NULL, reorder=0, comm_dist_graph=0x7ffcc55205fc) failed
MPIR_Dist_graph_create_adjacent_impl(319): 
MPII_Comm_copy(913)......................: 
MPIR_Get_contextid_sparse_group(591).....: Too many communicators (0/2048 free on this process; ignore_id=0)
Abort(4811279) on node 0 (rank 0 in comm 32400): application called MPI_Abort(comm=0xC40013EC, 4811279) - process 0

I have dolfinx 0.7.1 installed through conda on WSL2. The problem is reproducible on 0.6.0 as well. Usually the error brings the Python kernel down with it, but once it survived long enough to point out that the problem occurred at the last line of the following snippet from my code:

force_density = ufl.cross(J_eddy, B_sol)  # a 3D vector
dx = ufl.Measure('dx', mesh, subdomain_data=ct)
force = np.zeros(2)
for idx in range(2):
    compiled = dolfinx.fem.form(force_density[idx]*dx(dom_idx))
    force[idx] = mesh.comm.allreduce(dolfinx.fem.assemble_scalar(compiled), op=mpi4py.MPI.SUM)

I am not sure this is enough information for anyone to help, but I am taking a chance in case it triggers something. I am at a loss as to what an MWE should look like in this situation. I am happy to share the actual code, or would gladly take suggestions on how to formulate the MWE.

Thanks in advance.

You need to make a reproducible example. The fewer lines, the better. However, the code has to be executable so that people can guide you.

In general, for transient simulations, I would have a look at:
http://jsdokken.com/FEniCS23-tutorial/src/form_compilation.html#form-compilation
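
For reference, here is a minimal sketch of the idea behind that link, namely compiling the form once outside the time loop and only updating coefficient values inside it (the mesh, space, and coefficient below are made up for illustration):

import mpi4py.MPI
import dolfinx
import ufl

mesh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 8, 8)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1))
f = dolfinx.fem.Function(V)              # coefficient that changes every time step
compiled = dolfinx.fem.form(f * ufl.dx)  # compiled once, before the loop

for step in range(10):
    f.interpolate(lambda x: (1.0 + step) * x[0])  # update coefficient values only
    value = mesh.comm.allreduce(dolfinx.fem.assemble_scalar(compiled), op=mpi4py.MPI.SUM)
    print(step, value)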

As @dokken says, we will be able to advise more once you make a MWE.

However, when it happened to me (see too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub), it was because I was creating too many dolfinx.fem.Function objects. Creating my_ref_function = dolfinx.fem.Function(V) outside of the loop and then using my_actual_function = my_ref_function.copy() inside it did the trick at the time.
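
For concreteness, a minimal sketch of that pattern (the mesh, space, and loop here are made up for illustration; only the reference/copy idea is the point):

import mpi4py.MPI
import dolfinx

msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

my_ref_function = dolfinx.fem.Function(V)  # created once, outside the loop
for i in range(1000):
    my_actual_function = my_ref_function.copy()  # copy instead of constructing Function(V) again
    my_actual_function.x.array[:] = i            # then work with the copy as usual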

If that turns out to be the case, think carefully about whether you really need to create all those functions/all those copies. There are a few legitimate scenarios in which that is needed, but most of the time it is not necessary: you can just reuse an existing dolfinx.fem.Function.

Thanks @dokken and @francesco-ballarin for your feedback.

My geometry changes at each time step. This means I have to create new objects to store the previous solution (i.e. call Omega = FunctionSpace(mesh, …) and Function(Omega)) and bring the old solution over by interpolation. Further postprocessing steps come on top of that. Is it possible to recycle the old FunctionSpace and Function objects on the new mesh?

Regardless, why should this be a problem? As long as the user has the resources (which I do), it should be a harmless way of formulating the model. I am curious whether there is a better way to handle MPI that would prevent this.

I will try to simplify the model and hope the problem survives, so that I can share it as an MWE soon.

Does this mean that you remesh at every time step?

If not, does this just mean that you are updating the coordinates in mesh.geometry.x?

This has to do with how MPI communicators are handled in Python, and with Python's garbage collection.
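
As an illustration of that point (a sketch, assuming the duplicated communicators are released when the corresponding Python objects are garbage-collected): drop the references as soon as a step is done, rather than keeping every object alive.

import gc
import mpi4py.MPI
import dolfinx

for i in range(100):
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    # ... use u for this step ...
    del u, V, msh  # drop the references once the step is done
    gc.collect()   # give Python a chance to free the objects and whatever they hold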

Indeed, I remesh at each time step. I am trying to move to a non-conforming mesh for this problem, where I can transform the mesh only in the domain of interest while leaving the rest intact. Yesterday I made an animation; here it is in case it gives ideas:

(animation: tri_movie)

I have to rotate specific domains in my geometry. Since the mesh is conforming, I soon end up with distorted facets. In principle, I could remesh only when the distortion becomes large enough, but I thought that would be a lot more work.

Can you point me to how to deal with MPI properly? The work I am doing is still preparatory, to overcome the challenges that will come up in a much more complicated 3D geometry. I would rather not be plagued by these issues then.

Please make a minimal reproducible example along the lines of what @IgorBaratta did in his first snippet in too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub

The example should be something like:

create the original mesh

for t in range(A VERY BIG NUMBER):
    print(t)
    remesh
    create a new function space on the new mesh
    create a new function on the new function space

Post the example you came up with, and the value of t at which you get the duplication error.

Thanks for the suggestion; it helped me make a simple MWE which, surprisingly, reproduces the bug quite easily:

import dolfinx, ufl, gmsh
import mpi4py.MPI
from dolfinx.io import gmshio
import numpy as np
from dolfinx.fem.petsc import LinearProblem
from tqdm.auto import tqdm

def create_geom(ro=50, fov=None, disp=False, write=False, rot_angle=0):
    '''a simple geometry where the inner circle rotates by the specified angle'''
    if fov is None:
        fov = (400, 300)
    fov_x, fov_y = fov
    gmsh.initialize()
    gmsh.option.setNumber("General.Terminal", 0) # disable output messages
    gmsh.clear()
    gdim = 2
    model_rank = 0
    
    t_gap = 5 # thickness of airgap
    
    occ = gmsh.model.occ

    # first we create the outer domain
    outer_box = occ.add_rectangle(-fov_x/2, -fov_y/2, 0, fov_x, fov_y)
    air_shell = occ.addDisk(0, 0, 0, ro+t_gap, ro+t_gap)
    occ.synchronize()
    occ.rotate([(gdim, air_shell)], 0, 0, 0, 0, 0, 1, rot_angle)
    occ.synchronize()
    occ.fragment([(gdim, outer_box)], [(gdim, air_shell)])
    occ.synchronize()
    # number all domains
    all_doms = gmsh.model.getEntities(gdim)
    for j, dom in enumerate(all_doms):
        gmsh.model.addPhysicalGroup(dom[0], [dom[1]], j + 1)  # create the main group/node

    # number all boundaries
    all_edges = gmsh.model.getEntities(gdim - 1)
    for j, edge in enumerate(all_edges):
        gmsh.model.addPhysicalGroup(edge[0], [edge[1]], j + 1)  # create the main group/node

    gmsh.model.mesh.generate(gdim)
    gmsh.model.mesh.refine()
    gmsh.model.mesh.refine()
    if write:
        gmsh.write('geom/mwe_multiphenicsx.msh')
    mesh, ct, ft = gmshio.model_to_mesh(gmsh.model, mpi4py.MPI.COMM_WORLD, model_rank, gdim)
    if disp:
        gmsh.fltk.run() # visible inspection, for debugging
    gmsh.finalize()
    return mesh, ct, ft

from scipy.interpolate import griddata
def interp_dofs(uh_data, uh_target):
    '''return dofs on target function based on the data function'''
    points_data = uh_data.function_space.tabulate_dof_coordinates()[:,:2]
    points_target = uh_target.function_space.tabulate_dof_coordinates()[:,:2]
    return griddata(points_data, uh_data.x.array, points_target)    

t_vec = np.linspace(0,1000, 10000)*1e-3
w0 = 2*np.pi*30 # angular frequency
dt = t_vec[1]-t_vec[0]

# boundary tags
bnd_idx_l = (2, ) # left boundary
bnd_idx_r = (3, ) # right boundary

uh_t = t_vec.size*[[]]
for idx_t in tqdm(range(t_vec.size), total=t_vec.size):
    rot_angle = w0*t_vec[idx_t]
    mesh, ct, ft = create_geom(rot_angle=rot_angle)
    
    Omega = dolfinx.fem.functionspace(mesh, ("Lagrange", 2))
    u = ufl.TrialFunction(Omega)
    v = ufl.TestFunction(Omega)

    uh0 = dolfinx.fem.Function(Omega) # recreate with new mesh to store the previous solution
    if idx_t == 0:
        uh0.x.array[:] = 0.0 # first time step, initialize to zero
    else:
        uh0.x.array[:] = interp_dofs(uh_t[idx_t-1], uh0) # interpolate the old solution to this mesh
        uh0.x.scatter_forward() # is it necessary?

    # some physics
    F = ufl.dot(ufl.grad(u), ufl.grad(v))*ufl.dx - ufl.dot(uh0, v)*ufl.dx
    a = ufl.lhs(F)
    L = ufl.rhs(F)

    # boundary conditions
    dofs_l = dolfinx.fem.locate_dofs_topological(Omega, ft.dim, ft.find(bnd_idx_l[0]))
    dbc_l = dolfinx.fem.dirichletbc(1.0, dofs_l, Omega)

    dofs_r = dolfinx.fem.locate_dofs_topological(Omega, ft.dim, ft.find(bnd_idx_r[0]))
    dbc_r = dolfinx.fem.dirichletbc(0., dofs_r, Omega)

    bcs = [dbc_l, dbc_r]
    problem = LinearProblem(a, L, bcs=bcs, petsc_options={"ksp_type": "preonly", "pc_type": "lu",
                                                        "pc_factor_mat_solver_type": "mumps"})
    uh = problem.solve()
    uh.x.scatter_forward() # why do we need this?

    uh_t[idx_t] = uh.copy() # for next step and postprocessing

I get the crash at the 224th iteration. Thanks in advance for the help!

Hopefully it can be reduced to an even more minimal example; I'll try doing that later in the day.

Your example is not really minimal. I ran it for the first time, and it takes 10 minutes on my laptop. You can't assume that a user on this forum will be willing to spend 10 minutes every time they run the script to track down the issue (and they will surely run it more than once, unless they know beforehand what the issue is).

I’ve tried with the following more minimal examples:

  • example 1, which is basically a copy of the example I suggested above
import mpi4py.MPI
import dolfinx

msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

sols = list()
for i in range(2**17):
    print(i)
    u = dolfinx.fem.Function(V)
    sols.append(u)
  • example 2, a slight variation of example 1 that mimics your use case, in which a mesh is created at every time step.
import mpi4py.MPI
import dolfinx

sols = list()
for i in range(2**17):
    print(i)
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    sols.append(u)

Can you run them on your installation and report, for each of the two examples, the value of i at which you get the mpi failure?

[francesco] If that turns out to be the case, think carefully if you really need to create all those functions/all those copies. There are few legitimate scenarios in which that is needed, but most of the times it is not necessary, you can just reuse an existing dolfinx.fem.Function.

[sbhasan] The work I am doing is still preparatory to overcome challenges that will be faced in a lot more complicated 3D geometry

Anyway, from the example you posted I don't see a valid reason for wanting to keep all of the solutions in memory, especially if you plan to use this in 3D. Consider storing only the last one (for time stepping) and saving the rest to file, e.g. with GitHub - jorgensd/adios4dolfinx: Interface of ADIOS2 for DOLFINx.
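
A minimal sketch of that write-per-step pattern (shown here with dolfinx.io.XDMFFile for brevity; adios4dolfinx follows the same idea and, unlike XDMF, can also read the functions back in for postprocessing):

import mpi4py.MPI
import dolfinx
import dolfinx.io

for step in range(10):
    # stand-ins for your remeshing and solve
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10 + step)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    u.interpolate(lambda x: x[0])
    # write this step to disk instead of appending it to a list
    with dolfinx.io.XDMFFile(msh.comm, f"solution_{step:04d}.xdmf", "w") as xdmf:
        xdmf.write_mesh(msh)
        xdmf.write_function(u)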

Of course you may still end up with the same issue once you load all of them for postprocessing, but let's decouple the solving part from the postprocessing one.

If you have a valid reason that isn’t shown in the snippet, please elaborate.

Thanks for your replies, and sorry for the bother. On my laptop, that MWE takes slightly less than 30 seconds, which is sort of OK?!

Example 1, which doesn't remesh, runs without a crash. I am not pushing it further, since 2**17 is far larger than anything I need in practice. The second example crashes at the 255th iteration. I also tried one more example, in which I remesh but do not save the solution; that too ran without trouble:

import mpi4py.MPI
import dolfinx

sols = list()
for i in range(2**17):
    print("Remeshing but not saving the solution -", i)
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    # sols.append(u)

The original model that led me to this post also works now that I don't keep the results in memory. Now that I know the cause, I can think of ways to live with it. But to be honest, it could become another mess somewhere down the line if all we are doing is deferring the problem to postprocessing. I remain interested in a way to prevent the issue itself.

Thanks once again for the help!

I added a post reporting this at the end of the GitHub issue too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub. I am not advocating for the issue to be reopened, but do track it there, because future updates (if any) will only be posted there, not here.

Postprocessing should not be an issue as long as you load the solutions one at a time. Just keep in mind that the limitation applies to dolfinx.fem.Function only: if you load each solution, convert it to matplotlib/pyvista format for plotting, and then store all of the converted solutions, you should be fine, because they will no longer be dolfinx.fem.Function objects.
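
For instance, a sketch of converting a solution into plain pyvista/numpy data before discarding the dolfinx.fem.Function (assuming dolfinx 0.7.x, where the helper is dolfinx.plot.vtk_mesh; in 0.6 it was create_vtk_mesh):

import mpi4py.MPI
import dolfinx
import dolfinx.plot
import pyvista

msh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 8, 8)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
u = dolfinx.fem.Function(V)
u.interpolate(lambda x: x[0] * x[1])  # stand-in for a loaded solution

# plain numpy/pyvista data: safe to keep many of these around
topology, cell_types, geometry = dolfinx.plot.vtk_mesh(V)
grid = pyvista.UnstructuredGrid(topology, cell_types, geometry)
grid.point_data["u"] = u.x.array.real.copy()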

Thanks!

It is true that most FEM problems do not expect the mesh to change. But there are plenty of (advanced?) applications in many domains which require the mesh to be deformable in various ways. Specifically, when levitation is involved, one has to have every tool within reach. In my case, I admit the right approach is really to use a non-conforming mesh and then manipulate it through mesh.geometry.x[:]. However, there are plenty of situations in which the deformation is such that one simply has to remesh.
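
For what it is worth, a minimal sketch of that kind of in-place manipulation (here rotating all mesh nodes about the z-axis; in the real case one would restrict the update to the nodes of the rotating subdomain):

import mpi4py.MPI
import numpy as np
import dolfinx

msh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 8, 8)

theta = np.pi / 180  # rotation per step
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

# mesh.geometry.x is a (num_nodes, 3) array that can be updated in place,
# so no new mesh, function space, or communicator is created
msh.geometry.x[:] = msh.geometry.x @ R.T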

I am not claiming that remeshing is uncommon. What I am claiming is uncommon is the intersection between remeshing and the need to store all solutions.
