Error with MPI in transient simulation

Hi,

I am running a transient simulation and somewhere between step 150 and 160 I always run into an error. Most of the time, I get this message:

Other MPI error, error stack:
internal_Dist_graph_create_adjacent(125).: MPI_Dist_graph_create_adjacent(comm=0xc40013ec, indegree=0, sources=(nil), sourceweights=0x7f54d3e9bcf4, outdegree=0, destinations=(nil), destweights=0x7f54d3e9bcf4, MPI_INFO_NULL, reorder=0, comm_dist_graph=0x7ffcc55205fc) failed
MPIR_Dist_graph_create_adjacent_impl(319): 
MPII_Comm_copy(913)......................: 
MPIR_Get_contextid_sparse_group(591).....: Too many communicators (0/2048 free on this process; ignore_id=0)
Abort(4811279) on node 0 (rank 0 in comm 32400): application called MPI_Abort(comm=0xC40013EC, 4811279) - process 0

I have dolfinx 0.7.1 installed through conda on WSL2. The problem is reproducible on 0.6.0 as well. Usually the error brings the Python kernel down with it, but once it survived long enough to point out that the problem happened at the last line of the following snippet from my code:

force_density = ufl.cross(J_eddy, B_sol)  # a 3D vector
dx = ufl.Measure('dx', mesh, subdomain_data=ct)
force = np.zeros(2)
for idx in range(2):
    compiled = dolfinx.fem.form(force_density[idx]*dx(dom_idx))
    force[idx] = mesh.comm.allreduce(dolfinx.fem.assemble_scalar(compiled), op=mpi4py.MPI.SUM)

I am not sure that this is sufficient information for anyone to help, but I am taking a chance in case it triggers something. I am at a loss as to what the MWE should be in this situation. I am happy to share the actual code, or I would gladly take suggestions on how to formulate the MWE.

Thanks in advance.

You need to make a reproducible example. The fewer lines, the better. However, the code has to be executable so that people can guide you.

In general, for transient simulations, I would have a look at:
http://jsdokken.com/FEniCS23-tutorial/src/form_compilation.html#form-compilation
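
For illustration, here is a minimal, self-contained sketch of the pattern that tutorial advocates: compile the form once, outside the time loop, and only assemble inside it. The integrand is just a toy placeholder:

import mpi4py.MPI
import dolfinx
import ufl

mesh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 8, 8)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1))
u = dolfinx.fem.Function(V)

# compile the form once, outside the time loop ...
compiled = dolfinx.fem.form(u * u * ufl.dx)

for step in range(100):
    u.x.array[:] = step  # stand-in for updating the solution at each step
    # ... and only (re)assemble inside the loop
    local = dolfinx.fem.assemble_scalar(compiled)
    total = mesh.comm.allreduce(local, op=mpi4py.MPI.SUM)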

As @dokken says, we will be able to advise more once you make an MWE.

However, when it happened to me (see too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub), it was because I was creating too many dolfinx.fem.Function objects. Creating my_ref_function = dolfinx.fem.Function(V) outside of the loop and then using my_actual_function = my_ref_function.copy() did the trick at the time.

If that turns out to be the case, think carefully about whether you really need to create all those functions/all those copies. There are a few legitimate scenarios in which that is needed, but most of the time it is not necessary: you can just reuse an existing dolfinx.fem.Function.
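
For concreteness, a minimal sketch of that pattern (the names mirror the ones above and are purely illustrative):

import mpi4py.MPI
import dolfinx

msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

my_ref_function = dolfinx.fem.Function(V)  # created once, outside the loop

for step in range(1000):
    my_actual_function = my_ref_function.copy()  # copy instead of constructing a brand-new Function
    # ... work with my_actual_function ...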

Thanks @dokken and @francesco-ballarin for your feedback.

My geometry changes at each time step. This means I have to create new variables to store the previous solution (i.e. call Omega = FunctionSpace(mesh,…) and Function(Omega)) and bring over the old solution by interpolation. Further postprocessing steps come on top of that. Is it possible to recycle the old FunctionSpace and Function objects on the new mesh?

Regardless, why should it be a problem? As long as the user has the resources (which I do), it should be a harmless way of formulating the model. I am curious whether there is a better way to handle MPI that would prevent this.

I will try to simplify the model and hope the problem survives, so that I can share it as an MWE soon.

Does this mean that you remesh at every time step?

If not, does this just mean that you are updating the coordinates in mesh.geometry.x?

This has to do with how MPI is handled in Python and with Python's garbage collection.
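
For illustration, the communicators held by DOLFINx objects are only released once the Python objects themselves are collected, so dropping references and forcing a collection pass sometimes helps when many temporary objects are created. A sketch, assuming nothing else still references the objects:

import gc

import mpi4py.MPI
import dolfinx

for step in range(10000):
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    # ... use u for this step ...
    del u, V, msh  # drop our references ...
    gc.collect()   # ... and collect any cyclic garbage now, so the underlying MPI resources are freed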

Indeed, I remesh at each time step. I am trying to move to a non-conformal geometry for this problem, where I can transform the mesh only in the domain of interest while leaving the rest intact. Yesterday I made an animation; here it is in case it gives ideas:

[animation: tri_movie]

I have to rotate specific domains in my geometry. Since the mesh is conformal, I soon end up with distorted facets. In principle, I could remesh only when the distortion becomes large enough, but that seemed like a lot more work.

Can we learn how to deal with MPI properly? The work I am doing is still preparatory, meant to overcome challenges that will arise in a much more complicated 3D geometry. I would rather not be bothered by these issues then.

Please make a minimal reproducible example along the lines of what @IgorBaratta did in his first snippet in too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub

The example should be something like:

create the original mesh

for t in range(A VERY BIG NUMBER):
    print(t)
    remesh
    create a new function space on the new mesh
    create a new function on the new function space

Post the example you came up with, and the value of t at which you get the duplication error.

Thanks for the suggestion; it helped me make a simple MWE which, surprisingly, reproduces the bug quite easily:

import dolfinx, mpi4py, ufl, gmsh, petsc4py, pyvista
from dolfinx.io import gmshio
import numpy as np
import matplotlib.pyplot as plt
from dolfinx.fem.petsc import LinearProblem
from tqdm.auto import tqdm

def create_geom(ro=50, fov=None, disp=False, write=False, rot_angle=0):
    '''a simple geometry where the inner circle rotates by the specified angle'''
    if fov is None:
        fov = (400, 300)
    fov_x, fov_y = fov
    gmsh.initialize()
    gmsh.option.setNumber("General.Terminal", 0) # disable output messages
    gmsh.clear()
    gdim = 2
    model_rank = 0
    
    t_gap = 5 # thickness of airgap
    
    occ = gmsh.model.occ

    # first we create the outer domain
    outer_box = occ.addRectangle(-fov_x/2, -fov_y/2, 0, fov_x, fov_y)
    air_shell = occ.addDisk(0, 0, 0, ro+t_gap, ro+t_gap)
    occ.synchronize()
    occ.rotate([(gdim, air_shell)], 0, 0, 0, 0, 0, 1, rot_angle)
    occ.synchronize()
    occ.fragment([(gdim, outer_box)], [(gdim, air_shell)])
    occ.synchronize()
    # number all domains
    all_doms = gmsh.model.getEntities(gdim)
    for j, dom in enumerate(all_doms):
        gmsh.model.addPhysicalGroup(dom[0], [dom[1]], j + 1)  # one physical group per subdomain

    # number all boundaries
    all_edges = gmsh.model.getEntities(gdim - 1)
    for j, edge in enumerate(all_edges):
        gmsh.model.addPhysicalGroup(edge[0], [edge[1]], j + 1)  # one physical group per boundary

    gmsh.model.mesh.generate(gdim)
    gmsh.model.mesh.refine()
    gmsh.model.mesh.refine()
    if write:
        gmsh.write('geom/mwe_multiphenicsx.msh')
    mesh, ct, ft = gmshio.model_to_mesh(gmsh.model, mpi4py.MPI.COMM_WORLD, model_rank, gdim)
    if disp:
        gmsh.fltk.run() # visible inspection, for debugging
    gmsh.finalize()
    return mesh, ct, ft

from scipy.interpolate import griddata
def interp_dofs(uh_data, uh_target):
    '''return dofs on target function based on the data function'''
    points_data = uh_data.function_space.tabulate_dof_coordinates()[:,:2]
    points_target = uh_target.function_space.tabulate_dof_coordinates()[:,:2]
    return griddata(points_data, uh_data.x.array, points_target)    

t_vec = np.linspace(0,1000, 10000)*1e-3
w0 = 2*np.pi*30 # angular frequency
dt = t_vec[1]-t_vec[0]

# boundary tags
bnd_idx_l = (2, ) # left boundary
bnd_idx_r = (3, ) # right boundary

uh_t = t_vec.size*[[]]
for idx_t in tqdm(range(t_vec.size), total=t_vec.size):
    rot_angle = w0*t_vec[idx_t]
    mesh, ct, ft = create_geom(rot_angle=rot_angle)
    
    Omega = dolfinx.fem.functionspace(mesh, ("Lagrange", 2))
    u = ufl.TrialFunction(Omega)
    v = ufl.TestFunction(Omega)

    uh0 = dolfinx.fem.Function(Omega) # recreate with new mesh to store the previous solution
    if idx_t == 0:
        uh0.x.array[:] = 0.0 # first time step, initialize to zero
    else:
        uh0.x.array[:] = interp_dofs(uh_t[idx_t-1], uh0) # interpolate the old solution to this mesh
        uh0.x.scatter_forward() # is it necessary?

    # some physics
    F = ufl.dot(ufl.grad(u), ufl.grad(v))*ufl.dx - ufl.dot(uh0, v)*ufl.dx
    a = ufl.lhs(F)
    L = ufl.rhs(F)

    # boundary conditions
    dofs_l = dolfinx.fem.locate_dofs_topological(Omega, ft.dim, ft.find(bnd_idx_l[0]))
    dbc_l = dolfinx.fem.dirichletbc(1.0, dofs_l, Omega)

    dofs_r = dolfinx.fem.locate_dofs_topological(Omega, ft.dim, ft.find(bnd_idx_r[0]))
    dbc_r = dolfinx.fem.dirichletbc(0., dofs_r, Omega)

    bcs = [dbc_l, dbc_r]
    problem = LinearProblem(a, L, bcs=bcs, petsc_options={"ksp_type": "preonly", "pc_type": "lu",
                                                        "pc_factor_mat_solver_type": "mumps"})
    uh = problem.solve()
    uh.x.scatter_forward() # why do we need this?

    uh_t[idx_t] = uh.copy() # for next step and postprocessing

I get the crash at the 224th iteration. Thanks in advance for the help!

Hopefully it can be reduced to an even more minimal example; I’ll try doing that later in the day.

Your example is not really minimal. I ran it once, and it takes 10 minutes on my laptop. You can’t assume that a user on this forum will be willing to spend 10 minutes every time they run the script to track down the issue (and they will surely run it more than once, unless they knew beforehand what the issue was).

I’ve tried with the following more minimal examples:

  • Example 1, which is basically a copy of the example I suggested above:
import mpi4py.MPI
import dolfinx

msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

sols = list()
for i in range(2**17):
    print(i)
    u = dolfinx.fem.Function(V)
    sols.append(u)
  • Example 2, which is a slight variation of example 1 and mimics your use case, in which a mesh is created at every time step:
import mpi4py.MPI
import dolfinx

sols = list()
for i in range(2**17):
    print(i)
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    sols.append(u)

Can you run them on your installation and report, for each of the two examples, the value of i at which you get the MPI failure?

[francesco] If that turns out to be the case, think carefully if you really need to create all those functions/all those copies. There are few legitimate scenarios in which that is needed, but most of the times it is not necessary, you can just reuse an existing dolfinx.fem.Function.

[sbhasan] The work I am doing is still preparatory to overcome challenges that will be faced in a lot more complicated 3D geometry

Anyway, from the example you posted, I don’t see a valid reason for wanting to keep all of the solutions in memory, especially if you plan to use this in 3D. Consider storing only the last one (for time stepping) and saving the rest to file, e.g. with GitHub - jorgensd/adios4dolfinx: Interface of ADIOS2 for DOLFINx.
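
For reference, a rough sketch of the save-to-file idea. The adios4dolfinx call signatures below (write_mesh(mesh, filename), write_function(u, filename)) are an assumption based on the 0.7-era examples and may differ in other releases, so check the package documentation for your version:

import mpi4py.MPI
import dolfinx
import adios4dolfinx

msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

for step in range(100):
    u = dolfinx.fem.Function(V)
    # ... solve into u ...
    # write this step to its own checkpoint instead of appending u to a list
    adios4dolfinx.write_mesh(msh, f"mesh_{step:04d}.bp")
    adios4dolfinx.write_function(u, f"solution_{step:04d}.bp")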

Of course you may still end up with the same issue once you load all of them up for the postprocessing, but let’s decouple the solving part from the post-processing one.

If you have a valid reason that isn’t shown in the snippet, please elaborate.

Thanks for your replies, and sorry for the bother. On my laptop, that MWE takes slightly less than 30 seconds, which seems sort of OK?!

Example 1, which doesn’t remesh, runs without a crash. I am not pushing it further, because 2**17 is practically too large a number for anything I want to do. The second example crashes at the 255th iteration. I did one more example in which I remesh but do not save the solution. That too ran without trouble:

import mpi4py.MPI
import dolfinx

sols = list()
for i in range(2**17):
    print("Remeshing but not saving the solution -", i)
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    # sols.append(u)

The original model that led me to this post also works now that I don’t keep the results in memory. Now that I know the cause, I can think of ways to live with it. But to be honest, it could turn into another mess further down the line if all we are doing is deferring the problem to postprocessing. I remain interested in a way that would prevent the issue itself.

Thanks once again for the help!

I added a post reporting this at the end of the GitHub issue too many MPI_COMM duplications while creating a Function? · Issue #2308 · FEniCS/dolfinx · GitHub. I am not advocating for the issue to be reopened, but please track it there, because future updates (if any) will only be posted there, and not here.

Postprocessing should not be an issue, as long as you load the solutions one at a time. Just keep in mind that the limitation applies to dolfinx.fem.Function only: if you load a solution, convert it to matplotlib/pyvista format for plotting, and then store all of the converted solutions, you should be fine, because they will no longer be dolfinx.fem.Function objects.
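
For example, something along these lines (a sketch using dolfinx.plot.vtk_mesh; after the conversion only plain pyvista/numpy data is kept, and the Function can be dropped):

import mpi4py.MPI
import dolfinx
import dolfinx.plot
import pyvista

msh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 8, 8)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

grids = []
for step in range(10):
    u = dolfinx.fem.Function(V)
    u.interpolate(lambda x: step * x[0])  # stand-in for a solution loaded from file
    topology, cell_types, geometry = dolfinx.plot.vtk_mesh(V)
    grid = pyvista.UnstructuredGrid(topology, cell_types, geometry)
    grid.point_data["u"] = u.x.array.real.copy()
    grids.append(grid)  # plain pyvista data, no dolfinx.fem.Function kept alive
    del u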

Thanks!

It is true that most FEM problems do not expect the mesh to change. But there are plenty of (advanced?) applications in many domains that require the mesh to be deformable in varying ways. Specifically, when levitation is involved, one has to have all the tools within reach. In my case, I admit the right approach is really to use a non-conformal mesh and then manipulate it through mesh.geometry.x[:]. However, there are plenty of situations in which the deformation is such that one simply has to remesh.
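
Something along these lines is what I have in mind for the coordinate-update approach. A sketch only: the radius r_rotor selecting the rotating region and the centre are hypothetical, and it only makes sense if the interface is handled in a non-conformal way (otherwise the surrounding elements get distorted, as described above):

import mpi4py.MPI
import numpy as np
import dolfinx

msh = dolfinx.mesh.create_unit_square(mpi4py.MPI.COMM_WORLD, 20, 20)

r_rotor = 0.3             # hypothetical radius of the rotating region, centred at (0.5, 0.5)
dtheta = np.deg2rad(1.0)  # rotation increment per time step

x = msh.geometry.x        # (num_nodes, 3) array of mesh node coordinates, writable in place
xc, yc = x[:, 0] - 0.5, x[:, 1] - 0.5
inner = np.sqrt(xc**2 + yc**2) < r_rotor  # nodes belonging to the rotating region

c, s = np.cos(dtheta), np.sin(dtheta)
x[inner, 0] = 0.5 + c * xc[inner] - s * yc[inner]
x[inner, 1] = 0.5 + s * xc[inner] + c * yc[inner]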

I am not claiming that remeshing is uncommon. What I am claiming is uncommon is the intersection between remeshing and the need to store all solutions.
