I am running a simulation that needs quick access to previously computed meshes. Currently, the fastest way for me to do this is to store the meshes in a list; saving to .xml and loading the meshes repeatedly takes very long.
However, I am running into an MPI error: RuntimeError: *** Error: Duplication of MPI communicator failed
I am using fenics 2019.1.0 as a conda installation.
Here is an MWE to recreate it:
import fenics as fn

N = 10000
mesh_list = []
for i in range(N):
    print("No. ", i)
    mesh = fn.UnitSquareMesh(4, 4)
    mesh_list.append(mesh)
It fails at around No. 2044.
Is there a way to keep the meshes in the list and free the MPI communicators they hold?
I hope you can help me out.
Thank you very much in advance,
Best regards
Cedric
You can only have 2048 MPI communicators per process, as illustrated by:
from mpi4py import MPI

N = 10000
for i in range(N):
    print(i, MPI.COMM_WORLD.Dup())
returning
2043 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
2044 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
Traceback (most recent call last):
File "mwe.py", line 6, in <module>
print(i, MPI.COMM_WORLD.Dup())
File "mpi4py/MPI/Comm.pyx", line 134, in mpi4py.MPI.Comm.Dup
A mesh needs an MPI communicator, as it must be able to communicate with the other processes it shares data with. The function fn.UnitSquareMesh(4, 4) in turn calls dolfin.UnitSquareMesh(dolfin.MPI.comm_world, 4, 4), which requires a duplicate of the COMM_WORLD communicator.
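For illustration, the two calls below should be equivalent (a minimal sketch, assuming DOLFIN 2019.1.0, where the built-in mesh constructors also accept the communicator as their first argument):
import fenics as fn

# each mesh takes a duplicate of the communicator it is given,
# which is what eventually hits the per-process limit
mesh_implicit = fn.UnitSquareMesh(4, 4)
mesh_explicit = fn.UnitSquareMesh(fn.MPI.comm_world, 4, 4)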
It would help if you could give some more information about your problem, e.g.:
Why do you need to store a list of meshes?
Are the meshes the same at every step?
If the meshes are not the same, what varies? Is it the number of cells (connectivity between points) or the location of each cell (i.e. same connectivity, but with moved coordinates)?
Thank you very much for your fast and helpful answer. Now I know that there is no workaround that avoids using a communicator for a mesh.
I am doing a stochastic simulation with adapted meshes and different ‘maximal’ resolutions. Let’s say I compute 10000 samples. Then, for each sample (besides the starting mesh) I have different refined meshes, and I need to store each refined mesh because I only know the ‘maximal’ resolution needed for each sample (mesh) after having computed all 10000 samples.
But you gave me an idea. Since the starting meshes are the same, I can store the arrays of cells marked for refinement for each sample and resolution, and then regenerate the refined meshes iteratively from the starting mesh each time, based on these marked cells.
This might be faster than saving and loading the already refined meshes.
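Roughly, the bookkeeping I have in mind looks like this (just a sketch; record_markers and refinement_record are placeholder names of my own):
import numpy as np

def record_markers(refinement_record, sample, cell_markers):
    # store only the indices of cells marked True on the current mesh,
    # one array per refinement level and per sample
    refinement_record.setdefault(sample, []).append(
        np.flatnonzero(cell_markers.array()))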
When testing this, however, I stumbled across a little peculiarity.
import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells: ", mesh.num_cells())
index_marked = np.arange(1, 4)
cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
cell_markers.set_all(False)
cell_markers.array()[index_marked] = True
for i in range(2):
    mesh_refined = fn.refine(mesh, cell_markers)
    print("#Cells: ", mesh_refined.num_cells())
    mesh = fn.Mesh(mesh_refined)
Executing this code multiple times yields different results after the second refinement (even after clearing the variables in the kernel or running the code in different kernels), e.g.:
1. run: #Cells: 32  #Cells: 52  #Cells: 105
2. run: #Cells: 32  #Cells: 52  #Cells: 101
3. run: #Cells: 32  #Cells: 52  #Cells: 84
4. run: #Cells: 32  #Cells: 52  #Cells: 92
What happens to the numbering of cells in the refined meshes that causes this? It does not seem reasonable to me.
Just for completeness: my code above, which produces different numbers of cells in different runs, uses the same cell_markers in every refinement step of the for loop. That is basically the mistake here: I did not adapt the cell_markers to the refined mesh for the next loop iteration.
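For anyone finding this later, a corrected sketch of the loop, rebuilding the cell markers on the current mesh in every iteration (index_marked is again just an arbitrary choice of cells):
import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells: ", mesh.num_cells())
index_marked = np.arange(1, 4)
for i in range(2):
    # rebuild the marker function on the current mesh, so that the indices
    # refer to cells of the mesh that is actually being refined
    cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
    cell_markers.set_all(False)
    cell_markers.array()[index_marked] = True
    mesh = fn.refine(mesh, cell_markers)
    print("#Cells: ", mesh.num_cells())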
I think I ran into a similar situation, where my code terminates at around 1,900 steps with the error output below. If I save a checkpoint at step 1,000 and restart from that point, the simulation gets past it. I have no idea why this happens. Can it be deduced from the error output that this is the MPI duplication problem? In my code, I do not explicitly generate a new mesh in every time step.
Exception ignored in: <function LinearProblem.__del__ at 0x7fbb026b1580>
Traceback (most recent call last):
  File "/home/chenyongxin/anaconda3/envs/fenicsx090/lib/python3.13/site-packages/dolfinx/fem/petsc.py", line 831, in __del__
    self._solver.destroy()
  File "petsc4py/PETSc/KSP.pyx", line 454, in petsc4py.PETSc.KSP.destroy
petsc4py.PETSc.Error: error code 76
[0] PetscGarbageCleanup() at /home/conda/feedstock_root/build_artifacts/petsc_1733218859887/work/src/sys/objects/garbage.c:215
[0] PetscCommDuplicate() at /home/conda/feedstock_root/build_artifacts/petsc_1733218859887/work/src/sys/objects/tagm.c:219
[0] General MPI error
[0] MPI error 405363215 Other MPI error, error stack:
internal_Comm_dup(29244)............: MPI_Comm_dup(MPI_COMM_WORLD, newcomm=0x7ffc66b8ae3c) failed
MPIR_Comm_dup_impl(671).............:
MPII_Comm_dup(901)..................:
MPII_Comm_copy(943).................:
MPIR_Get_contextid_sparse_group(569): Cannot allocate context ID because of fragmentation (29/2048 free on this process; ignore_id=0)
Other possibly useful info:
In every time step, I create many auxiliary fem.Function objects to update the mesh coordinates, e.g.:
u = fem.Function(V)
u.interpolate(U.sub(0).collapse())
_mesh.geometry.x[:, :ndims] += u.x.array.reshape((-1, ndims))
In every time step, I explicitly assemble matrices and vectors and then solve:
def _assemble_(a, L, bcs: List = []) -> Tuple[PETSc.Mat, PETSc.Vec]:
    A = assemble_matrix(fem.form(a), bcs)
    A.assemble()
    b = assemble_vector(fem.form(L))
    apply_lifting(b, [fem.form(a)], bcs=[bcs])
    b.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)
    set_bc(b, bcs)
    return A, b
for time_step in range(2000):
    A0, b0 = _assemble_(a0, b0, bcs0)
    # ...
    A1, b1 = _assemble_(a1, b1, bcs1)
    # ...
    A2, b2 = _assemble_(a2, b2, bcs2)
    A = A0 + A1 + A2
    b = b0 + b1 + b2
    A.assemble()
    solver.setOperators(A)
    solver.solve(b, U.x.petsc_vec)
    [item.destroy() for item in (A, b, A0, b0, A1, b1, A2, b2)]