I am running a simulation that needs quick access to previously computed meshes. Currently, the fastest way for me to do this is to store the meshes in a list; saving to .xml and loading the meshes repeatedly takes very long.
However, I am running into an MPI error: RuntimeError: *** Error: Duplication of MPI communicator failed
I am using fenics 2019.1.0 as a conda installation.
Here is an MWE to recreate it:
import fenics as fn

N = 10000
mesh_list = []
for i in range(N):
    print("No. ", i)
    mesh = fn.UnitSquareMesh(4, 4)
    mesh_list.append(mesh)
It fails at around No. 2044.
Is there a way to keep the meshes in the list and free the MPI communicators they hold?
I hope you can help me out.
Thank you very much in advance,
Best regards
Cedric
You can only have 2048 MPI communicators per process, as illustrated by:
from mpi4py import MPI

N = 10000
for i in range(N):
    print(i, MPI.COMM_WORLD.Dup())
returning
2043 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
2044 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
Traceback (most recent call last):
File "mwe.py", line 6, in <module>
print(i, MPI.COMM_WORLD.Dup())
File "mpi4py/MPI/Comm.pyx", line 134, in mpi4py.MPI.Comm.Dup
A mesh needs an MPI communicator, as it must be able to communicate with the other processes it shares data with. The function fn.UnitSquareMesh(4, 4) in turn calls dolfin.UnitSquareMesh(dolfin.MPI.comm_world, 4, 4), which requires a duplicate of the COMM_WORLD communicator.
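For illustration, the two calls below should be equivalent (a minimal sketch, assuming DOLFIN 2019.1.0, where the built-in mesh constructors also accept the communicator as their first argument):
import fenics as fn

# each mesh takes a duplicate of the communicator it is given,
# which is what eventually hits the per-process limit
mesh_implicit = fn.UnitSquareMesh(4, 4)
mesh_explicit = fn.UnitSquareMesh(fn.MPI.comm_world, 4, 4)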
It would help if you could give some more information about your problem, e.g.:
Why do you need to store a list of meshes?
Are the meshes the same at every step?
If the meshes are not the same, what varies? Is it the number of cells (connectivity between points) or the location of each cell (i.e. same connectivity, but with moved coordinates)?
Thank you very much for your fast and helpful answer. Now I know that there is no workaround that avoids using a communicator for a mesh.
I am doing a stochastic simulation with adapted meshes and different ‘maximal’ resolutions. Let’s say I compute 10000 samples. Then, for each sample (besides the starting mesh) I have different refined meshes, and I need to store each refined mesh because I only know the ‘maximal’ resolution needed for each sample (mesh) after having computed all 10000 samples.
But you gave me an idea. Since the starting meshes are the same, I can store the arrays of cells marked for refinement for each sample and resolution, and then regenerate the refined meshes iteratively from the starting mesh each time, based on these marked cells.
This might be faster than saving and loading the already refined meshes.
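Roughly, the bookkeeping I have in mind looks like this (just a sketch; record_markers and refinement_record are placeholder names of my own):
import numpy as np

def record_markers(refinement_record, sample, cell_markers):
    # store only the indices of cells marked True on the current mesh,
    # one array per refinement level and per sample
    refinement_record.setdefault(sample, []).append(
        np.flatnonzero(cell_markers.array()))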
When testing this, however, I stumbled across a little peculiarity.
import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells: ", mesh.num_cells())
index_marked = np.arange(1, 4)
cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
cell_markers.set_all(False)
cell_markers.array()[index_marked] = True
for i in range(2):
    mesh_refined = fn.refine(mesh, cell_markers)
    print("#Cells: ", mesh_refined.num_cells())
    mesh = fn.Mesh(mesh_refined)
Executing this code multiple times yields different results after the second refinement (even after clearing the variables in the kernel or running the code in different kernels), e.g.:
1. run: #Cells: 32  #Cells: 52  #Cells: 105
2. run: #Cells: 32  #Cells: 52  #Cells: 101
3. run: #Cells: 32  #Cells: 52  #Cells: 84
4. run: #Cells: 32  #Cells: 52  #Cells: 92
What happens to the numbering of cells in the refined meshes that causes this? It does not seem reasonable to me.
Just for completeness: my code above, which produces different numbers of cells in different runs, uses the same cell_markers in every refinement step of the for loop. That is basically the mistake here: I did not adapt the cell_markers to the refined mesh for the next loop iteration.
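For anyone finding this later, a corrected sketch of the loop, rebuilding the cell markers on the current mesh in every iteration (index_marked is again just an arbitrary choice of cells):
import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells: ", mesh.num_cells())
index_marked = np.arange(1, 4)
for i in range(2):
    # rebuild the marker function on the current mesh, so that the indices
    # refer to cells of the mesh that is actually being refined
    cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
    cell_markers.set_all(False)
    cell_markers.array()[index_marked] = True
    mesh = fn.refine(mesh, cell_markers)
    print("#Cells: ", mesh.num_cells())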
I think I ran into a similar situation, where my code terminates at around 1,900 steps with the error output below. If I save a checkpoint at step 1,000 and restart from that point, the simulation gets past it. I have no idea why this happens. Can it be deduced from the error output that this is the MPI duplication problem? In my code, I do not explicitly generate a new mesh in every time step.
Exception ignored in: <function LinearProblem.__del__ at 0x7fbb026b1580>
Traceback (most recent call last):
  File "/home/chenyongxin/anaconda3/envs/fenicsx090/lib/python3.13/site-packages/dolfinx/fem/petsc.py", line 831, in __del__
    self._solver.destroy()
  File "petsc4py/PETSc/KSP.pyx", line 454, in petsc4py.PETSc.KSP.destroy
petsc4py.PETSc.Error: error code 76
[0] PetscGarbageCleanup() at /home/conda/feedstock_root/build_artifacts/petsc_1733218859887/work/src/sys/objects/garbage.c:215
[0] PetscCommDuplicate() at /home/conda/feedstock_root/build_artifacts/petsc_1733218859887/work/src/sys/objects/tagm.c:219
[0] General MPI error
[0] MPI error 405363215 Other MPI error, error stack:
internal_Comm_dup(29244)............: MPI_Comm_dup(MPI_COMM_WORLD, newcomm=0x7ffc66b8ae3c) failed
MPIR_Comm_dup_impl(671).............:
MPII_Comm_dup(901)..................:
MPII_Comm_copy(943).................:
MPIR_Get_contextid_sparse_group(569): Cannot allocate context ID because of fragmentation (29/2048 free on this process; ignore_id=0)
Other possibly useful info:
In every time step, I create many auxiliary fem.Function objects to update the mesh coordinates, e.g.:
u = fem.Function(V)
u.interpolate(U.sub(0).collapse())
_mesh.geometry.x[:, :ndims] += u.x.array.reshape((-1, ndims))
In every time step, I explicitly assemble matrices and vectors and then solve:
def _assemble_(a, L, bcs: List = []) -> Tuple[PETSc.Mat, PETSc.Vec]:
    A = assemble_matrix(fem.form(a), bcs)
    A.assemble()
    b = assemble_vector(fem.form(L))
    apply_lifting(b, [fem.form(a)], bcs=[bcs])
    b.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)
    set_bc(b, bcs)
    return A, b
for time_step in range(2000):
    A0, b0 = _assemble_(a0, b0, bcs0)
    # ...
    A1, b1 = _assemble_(a1, b1, bcs1)
    # ...
    A2, b2 = _assemble_(a2, b2, bcs2)
    A = A0 + A1 + A2
    b = b0 + b1 + b2
    A.assemble()
    solver.setOperators(A)
    solver.solve(b, U.x.petsc_vec)
    [item.destroy() for item in (A, b, A0, b0, A1, b1, A2, b2)]