Saving meshes in a list: RuntimeError: *** Error: Duplication of MPI communicator failed

Hi everybody,

I am running a simulation that needs quick access to previously computed meshes. The fastest way for me to do this at the moment is to store the meshes in a list; saving them to .xml and loading them repeatedly takes very long.

However, I am running into an MPI error: RuntimeError: *** Error: Duplication of MPI communicator failed
I am using fenics 2019.1.0 as a conda installation.

Here is an MWE to recreate it:

import fenics as fn

N=10000
mesh_list = []
for i in range(N):
    print("No. ", i)
    mesh = fn.UnitSquareMesh(4,4)
    mesh_list.append(mesh)

It fails at around No. 2044.
Is there a way to keep the meshes in the list but release the MPI communicators they hold?

I hope you can help me out.
Thank you very much in advance,
Best regards
Cedric

You can only have 2048 MPI communicators per process, as illustrated with:

from mpi4py import MPI
N = 10000
mesh_list = []

for i in range(N):
    print(i, MPI.COMM_WORLD.Dup())

returning

2043 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
2044 <mpi4py.MPI.Intracomm object at 0x7f3d70d81a80>
Traceback (most recent call last):
  File "mwe.py", line 6, in <module>
    print(i, MPI.COMM_WORLD.Dup())
  File "mpi4py/MPI/Comm.pyx", line 134, in mpi4py.MPI.Comm.Dup

A mesh needs an MPI communicator, as it must be able to communicate with the other processes sharing its data. The function fn.UnitSquareMesh(4, 4) in turn calls dolfin.UnitSquareMesh(dolfin.MPI.comm_world, 4, 4), which needs a copy of the COMM_WORLD communicator.
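
A minimal sketch of that equivalence with the dolfin 2019.1.0 API (passing another communicator such as comm_self is possible, but to my understanding the mesh still duplicates whatever communicator it is given, so this alone does not avoid the limit):

import fenics as fn

mesh_default = fn.UnitSquareMesh(4, 4)                      # implicit comm_world
mesh_explicit = fn.UnitSquareMesh(fn.MPI.comm_world, 4, 4)  # equivalent explicit call
mesh_local = fn.UnitSquareMesh(fn.MPI.comm_self, 4, 4)      # process-local mesh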

If you could give some more information about your problem, i.e.

  • Why do you need to store a list of meshes?
  • Are the meshes the same at every step?
  • If the meshes are not the same, what varies? Is it the number of cells (i.e. the connectivity between points), or is it the location of each cell (i.e. the same connectivity, but with moved coordinates)?

Hi,

thank you very much for your fast and helpful answer. Now I know that there is no workaround that avoids using a communicator for a mesh.

I am doing a stochastic simulation with adapted meshes and different ‘maximal’ resolutions. Let’s say I compute 10000 samples. Then for each sample (apart from the starting mesh) I have a different refined mesh, and I need to store each refined mesh because I only know the ‘maximal’ resolution needed for each sample (mesh) after having computed all 10000 samples.

But you brought me to an idea. Since the starting meshes are the same, I can store, for each sample and resolution, the arrays containing the cells marked for refinement, and then regenerate the refined meshes iteratively from the starting mesh based on these marked cells whenever they are needed.

This might be faster than saving and loading the already refined meshes.
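
A minimal sketch of this idea, assuming the markers are stored as one boolean numpy array per refinement level (the helper name and storage format are my own choices, not anything fixed by FEniCS):

import fenics as fn
import numpy as np

def rebuild_mesh(marker_arrays):
    # marker_arrays: one boolean array per refinement level, sized to the
    # mesh of that level, True where a cell should be refined
    mesh = fn.UnitSquareMesh(4, 4)  # common starting mesh
    for markers in marker_arrays:
        cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
        cell_markers.set_all(False)
        cell_markers.array()[:] = markers
        mesh = fn.refine(mesh, cell_markers)
    return mesh

This way only the boolean arrays are kept; each rebuilt mesh (and, I would expect, its duplicated communicator) is released again once it goes out of scope.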

When testing this, however, I stumbled across a little peculiarity.

import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells: ", mesh.num_cells())

index_marked = np.arange(1,4)
cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
cell_markers.set_all(False)
cell_markers.array()[index_marked] = True

for i in range(2):
    mesh_refined = fn.refine(mesh, cell_markers)
    print("#Cells: ", mesh_refined.num_cells())
    mesh = fn.Mesh(mesh_refined)

Executing this code multiple times yields different results after the second refinement (even after clearing the variables in the kernel or running the code in different kernels), e.g.:
Run 1:
#Cells: 32
#Cells: 52
#Cells: 105
Run 2:
#Cells: 32
#Cells: 52
#Cells: 101
Run 3:
#Cells: 32
#Cells: 52
#Cells: 84
Run 4:
#Cells: 32
#Cells: 52
#Cells: 92

What happens to the numbering of cells in the refined meshes that causes this? It does not seem reasonable to me.

Best regards
Cedric

I cannot reproduce this using:

docker run -ti -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $(pwd):/root/shared -w /root/shared --rm quay.io/fenicsproject/dev:latest

Thanks for trying it out and letting me know. I will investigate a bit more and see if I can find the reason.

Just for completeness: my code above, which produces different numbers of cells in different runs, reuses the same cell_markers in every iteration of the refinement loop. That is the mistake here: I did not adapt the cell_markers to the refined mesh before the next loop iteration.
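
For reference, a sketch of the corrected loop, with the marker function rebuilt on the current mesh in every iteration (the indices to mark on each level are just an example, not anything prescribed):

import fenics as fn
import numpy as np

mesh = fn.UnitSquareMesh(4, 4)
print("#Cells:", mesh.num_cells())

index_marked = np.arange(1, 4)  # example cell indices for each level
for i in range(2):
    # rebuild the marker function on the current mesh before refining
    cell_markers = fn.MeshFunction("bool", mesh, mesh.topology().dim())
    cell_markers.set_all(False)
    cell_markers.array()[index_marked] = True
    mesh = fn.refine(mesh, cell_markers)
    print("#Cells:", mesh.num_cells())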