I am also working on a transient simulation in which I perform adaptive mesh refinement (AMR) every few time steps. The workflow is roughly:
- Define function space, functions, and variational problems on an initial mesh.
- Solve to obtain the solution, use it for AMR marking and generate a refined mesh.
- On the new mesh, redefine function space, functions, and variational problems.
- Repeat the refinement until a stopping criterion is met.
- Pass the final refined mesh to the transient solver for subsequent time steps.
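The steps above can be sketched in plain Python as follows. This is only a minimal illustration of the control flow, not DOLFINx code: a 1D list of node coordinates stands in for the mesh, and `solve`, `mark`, `refine`, and `amr` are hypothetical helpers, with a sampled `tanh` profile taking the place of the real solve and a simple jump indicator taking the place of the real marking strategy.

```python
import math


def solve(mesh):
    """Stand-in 'solve': sample u(x) = tanh(20 (x - 0.5)) at the mesh nodes."""
    return [math.tanh(20.0 * (x - 0.5)) for x in mesh]


def mark(mesh, u, tol):
    """Mark cells whose jump in u across the cell exceeds tol."""
    return [i for i in range(len(mesh) - 1) if abs(u[i + 1] - u[i]) > tol]


def refine(mesh, marked):
    """Bisect every marked cell and return a new list of nodes."""
    marked = set(marked)
    new_mesh = []
    for i in range(len(mesh) - 1):
        new_mesh.append(mesh[i])
        if i in marked:
            new_mesh.append(0.5 * (mesh[i] + mesh[i + 1]))
    new_mesh.append(mesh[-1])
    return new_mesh


def amr(mesh, tol=0.2, max_levels=10):
    """Refine until no cell is marked or max_levels is reached."""
    for _ in range(max_levels):
        u = solve(mesh)              # everything is rebuilt on the current mesh
        marked = mark(mesh, u, tol)
        if not marked:               # stopping criterion
            break
        mesh = refine(mesh, marked)
    return mesh                      # only the mesh is handed on


coarse = [i / 8 for i in range(9)]   # initial mesh on [0, 1]
fine = amr(coarse)                   # mesh passed on to the transient solver
```

Note that `u` and `marked` are temporaries inside `amr`; only the returned mesh is needed afterwards, which mirrors the situation in the real code.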
This AMR loop is called multiple times throughout the transient simulation. However, once the AMR routine has been called more than a certain number of times, I get a similar MPI error:
Other MPI error, error stack:
internal_Dist_graph_create_adjacent(125): MPI_Dist_graph_create_adjacent(comm=0xc4005134, indegree=6, ...)
MPIR_Dist_graph_create_adjacent_impl(319):
MPII_Comm_copy(913)......................:
MPIR_Get_contextid_sparse_group(587).....: Cannot allocate context ID because of fragmentation
After searching related discussions, I found similar issues reported here:
https://fenicsproject.discourse.group/t/saving-meshes-in-a-list-runtimeerror-error-duplication-of-mpi-communicator-failed/7749/8
https://github.com/FEniCS/dolfinx/issues/2308
From these, my tentative understanding is that the issue is caused by calling AMR repeatedly and creating many functions along the way.
In my case, after each AMR step, I only need the refined mesh to continue the transient simulation; the function spaces and functions from previous AMR iterations are no longer used. However, it seems these resources are not correctly freed.
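The following sketch illustrates one mechanism that could produce this behaviour. It does not use DOLFINx or MPI: `FakeComm` and `FakeSpace` are hypothetical stand-ins, and the assumption (not confirmed here) is that the real objects hold a limited resource such as a communicator and may be kept alive by reference cycles. In that situation Python's reference counting alone does not release them, and they are only freed when the cyclic garbage collector runs.

```python
import gc
import weakref


class FakeComm:
    """Hypothetical stand-in for a limited resource (e.g. a communicator)."""

    live = 0  # number of FakeComm objects that have not been released yet

    def __init__(self):
        FakeComm.live += 1
        # The callback runs when this object is actually destroyed.
        weakref.finalize(self, FakeComm._release)

    @staticmethod
    def _release():
        FakeComm.live -= 1


class FakeSpace:
    """Hypothetical stand-in for an object owning a FakeComm."""

    def __init__(self):
        self.comm = FakeComm()
        self.self_ref = self  # reference cycle: refcounting cannot free this


def amr_pass(n_levels):
    """Create one throwaway FakeSpace per refinement level."""
    for _ in range(n_levels):
        V = FakeSpace()
    del V  # drops the last name, but the cycle keeps each object alive


gc.disable()                 # make the demonstration deterministic
amr_pass(5)
live_before = FakeComm.live  # resources still held after the pass
gc.collect()                 # force collection of the unreachable cycles
live_after = FakeComm.live   # resources held after explicit collection
gc.enable()
```

If something similar happens in the real code, explicitly deleting the old function spaces and functions and calling `gc.collect()` after each AMR pass may be worth trying. Whether that is sufficient in this case is an assumption that would need to be checked.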
I have considered reducing the frequency of AMR calls to avoid hitting this problem, but that only postpones the error rather than solving it.
I am not sure whether my superficial understanding of the issue is correct. If there are better solutions or suggestions on how to handle this problem, I would be very grateful.