Okay, I’m almost certain I’ve tracked down the issue, and I believe it has to do with an MPI call in `GenericBoundingBoxTree::build()`. Here is the relevant output from my MWE above before it hangs when executed with `mpiexec -np 4`:
```
Process 2: Elapsed wall, usr, sys time: 1.92e-05, 0, 0 ([MixedAssembler] Assemble cells)
Process 1: Computed bounding box tree with 577 nodes for 289 entities.
Process 3: Computed bounding box tree with 37883 nodes for 18942 entities.
Process 0: Computed bounding box tree with 57537 nodes for 28769 entities.
```
Some observations:
- The code hangs but doesn’t segfault. This led me to think it could be an MPI communicator issue; it also seems to have something to do with `mesh.bounding_box_tree()`.
- It looks to me like processor 2 (which, in this case, has no vertices on `mesh0`) finishes its portion of assembly because `Assembler::assemble_cells()` returns early if there are no cells to integrate:

  ```cpp
  // Assembler.cpp line 112
  // Assembler::assemble_cells()
  // Skip assembly if there are no cell integrals
  if (!ufc.form.has_cell_integrals())
    return;
  ```
- Processes 0, 1, and 3 print the `"Computed bounding box..."` message (lines 106-108 of `GenericBoundingBoxTree.cpp`), but don’t print the final `"Computed global bounding box..."` message on lines 126-127. In between, on line 117, there is an MPI call:

  ```cpp
  // GenericBoundingBoxTree.cpp line 117
  // GenericBoundingBoxTree::build()
  MPI::all_gather(mesh.mpi_comm(), send_bbox, recv_bbox);
  ```
- I believe this MPI call is stuck waiting for processor 2 to participate, but since processor 2 skips `assemble_cells()`, it never reaches the call (I’m assuming the call to create the bounding box happens inside `assemble_cells()`). See the sketch right after this list.
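To illustrate why the other ranks hang rather than crash: MPI collectives like `all_gather` only return once every rank on the communicator has entered them. Below is a minimal standalone sketch (plain MPI, not dolfin code, with rank 2 hard-coded only to mirror the output above) reproducing the same pattern; run with `mpiexec -np 4` it hangs exactly like my MWE:

```cpp
// hang_demo.cpp -- illustrative only; not part of dolfin.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Mimic the suspected bug: one rank "has no cells", returns early from
  // assembly and therefore never reaches the collective call.
  if (rank == 2)
  {
    std::printf("Rank %d: skipping the collective (early return)\n", rank);
  }
  else
  {
    // The remaining ranks block here forever, waiting for rank 2.
    double send = static_cast<double>(rank);
    std::vector<double> recv(size);
    MPI_Allgather(&send, 1, MPI_DOUBLE, recv.data(), 1, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    std::printf("Rank %d: allgather completed\n", rank);
  }

  MPI_Finalize(); // never reached by ranks 0, 1 and 3 when run with -np 4
  return 0;
}
```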
The first thing that comes to mind is simply computing the bounding box tree before assembly, but this leads to a segfault. There is an earlier thread about `bounding_box_tree()` segfaulting when run on a MeshView submesh in parallel, and I can confirm this is still the case: running `mesh0.bounding_box_tree()` with my code above leads to a segfault when run in parallel with n > 4 (n = 4 seems to be the point on this mesh where the partitioning leads to at least one chunk being independent of the `mesh0` submesh).
Here are a couple solutions I thought of:
- Allow `BoundingBoxTree` to have a “null” case for when it doesn’t contain any entities, and build the bounding boxes at some earlier point in `Assembler::assemble()` (after confirming a bounding box should be created, but before calling `assemble_entity()`, which processors without that entity would skip).
- Change the `MPI::all_gather` on line 117 of `GenericBoundingBoxTree` to something like `gatherv`, explicitly stating how much data is coming in/out from each processor (a rough sketch of this is included after this list).
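For the second option, I haven’t worked out what it would look like inside dolfin’s MPI wrappers, but here is a rough standalone sketch of the explicitly-sized pattern using raw `MPI_Allgather`/`MPI_Allgatherv` (the names `send_bbox`/`recv_bbox` just echo the ones in `GenericBoundingBoxTree::build()`). A rank with no entities simply contributes a count of zero; of course this only helps if every rank actually reaches the call:

```cpp
// allgatherv_sketch.cpp -- illustrative only; not dolfin's MPI::all_gather.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Pretend each rank owns a different number of bounding-box coordinates;
  // a rank with no local entities contributes nothing (count = 0).
  std::vector<double> send_bbox(rank == 2 ? 0 : 6 * (rank + 1),
                                static_cast<double>(rank));

  // 1. Exchange the per-rank counts so everyone knows how much data to expect.
  int send_count = static_cast<int>(send_bbox.size());
  std::vector<int> recv_counts(size);
  MPI_Allgather(&send_count, 1, MPI_INT, recv_counts.data(), 1, MPI_INT,
                MPI_COMM_WORLD);

  // 2. Compute displacements and gather the variable-length data.
  std::vector<int> displs(size, 0);
  for (int i = 1; i < size; ++i)
    displs[i] = displs[i - 1] + recv_counts[i - 1];
  std::vector<double> recv_bbox(displs[size - 1] + recv_counts[size - 1]);

  MPI_Allgatherv(send_bbox.data(), send_count, MPI_DOUBLE,
                 recv_bbox.data(), recv_counts.data(), displs.data(),
                 MPI_DOUBLE, MPI_COMM_WORLD);

  std::printf("Rank %d gathered %zu values in total\n", rank, recv_bbox.size());

  MPI_Finalize();
  return 0;
}
```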
I haven’t tested it, but the `BoundingBoxTree::create_global_tree()` function on the dolfinx branch looks pretty similar, and it might run into the same issue.