Assembly on mixed domain hangs/segfaults in parallel - dependent on partitioning?

Okay, I’m almost certain I’ve tracked down the issue, and I believe it comes down to an MPI call in GenericBoundingBoxTree::build(). Here is the relevant output from my MWE above just before it hangs when executed with mpiexec -np 4:

Process 2: Elapsed wall, usr, sys time: 1.92e-05, 0, 0 ([MixedAssembler] Assemble cells)
Process 1: Computed bounding box tree with 577 nodes for 289 entities.
Process 3: Computed bounding box tree with 37883 nodes for 18942 entities.
Process 0: Computed bounding box tree with 57537 nodes for 28769 entities.

Some observations:

  • The code hangs but doesn’t segfault, which led me to suspect an MPI communicator issue; it also seems to be tied to mesh.bounding_box_tree().
  • It looks to me like processor 2 (which in this case has no vertices on mesh0) finishes its portion of assembly because Assembler::assemble_cells() returns early if there are no cells to integrate:
// Assembler.cpp line 112
// Assembler::assemble_cells()
// Skip assembly if there are no cell integrals
if (!ufc.form.has_cell_integrals())
  return;
  • Processes 0, 1, and 3 print the "Computed bounding box..." message (lines 106-108 of GenericBoundingBoxTree.cpp) but never print the final message on lines 126-127 ("Computed global bounding box..."). In between, on line 117, there is an MPI call:
// GenericBoundingBoxTree.cpp line 117
// GenericBoundingBoxTree::build() 
MPI::all_gather(mesh.mpi_comm(), send_bbox, recv_bbox);
  • I believe this is where the hang occurs. MPI::all_gather is a collective operation, so every rank on the communicator must reach it; processor 2 skips assemble_cells() entirely and never does (I’m assuming the call that builds the bounding box tree happens inside assemble_cells()), so the other three ranks block forever waiting for it. A standalone sketch of this mechanism follows this list.
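
As a sanity check on that hypothesis, here is a minimal standalone MPI program (nothing to do with DOLFIN) that reproduces the symptom: one rank returns early and the rest block forever in a collective, so the program hangs without segfaulting. Run with mpiexec -np 4; the buffer size assumes 4 ranks.

// Standalone illustration of the suspected deadlock (not DOLFIN code).
// A collective called by only a subset of ranks never completes.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Mimic Assembler::assemble_cells() returning early on the rank
  // that owns no cells of the submesh (rank 2 in my run).
  if (rank != 2)
  {
    double send_bbox[6] = {0, 0, 0, 1, 1, 1}; // local bounding box
    double recv_bbox[4 * 6];                  // one box per rank (assumes np=4)
    // Ranks 0, 1 and 3 enter the collective; rank 2 never does -> hang.
    MPI_Allgather(send_bbox, 6, MPI_DOUBLE, recv_bbox, 6, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    std::printf("Rank %d finished all_gather\n", rank);
  }

  MPI_Finalize();
  return 0;
}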

The first thing that comes to mind is simply computing the bounding box tree before assembly, but this leads to a segfault. There is an earlier thread about bounding_box_tree() segfaulting when run on a MeshView submesh in parallel, and I can confirm this is still the case.

Running mesh0.bounding_box_tree() with my code above leads to a segfault in parallel with n > 4 (n = 4 seems to be the point on this mesh where the partitioning first produces at least one chunk containing no part of the mesh0 submesh).

Here are a couple of solutions I thought of (sketches of both follow the list):

  1. Allow BoundingBoxTree to have a “null” case for when it contains no entities, and build the bounding boxes earlier in Assembler::assemble() (after confirming that a bounding box should be created, but before calling assemble_entity(), which processors without that entity would skip).

  2. Change the MPI::all_gather on line 117 of GenericBoundingBoxTree.cpp to something like an all-gatherv, explicitly stating how much data goes in and out on each processor, so that ranks with no entities can contribute zero boxes.
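
For option 1, a minimal sketch of what the “null” case could look like; num_local_entities, local_root_bbox, and bbox_or_null are hypothetical names I’ve made up, not DOLFIN API. The idea is that a rank with no entities still participates in the all_gather, contributing a marker box the global tree builder can discard:

// Sketch only - hypothetical names, not actual DOLFIN code
#include <cstddef>
#include <vector>

// Every rank would call this during build(), even ranks with no
// entities, so the collective on line 117 always has all participants.
std::vector<double> bbox_or_null(std::size_t num_local_entities,
                                 const std::vector<double>& local_root_bbox)
{
  if (num_local_entities == 0)
  {
    // "Null" case: an inverted box (min > max) that cannot intersect
    // anything; the global tree builder then discards it.
    return {1.0, 1.0, 1.0, -1.0, -1.0, -1.0};
  }
  return local_root_bbox; // hypothetical: root box of the local tree
}

Combined with building the tree at a point in Assembler::assemble() that every rank reaches, this should remove the dependence on the partitioning.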
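For option 2, a sketch of the gatherv idea using raw MPI calls (all_gather_bboxes is a made-up helper name): each rank first announces how many doubles it will contribute, possibly zero, and the variable-length gather then matches those counts exactly. Note that MPI_Allgatherv is still a collective, so this only helps if every rank actually reaches the call:

// Sketch only: variable-count gather of bounding boxes, where ranks
// with no entities pass an empty send_bbox and contribute zero doubles.
#include <mpi.h>
#include <vector>

std::vector<double> all_gather_bboxes(const std::vector<double>& send_bbox,
                                      MPI_Comm comm)
{
  int num_ranks;
  MPI_Comm_size(comm, &num_ranks);

  // Announce how many doubles each rank contributes (possibly 0)
  int nlocal = static_cast<int>(send_bbox.size());
  std::vector<int> counts(num_ranks), displs(num_ranks);
  MPI_Allgather(&nlocal, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

  int total = 0;
  for (int r = 0; r < num_ranks; ++r)
  {
    displs[r] = total;
    total += counts[r];
  }

  // Variable-length gather: empty ranks participate with count 0
  std::vector<double> recv_bbox(total);
  MPI_Allgatherv(send_bbox.data(), nlocal, MPI_DOUBLE,
                 recv_bbox.data(), counts.data(), displs.data(),
                 MPI_DOUBLE, comm);
  return recv_bbox;
}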

I haven’t tested it, but the BoundingBoxTree::create_global_tree() function on the dolfinx branch looks pretty similar, and it might run into the same issue.
