Assembly on mixed domain hangs/segfaults in parallel - dependent on partitioning?

Hello,

I posted yesterday regarding mixed assembly failing on a particular mesh I had. I originally thought it had something to do with normal orientations or degenerate elements, but after some more digging I realized the problem is most likely in how the mesh is partitioned when calling mpirun (I am using the latest Docker container [quay.io/fenicsproject/dev], which uses the SCOTCH partitioner).

I was able to replicate the assembly errors I was getting on my previous mesh using pure FEniCS code: take a unit cube mesh, stretch it a bit, and define a subvolume and a subsurface closer to one edge. When I run this code with mpirun -np n python3 code.py, n=2,3 runs fine, n=4,5 hangs until I force-quit, and n=6 segfaults.

MWE:

# mixed assembly mwe
import dolfin as d
# Create a mesh
mesh = d.UnitCubeMesh(40,40,40)
z_slice = 0.5 # value of z used to demarcate the different subvolumes
mesh.coordinates()[:,2] *= 4 # stretch in z; the resulting partitioning is what triggers the hang
mesh.init()

# define mesh functions
mf3 = d.MeshFunction("size_t", mesh, 3, 0)
mf2 = d.MeshFunction("size_t", mesh, 2, 0)
for c in d.cells(mesh):
    # mark cells with midpoint below z_slice as subvolume 1
    mf3[c] = (c.midpoint().z() < z_slice + d.DOLFIN_EPS) * 1

for f in d.facets(mesh):
    # mark facets lying on the plane z = z_slice as subsurface 1
    mf2[f] = (z_slice - d.DOLFIN_EPS <= f.midpoint().z() <= z_slice + d.DOLFIN_EPS) * 1

# Create submeshes using MeshView
mesh0 = d.MeshView.create(mf3, 1)  # subvolume
mesh_ = d.MeshView.create(mf2, 1)  # subsurface
# build mesh mapping
mesh_.build_mapping(mesh0)
# function space and function
V0 = d.FunctionSpace(mesh0, "CG", 1)
u0 = d.Function(V0)
# measures
dx = d.Measure('dx', domain=mesh)
dx0 = d.Measure('dx', domain=mesh0)
dx_ = d.Measure('dx', domain=mesh_)
# Give u some value
u0.assign(d.Constant(4))

# Show that the submeshes indeed are non-empty
print(d.assemble(1*dx))  # volume of full mesh = 4
print(d.assemble(1*dx0)) # 0.5
print(d.assemble(1*dx_)) # 1

# Try to assemble a function integrated over domain
print(d.assemble(u0*dx0)) # 4*0.5 = 2

# !!! THIS FAILS IN PARALLEL !!!
# for me, mpirun -np n python3 where n=[1,2,3] works fine, but n=[4,5] hangs and n=6 segfaults
print(d.assemble(u0*dx_)) # 4*1 = 4 

I've tried tracking down this issue in assembling.py, and the specific part that breaks is the actual assembly of the dolfin-wrapped form (i.e. assembler.assemble(tensor, dolfin_form)). I plotted the cell coloring by rank and the partitioning itself looks normal, which leads me to believe it fails when not every processor has a copy of the function being assembled. Below I show the coloring for n=3 (which works) and n=4 (which hangs). Notice that for n=3 all three colors touch our subsurface (the inner surface in between the two volumes), while for n=4 there is one color that does not touch the subsurface (cyan; apologies, it is a little difficult to see next to the green), and the assembly subsequently fails.

[Figure: partition coloring for n=3]

[Figure: partition coloring for n=4]
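
For reference, here is roughly how I generated the rank coloring shown above, appended to the MWE (a minimal sketch; the output filename and the .pvd format are just illustrative choices):

# color each cell of the parent mesh by the MPI rank that owns it,
# then write it out for visualization (e.g. in ParaView)
rank = d.MPI.rank(mesh.mpi_comm())
partition = d.MeshFunction("size_t", mesh, mesh.topology().dim(), 0)
partition.set_all(rank)  # every locally owned cell gets this process' rank
d.File("partition.pvd") << partition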

Any help would be really appreciated! Being able to solve mixed-dimensional systems in parallel is extremely important for some of our projects :). Thank you so much!

Okay, I'm almost certain I've tracked down the issue: I believe it has to do with an MPI call in GenericBoundingBoxTree::build(). Here is the relevant output from my MWE above, just before it hangs, when executing with mpiexec -np 4:

Process 2: Elapsed wall, usr, sys time: 1.92e-05, 0, 0 ([MixedAssembler] Assemble cells)
Process 1: Computed bounding box tree with 577 nodes for 289 entities.
Process 3: Computed bounding box tree with 37883 nodes for 18942 entities.
Process 0: Computed bounding box tree with 57537 nodes for 28769 entities.
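
(Side note, in case anyone wants to reproduce this output: the timing and bounding-box messages above only show up with a more verbose log level; something along these lines, assuming the standard dolfin logging interface, should print them on each process.)

import dolfin as d
# lower the log level so per-process DEBUG/PROGRESS messages
# (timings, bounding box tree construction, ...) are printed
d.set_log_level(d.LogLevel.DEBUG)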

Some observations:

  • The code hangs but doesn't segfault. This led me to think it could be an MPI communicator issue; it also seems to have something to do with mesh.bounding_box_tree().
  • It looks to me like processor 2 (which, in this case, has no vertices on mesh0) finishes its portion of the assembly early, because Assembler::assemble_cells() returns immediately if the form has no cell integrals:
// Assembler.cpp line 112
// Assembler::assemble_cells()
// Skip assembly if there are no cell integrals
if (!ufc.form.has_cell_integrals())
  return;
  • Processes 0, 1, and 3 print the "Computed bounding box..." message (lines 106-108 of GenericBoundingBoxTree.cpp) but never print the final message on lines 126-127 ("Computed global bounding box..."). In between, on line 117, there is an MPI call:
// GenericBoundingBoxTree.cpp line 117
// GenericBoundingBoxTree::build() 
MPI::all_gather(mesh.mpi_comm(), send_bbox, recv_bbox);
  • I believe this MPI call is stuck waiting for processor 2 to participate, but since processor 2 skips assemble_cells(), it never reaches the collective call (I'm assuming the call that creates the bounding box tree happens inside assemble_cells()). A toy sketch of this kind of deadlock follows after this list.
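
To illustrate the kind of deadlock I suspect, here is a toy mpi4py example (not dolfin code, purely an illustration): if one rank skips a collective call such as allgather, all the other ranks block inside it forever.

# toy illustration of the suspected deadlock (run e.g. with mpirun -np 4)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 2:
    # analogous to the process that returns early from assemble_cells()
    print("Process %d: skipping the collective call" % rank)
else:
    print("Process %d: entering allgather..." % rank)
    data = comm.allgather(rank)  # hangs: rank 2 never participates
    print("Process %d: gathered %s" % (rank, data))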

The first thing that comes to mind is simply computing the bounding box tree before assembly, but that leads to a segfault. There is an earlier thread about bounding_box_tree() segfaulting when run on a MeshView submesh in parallel, and I can confirm this is still the case.

Running mesh0.bounding_box_tree() with my code above leads to a segfault when run in parallel with n>4 (n=4 seems to be the point on this mesh where the partitioning leaves at least one chunk that does not intersect the mesh0 submesh).
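
Concretely, the workaround I tried was just adding something like this to the MWE right after building the submeshes (the exact placement is not important):

# attempted workaround: build the bounding box trees explicitly before assembly
mesh.bounding_box_tree()    # parent mesh: fine
mesh0.bounding_box_tree()   # MeshView submesh: segfaults in parallel once some
                            # process owns no part of the submesh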

Here are a couple solutions I thought of:

  1. Allow BoundingBoxTree to have a "null" case for when it contains no entities, and build the bounding boxes at some earlier point in Assembler::assemble() (after confirming that a bounding box should be created, but before calling assemble_entity(), which processors without that entity would skip).

  2. Change the MPI::all_gather on line 117 of GenericBoundingBoxTree.cpp to something like an all_gatherv, explicitly stating how much data is sent and received by each processor (see the toy sketch after this list).
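
Here is a toy sketch of what I mean by option 2, written with mpi4py rather than dolfin's C++ MPI wrappers, purely as an illustration: a variable-size all-gather where each rank explicitly announces its own count, so a rank that owns no entities simply contributes zero items (it still has to enter the call, of course).

# toy sketch of option 2 (run e.g. with mpirun -np 4)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# pretend rank 2 owns no entities (like a process with no cells on the submesh)
local = np.full(rank + 1, float(rank)) if rank != 2 else np.empty(0)

# first agree on how much each rank sends, then gather the variable-size data
counts = np.array(comm.allgather(local.size))
recv = np.empty(counts.sum())
comm.Allgatherv(local, (recv, counts))

print("Process %d: counts = %s, gathered = %s" % (rank, counts.tolist(), recv.tolist()))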

I haven't tested it, but the BoundingBoxTree::create_global_tree() function on the dolfinx branch looks quite similar, so it might run into the same issue.

The issue is addressed in: Bitbucket