Cumulative failures after many MPC construction + SLEPc solve cycles in one process (dolfinx_mpc 0.8.0, DOLFINx 0.8.0, complex PETSc)

I’m computing elastic Bloch band structures using dolfinx_mpc.MultiPointConstraint for periodic boundary conditions. Each k-point builds an MPC with the appropriate Bloch phase, assembles K and M, and solves a generalized Hermitian eigenvalue problem via SLEPc with shift-invert + MUMPS.
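For reference, the Bloch phase applied as the MPC scale factor is just exp(i k·a) for each lattice vector a. A quick sanity check of the phases for a unit-square cell (illustrative values, not the production mesh):

```python
import numpy as np

# Lattice vectors of a unit square cell (illustrative)
a1 = np.array([1.0, 0.0])
a2 = np.array([0.0, 1.0])
k = np.array([np.pi / 3, 0.0])  # sample Bloch wavevector

# Bloch phase for each periodic constraint: exp(i k . a)
phase_a1 = np.exp(1j * (k @ a1))  # exp(i*pi/3)
phase_a2 = np.exp(1j * (k @ a2))  # 1.0 for this k, since k_y = 0
```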

After many k-point cycles in a single process (anywhere from ~164 to ~660 depending on configuration), the process crashes. The MPC physics is correct on individual solves — I’ve verified zero-mode counts, Hermiticity, and spectral structure. But three different reproducer configurations give three different failure signatures, all consistent with cumulative resource accumulation at the MPC+SLEPc library boundary. I’d appreciate guidance on the recommended usage pattern.

Environment

DOLFINx:     0.8.0
dolfinx_mpc: 0.8.0.post1 (built from source against above)
petsc4py:    3.21.0
PETSc:       complex128, 32-bit indices (linux-gnu-complex128-32)
Python:      3.10
Base image:  dolfinx/dolfinx:v0.8.0
OS:          Ubuntu 22.04 (container)

Three observed failure modes

Mode 1 (production code): SIGSEGV / free(): invalid pointer at cumulative solve ~164. The backtrace is inside SLEPc.EPS.solve; this is C-level memory corruption, so it bypasses Python exception handling entirely.

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation
free(): invalid pointer
Abort(59) on node 0 (rank 0 in comm 0)

Mode 2: Minimal reproducer — three sequential create_periodic_constraint_geometrical calls per iteration, SMALLEST_MAGNITUDE eigensolve (no factorization). Crashes at iter ~221 (~663 MPC init calls) with MPI communicator exhaustion:

MPIR_Get_contextid_sparse_group(591): Too many communicators
    (0/2048 free on this process; ignore_id=0)

Mode 3: Same reproducer as Mode 2 but with the production SLEPc config (shift-invert + KSP/PC=preonly/lu + MUMPS, target=0). Crashes at iter 225 with MUMPS analysis workspace exhausted:

[0] MatLUFactorSymbolic_AIJMUMPS() at mumps.c:2358
[0] MUMPS error in analysis: INFOG(1)=-13

Setting mat_mumps_icntl_14=200 (via either code-level setValue or env PETSC_OPTIONS) has no effect — the option appears on PETSc’s “unused” list despite MUMPS clearly running.
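One thing I have not been able to confirm: SLEPc's ST attaches the st_ options prefix to its inner KSP/PC, so options addressed to the factored matrix may need that prefix as well. If that is the case here, the option would be spelled:

```
-st_mat_mumps_icntl_14 200
```

(or equivalently `opts["st_mat_mumps_icntl_14"] = 200` via `PETSc.Options()`), which would also explain why the unprefixed form lands on the "unused" list.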

What I already tried

Explicit cleanup between iterations:

python

eps_solver.destroy()
K.destroy()
M.destroy()
del mpc
gc.collect()

None of this prevents any of the three failure modes.
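One mitigation I'm experimenting with is bounding the accumulation by running the sweep in chunks, each in a short-lived worker process, so any C-level resources leaked per k-point die with the worker. A minimal sketch (`run_chunk`'s squaring is a placeholder for `build_mpc_and_solve`; the fork start method is assumed, since spawn would need a `__main__` guard):

```python
import multiprocessing as mp

def run_chunk(indices):
    # In the real code this would import dolfinx/dolfinx_mpc/slepc4py here
    # and call build_mpc_and_solve(k_vec) for each k-point in the chunk.
    return [i * i for i in indices]  # placeholder work

def sweep_in_chunks(n_total, chunk_size):
    # Each chunk runs in a fresh process; communicators or factorization
    # workspace leaked at the C/Fortran level are reclaimed on worker exit.
    ctx = mp.get_context("fork")
    results = []
    for start in range(0, n_total, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n_total)))
        with ctx.Pool(processes=1) as pool:
            results.extend(pool.apply(run_chunk, (chunk,)))
    return results
```

The obvious caveat is that this does not compose with an outer MPI communicator, so it only applies to serial sweeps like the reproducer below.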

What I’ve ruled out for Mode 1

| Hypothesis | Status | Evidence |
| --- | --- | --- |
| PETSc options database pollution | Ruled out | Option count stable at 13 across 220 solves |
| MUMPS workspace (Mode 1 specifically) | Ruled out | Mode 1 backtrace is inside EPS.solve, not MUMPS analysis |
| Python-level PETSc object leak | Ruled out | gc.get_objects() count stable at 3151 across 210 solves |
| pytest fixture interaction | Ruled out | Reproduces identically outside pytest |
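For what it's worth, the Python-object counts above came from a helper roughly like this (a sketch; filtering by type module prefix is how I separated petsc4py-typed objects from everything else):

```python
import gc

def count_objects_by_module(prefix):
    """Count live Python objects whose type is defined in a module
    whose name starts with `prefix` (e.g. "petsc4py")."""
    count = 0
    for obj in gc.get_objects():
        mod = getattr(type(obj), "__module__", None) or ""
        if mod.startswith(prefix):
            count += 1
    return count
```

Logging this each iteration showed no growth at the Python level, which is what points the accumulation below the Python layer.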

Modes 2 and 3 confirm that resource accumulation is happening at the C/Fortran level; their library-level error signatures simply surface before the lower-level corruption threshold that produces Mode 1's SIGSEGV.

Minimal reproducer (Mode 3)

python

import numpy as np
import gc
from mpi4py import MPI
from dolfinx import fem, mesh
from dolfinx_mpc import MultiPointConstraint, assemble_matrix
from petsc4py import PETSc
from slepc4py import SLEPc
import ufl

comm = MPI.COMM_WORLD
msh = mesh.create_unit_square(comm, 8, 8, mesh.CellType.triangle)
V = fem.functionspace(msh, ("Lagrange", 2, (2,)))

E, nu, rho = 2.5e9, 0.34, 1180.0
mu = E / (2 * (1 + nu))
lam = E * nu / ((1 + nu) * (1 - 2 * nu))

def eps_sym(u): return ufl.sym(ufl.grad(u))
def sigma(u): return lam * ufl.tr(eps_sym(u)) * ufl.Identity(2) + 2 * mu * eps_sym(u)

u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
k_form = fem.form(ufl.inner(sigma(u), eps_sym(v)) * ufl.dx)
m_form = fem.form(rho * ufl.inner(u, v) * ufl.dx)

# Production SLEPc options (shift-invert + MUMPS)
opts = PETSc.Options()
opts["eps_type"] = "krylovschur"
opts["eps_target"] = 0.0
opts["st_type"] = "sinvert"
opts["st_shift"] = 0.0
opts["st_ksp_type"] = "preonly"
opts["st_pc_type"] = "lu"
opts["st_pc_factor_mat_solver_type"] = "mumps"
opts["mat_mumps_icntl_14"] = 200

def build_mpc_and_solve(k_vec):
    mpc = MultiPointConstraint(V)
    
    # Three shifts matching supercell pattern (third is dummy no-op here)
    def reloc_a1(x): return np.array([x[0] - 1.0, x[1], x[2]])
    def ind_a1(x): return np.isclose(x[0], 1.0) & (x[1] > 1e-10) & (x[1] < 1 - 1e-10)
    def reloc_a2(x): return np.array([x[0], x[1] - 1.0, x[2]])
    def ind_a2(x): return np.isclose(x[1], 1.0) & (x[0] > 1e-10) & (x[0] < 1 - 1e-10)
    def reloc_a3(x): return np.array([x[0] + 1.0, x[1] - 1.0, x[2]])
    def ind_a3(x): return np.zeros(x.shape[1], dtype=bool)
    
    phase = np.exp(1j * k_vec[0])
    mpc.create_periodic_constraint_geometrical(V, ind_a1, reloc_a1, [], phase)
    mpc.create_periodic_constraint_geometrical(V, ind_a2, reloc_a2, [], phase)
    mpc.create_periodic_constraint_geometrical(V, ind_a3, reloc_a3, [], phase)
    mpc.finalize()
    
    K = assemble_matrix(k_form, mpc)
    M = assemble_matrix(m_form, mpc)
    K.assemble(); M.assemble()
    
    eps_solver = SLEPc.EPS().create(comm)
    eps_solver.setOperators(K, M)
    eps_solver.setProblemType(SLEPc.EPS.ProblemType.GHEP)
    eps_solver.setDimensions(nev=12)
    eps_solver.setFromOptions()
    eps_solver.solve()
    
    nconv = eps_solver.getConverged()
    eigvals = [eps_solver.getEigenvalue(i) for i in range(min(nconv, 12))]
    
    eps_solver.destroy()
    K.destroy()
    M.destroy()
    del mpc
    gc.collect()
    
    return nconv, eigvals

for i in range(250):
    k_vec = np.array([0.1 * (i % 60), 0.0])
    nconv, eigvals = build_mpc_and_solve(k_vec)
    if i % 10 == 0:
        ev0 = eigvals[0] if eigvals else "none"
        print(f"iter {i}: nconv={nconv}, first_ev={ev0}", flush=True)

print("Completed 250 iterations — no crash")

This crashes at iter ~225 with the Mode 3 signature. Mode 2 is the same script with eps_solver.setWhichEigenpairs(SLEPc.EPS.Which.SMALLEST_MAGNITUDE) in place of the shift-invert options block; it crashes slightly earlier (iter ~221, despite doing no factorization work) with the MPI communicator signature.

Questions

  1. Is there a recommended usage pattern for long k-point sweeps? Specifically:
  • Should MultiPointConstraint be built once per mesh and reused across Bloch phases, rather than rebuilt per k-point? If so, does dolfinx_mpc support changing the Bloch phase after finalize()?
  • Is there a required cleanup sequence beyond destroy() / del / gc.collect() that I’m missing?
  2. Is the MPI communicator leak in Mode 2 a known issue in create_periodic_constraint_geometrical? At ~3 communicators per MPC construction, 663 constructions gives ~1989, close to the 2048 context-id limit.
  3. Could Mode 1’s SIGSEGV be the same communicator leak hitting sooner, with SLEPc’s internal communicators in Mode 1 bringing the process to the 2048 limit before MPC init alone would reach it?
  4. Why is the MUMPS icntl_14=200 option not propagating through my setup? I set it via both PETSc.Options() and the PETSC_OPTIONS env var; it appears on PETSc’s “unused database options” list at exit, yet MUMPS runs and eventually fails in analysis with a workspace error (INFOG(1)=-13).
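The back-of-envelope arithmetic behind the communicator-exhaustion hypothesis (the per-construction communicator count is my assumption, not a measured number):

```python
constraints_per_iter = 3   # create_periodic_constraint_geometrical calls per k-point
crash_iter = 221           # Mode 2 crash point
comms_per_constraint = 3   # assumed communicator dups per constraint (hypothesis)

total_comms = constraints_per_iter * crash_iter * comms_per_constraint
print(total_comms)  # 1989, just under MPICH's default 2048 context-id limit
```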

Happy to share the full production reproducer script (requires a Kekulé hexagonal supercell mesh, ~1000 DOFs) if useful. Any direction appreciated.

Update: this has since been fixed in later versions of dolfinx_mpc (0.10.5, for instance) with PR: