Discontinuities at MPI partition boundaries with manual periodic MPC on large high-order curved mesh

Hi,

I’m running a Poisson problem with periodic BCs in two directions on a high-order Lagrange space, and I’m seeing discontinuities (“lightning”) in the solution that align with MPI partition boundaries. Single-rank produces a smooth solution; running with 4 or 16 ranks introduces visible jumps. I’ve isolated the problem extensively and ruled out the standard suspects. A minimal reproducer (attached) on a unit cube does not exhibit the issue, so the question is what additional aspect of my real setup triggers it.

Setup

  • dolfinx 0.10, dolfinx_mpc 0.10.0 (contains PR #155 backsubstitution fix)
  • P4 Lagrange on a curved hex27 mesh from a turbine cascade CFD (PyFR), ~17M DOFs
  • Periodic in two translation directions (pitch + span)
  • Inlet/outlet: homogeneous Neumann
  • Single Dirichlet pin on a non-slave DOF to break the constant null space
  • Solver: dolfinx_mpc.LinearProblem with CG + bjacobi + icc

MPC construction (manual)

# slaves: locally owned dofs on x_max and y_max periodic boundaries
# masters: collected via allgather of all owned dofs near x_min OR y_min,
#          matched per-translation with KDTree
slaves_arr   = np.array(slaves_list, dtype=np.int32)   # local indices
masters_arr  = np.array(masters_list, dtype=np.int64)  # global indices
owners_arr   = np.array(owners_list, dtype=np.int32)
coeffs_arr   = np.ones(len(slaves_arr), dtype=np.float64)
offsets_arr  = np.arange(len(slaves_arr) + 1, dtype=np.int32)

mpc = dolfinx_mpc.MultiPointConstraint(V)
mpc.add_constraint(V, slaves_arr, masters_arr, coeffs_arr, owners_arr, offsets_arr)
mpc.finalize()

Master/slave detection reports max distance = 2.5e-14, 0 misses. Total 198,711 slaves after finalize.

Solve sequence (per dokken’s recommendation from thread 13241)

psi_h = problem.solve()
psi_h.x.petsc_vec.ghostUpdate(addv=PETSc.InsertMode.INSERT, mode=PETSc.ScatterMode.FORWARD)
mpc.backsubstitution(psi_h)

What I’ve verified

All checks below pass on every rank in parallel:

  1. psi_h.function_space is mpc.function_space returns True.

  2. For every local slave (all 198,711 across ranks), I compute

    max_i | psi_h[slave_i] - sum_j coeff_ij * psi_h[master_ij] |
    

    Result: 0.000e+00 on every rank. Backsubstitution is bit-exact.

  3. For 1000 ghost DOFs per rank, I compare the ghost value to the corresponding owned value on the owning rank (looked up via KDTree on allgathered owned coordinates). Result: 0.000e+00 on every rank. Ghost synchronization is bit-exact.

  4. Writing a Function(V) filled with 1.0 to a VTKFile and viewing in ParaView shows a perfectly smooth constant field — no partition-boundary artifacts in the output pipeline.

  5. Interpolating psi_h to a P1 space and writing that also shows the same discontinuities, ruling out high-order Lagrange visualization artifacts.

  6. The source term (interpolated from CFD data) is smooth in the visualization — no discontinuities. The discontinuities only appear in psi_h.

  7. The matrix diagonal is checked: slave rows have diagonal 1.000e+00 ± 1e-3 on all ranks. Non-slave diagonals vary across ranks but only in proportion to local cell size (METIS variation), which is expected.

  8. The Neumann/pin compatibility condition is enforced: ∫ f dV is computed before solve and the mean is subtracted from the source to satisfy ∫ f dV = 0 exactly. The discontinuities persist.

  9. MWE on a unit cube hex mesh with the same MPC construction logic (attached: mwe.py) produces a smooth solution at any rank count. The bug only appears in the production setup.

Symptom

The discontinuities (“lightning”) in psi_h:

  • Align with MPI partition boundaries (verified by overlaying a DG-0 rank field)
  • Are at magnitudes ~1e-4 in a field where typical values are ~1e-4 to ~1e-3 (i.e. they’re significant)
  • Become fewer/less visible with fewer ranks (n=4 vs n=16)
  • Were not present in the previous setup (Dirichlet at inlet/outlet, no MPC)

Question

Given the MWE does NOT reproduce the issue, but my production setup does, the key differences between them are:

  • Curved hex27 mesh (P2 coordinate element) from CFD data, vs. flat hex8 in the MWE
  • 17M DOFs at P4, vs. ~10k in the MWE
  • METIS partitioning of a thin, anisotropic domain (turbine blade passage), vs. cube
  • Edge DOFs: 519 DOFs lie on the intersection of pitch and span periodic boundaries (corner edges). The MWE has only a few. I tried registering these explicitly with a diagonal translation master (instead of relying on a chain through one of the single-translation masters), but this did not change the symptom.

Could anything in this list specifically interact with dolfinx_mpc in a way that breaks parallel consistency despite backsubstitution and ghost sync both being bit-exact? Or are there additional collective consistency checks I should be doing on the MultiPointConstraint itself before solving?

MWE

[Attach mwe.py — runs in seconds on a unit cube, smooth at all rank counts]

mpirun -n 4 stdbuf -oL python -u mwe.py

Any pointers welcome — happy to add additional diagnostics or share more of the production code.

Thanks!

"""
MWE: dolfinx_mpc parallel periodicity issue.

Setup:
- Unit cube, hex mesh, P4 Lagrange (high-order matters)
- Periodic in x and y (two directions, like turbine cascade pitch + span)
- Neumann at z=0 and z=1, single Dirichlet pin to remove null space
- Periodicity via MANUAL mpc.add_constraint(...) with KDTree-based master/slave
  detection (not create_periodic_constraint_topological)
- dolfinx 0.10.x, dolfinx_mpc 0.10.0

Observation:
- Single-rank run produces smooth solution
- Parallel run (n=4, n=16) shows visible discontinuities ("lightning") in psi
  that follow MPI partition boundaries when visualized in ParaView
- MPC backsubstitution verification passes:
    max |slave - sum(coeff*master)| ~= 0 (machine precision)
  so backsubstitution itself is not the issue
- The function space of the solution is mpc.function_space (correct)
- ghostUpdate + backsubstitution are called explicitly as per Dokken's
  recommendation

Question: why do the partition-boundary discontinuities appear even though the
periodicity constraints are perfectly satisfied?

Run with:
    mpirun -n 1 python mwe.py        # smooth, reference
    mpirun -n 4 python mwe.py        # discontinuities at partitions
"""
import numpy as np
from mpi4py import MPI
from petsc4py import PETSc
from scipy.spatial import KDTree
import ufl

from dolfinx import fem, mesh
from dolfinx.mesh import locate_entities_boundary
from dolfinx.io import VTKFile
import dolfinx_mpc

comm = MPI.COMM_WORLD
rank = comm.rank

# 1. mesh
N_xy, N_z = 16, 4
msh = mesh.create_box(
    comm, [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], [N_xy, N_xy, N_z],
    mesh.CellType.hexahedron,
)
tdim = msh.topology.dim
fdim = tdim - 1
msh.topology.create_connectivity(fdim, tdim)
msh.topology.create_connectivity(tdim, fdim)

# 2. function space
ORDER = 4
V = fem.functionspace(msh, ("Lagrange", ORDER))
if rank == 0:
    print(f"DOFs: {V.dofmap.index_map.size_global}", flush=True)
comm.Barrier()
if rank == 0: print("CP1: function space ready", flush=True)

# 3. periodic MPC setup (manual, KDTree-based)
def facets_at(axis, val):
    return locate_entities_boundary(msh, fdim, lambda x: np.isclose(x[axis], val))

pitch_slave_facets = facets_at(0, 1.0)  # x = 1 -> x = 0
span_slave_facets  = facets_at(1, 1.0)  # y = 1 -> y = 0

pitch_translation = np.array([1.0, 0.0, 0.0])
span_translation  = np.array([0.0, 1.0, 0.0])

dof_coords = V.tabulate_dof_coordinates()
imap = V.dofmap.index_map
n_local = imap.size_local
local_range_start = imap.local_range[0]

# slave dofs, locally owned only
pitch_slave_dofs = fem.locate_dofs_topological(V, fdim, pitch_slave_facets)
pitch_slave_dofs = pitch_slave_dofs[pitch_slave_dofs < n_local]

span_slave_dofs = fem.locate_dofs_topological(V, fdim, span_slave_facets)
span_slave_dofs = span_slave_dofs[span_slave_dofs < n_local]
span_slave_dofs = np.setdiff1d(span_slave_dofs, pitch_slave_dofs)

# gather all candidate master dofs (owned, near x=0 OR y=0) from all ranks
owned_coords = dof_coords[:n_local]
mask = (np.isclose(owned_coords[:, 0], 0.0, atol=1e-4)
        | np.isclose(owned_coords[:, 1], 0.0, atol=1e-4))

local_master_coords  = np.ascontiguousarray(owned_coords[mask])
local_master_globals = np.arange(n_local, dtype=np.int64)[mask] + local_range_start
local_master_owners  = np.full(mask.sum(), rank, dtype=np.int32)

if rank == 0: print(f"CP2: rank {rank} has {len(local_master_coords)} local masters, "
                    f"{len(pitch_slave_dofs)} pitch slaves, {len(span_slave_dofs)} span slaves", flush=True)
comm.Barrier()

all_master_coords  = np.vstack(comm.allgather(local_master_coords))
all_master_globals = np.concatenate(comm.allgather(local_master_globals))
all_master_owners  = np.concatenate(comm.allgather(local_master_owners))

comm.Barrier()
if rank == 0: print(f"CP3: allgather done, {len(all_master_coords)} total masters", flush=True)

master_tree = KDTree(all_master_coords)
comm.Barrier()
if rank == 0: print("CP4: kdtree built", flush=True)

slaves_list, masters_list, owners_list = [], [], []

for sl in pitch_slave_dofs:
    sc = dof_coords[sl]
    mc = sc - pitch_translation
    d, idx = master_tree.query(mc)
    if d > 1e-8:
        continue
    slaves_list.append(int(sl))
    masters_list.append(int(all_master_globals[idx]))
    owners_list.append(int(all_master_owners[idx]))

for sl in span_slave_dofs:
    sc = dof_coords[sl]
    mc = sc - span_translation
    d, idx = master_tree.query(mc)
    if d > 1e-8:
        continue
    slaves_list.append(int(sl))
    masters_list.append(int(all_master_globals[idx]))
    owners_list.append(int(all_master_owners[idx]))

slaves_arr  = np.array(slaves_list,  dtype=np.int32)
masters_arr = np.array(masters_list, dtype=np.int64)
owners_arr  = np.array(owners_list,  dtype=np.int32)
coeffs_arr  = np.ones(len(slaves_arr), dtype=np.float64)
offsets_arr = np.arange(len(slaves_arr) + 1, dtype=np.int32)

comm.Barrier()
if rank == 0: print(f"CP5: matching done, rank {rank} has {len(slaves_arr)} slave-master pairs", flush=True)

mpc = dolfinx_mpc.MultiPointConstraint(V)
comm.Barrier()
if rank == 0: print("CP6: mpc object created", flush=True)

mpc.add_constraint(V, slaves_arr, masters_arr, coeffs_arr, owners_arr, offsets_arr)
comm.Barrier()
if rank == 0: print("CP7: add_constraint done", flush=True)

mpc.finalize()
comm.Barrier()
if rank == 0: print("CP8: finalize done", flush=True)

if rank == 0:
    n_total = comm.allreduce(mpc.num_local_slaves, op=MPI.SUM)
    print(f"total MPC slaves: {n_total}", flush=True)

# 4. single pin on a non-slave dof (rank 0 only)
if rank == 0:
    slave_set = set(mpc.slaves.tolist())
    candidates = [d for d in range(n_local) if d not in slave_set]
    pin_dof = candidates[len(candidates) // 2]
    pinned = np.array([pin_dof], dtype=np.int32)
else:
    pinned = np.array([], dtype=np.int32)
bc_pin = fem.dirichletbc(np.float64(0.0), pinned, V)

# 5. Poisson with periodic source
u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
x = ufl.SpatialCoordinate(msh)
f_src = ufl.sin(2 * ufl.pi * x[0]) * ufl.cos(2 * ufl.pi * x[1])

a = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx
L = f_src * v * ufl.dx

problem = dolfinx_mpc.LinearProblem(
    a, L, mpc, bcs=[bc_pin],
    petsc_options={"ksp_type": "cg", "pc_type": "bjacobi",
                   "sub_pc_type": "icc", "ksp_rtol": 1e-10,
                   "ksp_converged_reason": None},
    petsc_options_prefix="mwe",
)
psi = problem.solve()
psi.x.petsc_vec.ghostUpdate(addv=PETSc.InsertMode.INSERT,
                             mode=PETSc.ScatterMode.FORWARD)
mpc.backsubstitution(psi)

# 6. verify backsubstitution: slave == sum(coeff * master) ?
slaves = mpc.slaves
masters_adj = mpc.masters
coeffs_full, offsets_full = mpc.coefficients()

max_err = 0.0
for i in range(mpc.num_local_slaves):
    islv = int(slaves[i])
    mlist = masters_adj.links(islv)
    clist = coeffs_full[offsets_full[islv]:offsets_full[islv + 1]]
    if len(mlist) > 0:
        expected = sum(c * psi.x.array[int(m)] for c, m in zip(clist, mlist))
        max_err = max(max_err, abs(psi.x.array[islv] - expected))

max_err_g = comm.allreduce(max_err, op=MPI.MAX)
if rank == 0:
    print(f"MPC verification: max |slave - sum(coeff*master)| = {max_err_g:.3e}",
          flush=True)
    print("  (~0 => backsubstitution is correct)", flush=True)

# 7. output: psi + rank field for diagnostic
V_dg = fem.functionspace(msh, ("DG", 0))
rank_field = fem.Function(V_dg, name="rank")
rank_field.x.array[:] = float(rank)

psi.name = "psi"
with VTKFile(comm, "mwe_psi.pvd", "w") as vtk:
    vtk.write_function(psi, 0.0)
with VTKFile(comm, "mwe_rank.pvd", "w") as vtk:
    vtk.write_function(rank_field, 0.0)

if rank == 0:
    print("\nDone. Open mwe_psi.pvd and mwe_rank.pvd in ParaView.", flush=True)
    print("Expected: smooth psi. Observed (parallel): discontinuities at "
          "partition boundaries.", flush=True)