PETSc error when running with MPI

Hello FEniCS community, I am getting PETSc errors when running Python files with mpiexec in the dolfinx/dolfinx Docker environment on a 112-core workstation.

For a problem with a high number of DOFs (a 100x100x100 mesh for this MWE's variational problem),
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [100, 100, 100])
my code runs fine with mpiexec -n 8 python3 mwe.py, but crashes for
mpiexec -n 16 python3 mwe.py, with the error:
[11]PETSC ERROR: ------------------------------------------------------------------------
[11]PETSC ERROR: Caught signal number 7 BUS: Bus Error, possibly illegal memory access
[11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[11]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[11]PETSC ERROR: to get more information on the crash.
[11]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 11

For a small problem, say
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [3, 3, 3])
the process count can go higher than for the large problem: mpiexec with up to 22 processes works fine, but it fails at
root@43279990c8d7:~/shared/dolfinx# mpiexec -n 23 python3 mwe.py
with the error:
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[10]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[10]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[10]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[10]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[10]PETSC ERROR: to get more information on the crash.
[10]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10

I’m really new to MPI, and I think the suggestions in the error log are aimed at PETSc rather than at the MPI command? Could someone give me a hint on where things have gone wrong? So far I suspect it is either a Docker problem (but I put no constraints on Docker, so it should have access to all cores)
or a PETSc problem (see Frequently Asked Questions (FAQ) — PETSc 3.15.3 documentation), but here the issue should not be an excessively large matrix.

The code is run on a workstation with 4 Intel Xeon 8280M CPUs, each with 28 cores and 56 threads.
MWE:

```python
from mpi4py import MPI
import numpy as np

from dolfinx import (Function, FunctionSpace, BoxMesh)
from dolfinx.fem import LinearProblem
from dolfinx.io import XDMFFile
from ufl import FiniteElement, TestFunction, TrialFunction, dx, inner

mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])
elem_type = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
V = FunctionSpace(mesh, elem_type)

# Right-hand side function to project
f = Function(V)
f.interpolate(lambda x: x[0]*x[0] + x[1]*x[1])

# L2 projection of f onto V
u, v = TrialFunction(V), TestFunction(V)
a = inner(u, v)*dx
L = inner(f, v)*dx

problem = LinearProblem(a, L, petsc_options={"ksp_type": "preonly", "pc_type": "lu",
                                             "pc_factor_mat_solver_type": "mumps"})
uh = problem.solve()
```
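
To gauge how large the problem actually is on each rank (relevant to the memory question below), something like the following could be appended to the MWE. This is only a sketch; it assumes the dolfinx index map attributes `size_local` and `size_global` on `V.dofmap.index_map`.

```python
# Sketch: report how many DOFs each rank owns after partitioning
# (assumes V.dofmap.index_map exposes size_local / size_global).
num_local_dofs = V.dofmap.index_map.size_local
num_global_dofs = V.dofmap.index_map.size_global
print(f"rank {MPI.COMM_WORLD.rank}: owns {num_local_dofs} of {num_global_dofs} DOFs")
```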

Alternatively, does FEniCS/FEniCSx support switching between single (float) and double machine precision, so as to save some memory?

I can reproduce this with dolfinx in Docker on my computer, and an issue has been posted at:

See the last question at Frequently Asked Questions - Mpich. It’s likely that you’re running out of memory in the container.
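
If that is the suspicion, a minimal sketch for checking it from inside the container — assuming a Linux image that exposes `/proc/meminfo` and that `ru_maxrss` is reported in kB — could be run on each rank:

```python
# Sketch: print available memory and this process's peak RSS per rank
# (assumes Linux /proc/meminfo and ru_maxrss reported in kB).
import resource
from mpi4py import MPI

with open("/proc/meminfo") as fh:
    meminfo = dict(line.split(":", 1) for line in fh)
avail_mb = int(meminfo["MemAvailable"].split()[0]) // 1024
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024
print(f"rank {MPI.COMM_WORLD.rank}: MemAvailable {avail_mb} MB, peak RSS {peak_rss_mb} MB")
```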

The failure for the small mesh is almost certainly a different issue; it’s likely that some ranks have no cells, which is not supported yet.
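
For the small mesh, one hedged way to confirm this — assuming `mesh.topology.index_map` is available — is to check whether any rank ends up owning zero cells:

```python
# Sketch: flag ranks that own no cells after partitioning
# (assumes mesh.topology.index_map(dim).size_local).
tdim = mesh.topology.dim
if mesh.topology.index_map(tdim).size_local == 0:
    print(f"rank {MPI.COMM_WORLD.rank} owns no cells")
```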

Thanks for your kind replies! I will check the container memory.

Garth’s point is true for the small-mesh failure. I can run with more cores by changing the MPI.COMM_WORLD parameter to MPI.COMM_SELF (Implementation — FEniCSx tutorial).
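
For reference, a sketch of that change in the MWE (each rank then builds and solves its own full serial copy of the problem, so no rank can end up without cells):

```python
# Sketch: give every rank its own serial copy of the mesh
mesh = BoxMesh(MPI.COMM_SELF, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])
```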