Hello FEniCS community, I get a PETSc error when running Python files with mpiexec in the dolfinx/dolfinx Docker environment on a 112-core workstation.
For a problem with many DOFs (a 100x100x100 mesh for this MWE's variational problem),
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [100, 100, 100])
my code runs fine with
mpiexec -n 8 python3 mwe.py
but crashes with
mpiexec -n 16 python3 mwe.py
giving the error:
[11]PETSC ERROR: ------------------------------------------------------------------------
[11]PETSC ERROR: Caught signal number 7 BUS: Bus Error, possibly illegal memory access
[11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[11]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[11]PETSC ERROR: to get more information on the crash.
[11]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 11
For a small problem, say
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [3, 3, 3])
the number of processes can go higher than for the large problem: mpiexec works fine with up to 22 processes, but fails with
root@43279990c8d7:~/shared/dolfinx# mpiexec -n 23 python3 mwe.py
with error:
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[10]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[10]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[10]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[10]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[10]PETSC ERROR: to get more information on the crash.
[10]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10
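For scale: a first-order Lagrange space on a BoxMesh has one DOF per mesh vertex, so an n x n x n box gives (n+1)^3 DOFs. The sketch below only does that arithmetic for the four runs described above. The helper name `p1_dofs` is made up for illustration, and the per-process figure is a plain average, not what the mesh partitioner actually assigns to each rank.

```python
def p1_dofs(n):
    """Number of DOFs of a scalar P1 Lagrange space on an n x n x n box mesh."""
    # One DOF per vertex; an n x n x n box has (n + 1) vertices per direction.
    return (n + 1) ** 3


if __name__ == "__main__":
    # (cells per direction, number of MPI processes) for the runs in this post
    runs = [(100, 8), (100, 16), (3, 22), (3, 23)]
    for n, nprocs in runs:
        total = p1_dofs(n)
        average = total / nprocs  # naive average, ignores ghost DOFs
        print(f"n={n:3d}  procs={nprocs:2d}  dofs={total:8d}  avg/proc={average:10.1f}")
```

So the small case has only 64 DOFs in total, which means each process owns only a handful of them once 20 or more processes are used.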
I'm really new to MPI, and I think the suggestions in the error log are about PETSc rather than about the MPI commands? Could someone give me a hint about what has gone wrong? So far I suspect it is either a Docker problem (but I put no constraints on the container, so it should have access to all cores)
or a PETSc problem (see the Frequently Asked Questions (FAQ) page of the PETSc 3.15.3 documentation; but here the matrix should not be excessively large).
The code is run on a workstation with 4 Intel Xeon 8280M CPUs, each with 28 cores and 56 threads.
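One way to test the Docker hypothesis is to look at what the container actually exposes. Besides the CPU count, the size of /dev/shm may matter: Docker gives a container a 64 MiB /dev/shm by default unless `--shm-size` is passed to `docker run`, and MPI implementations commonly use shared memory for communication between processes on the same node. Whether that is the cause here is only a guess. The sketch below prints both values; the function name `container_resources` is hypothetical, and it should be run inside the container.

```python
import os
import shutil


def container_resources(shm_path="/dev/shm"):
    """Report the CPU count and shared-memory size visible to this process."""
    info = {"cpus": os.cpu_count()}
    if os.path.isdir(shm_path):
        usage = shutil.disk_usage(shm_path)
        # Convert bytes to MiB for easier reading.
        info["shm_total_mib"] = usage.total / 2**20
        info["shm_free_mib"] = usage.free / 2**20
    else:
        # No shared-memory mount found at this path.
        info["shm_total_mib"] = None
        info["shm_free_mib"] = None
    return info


if __name__ == "__main__":
    for key, value in container_resources().items():
        print(f"{key}: {value}")
```

If the reported total is around 64 MiB, restarting the container with a larger `--shm-size` would be a cheap experiment before digging into PETSc itself.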
```python
import numpy as np
from mpi4py import MPI
from dolfinx import (Function, FunctionSpace, BoxMesh)
from dolfinx.fem import LinearProblem
from dolfinx.io import XDMFFile
from ufl import FiniteElement, TestFunction, TrialFunction, dx, inner

# Unit cube mesh; change [3, 3, 3] to [100, 100, 100] for the large case
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])

# First-order Lagrange function space
elem_type = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
V = FunctionSpace(mesh, elem_type)

# Right-hand side data
f = Function(V)
f.interpolate(lambda x: x[0]*x[0] + x[1]*x[1])

# L2 projection of f onto V: mass matrix on the left, load vector on the right
u, v = TrialFunction(V), TestFunction(V)
a = inner(u, v)*dx
L = inner(f, v)*dx

# Direct solve with MUMPS
problem = LinearProblem(a, L, petsc_options={"ksp_type": "preonly", "pc_type": "lu",
                                             "pc_factor_mat_solver_type": "mumps"})
uh = problem.solve()
```