Hello FEniCS community, I am running into PETSc errors when I run Python scripts with mpiexec in the dolfinx/dolfinx Docker environment on a 112-core workstation.
For a problem with many DOFs (a 100x100x100 mesh for this MWE's variational problem),
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [100, 100, 100])
my code runs fine with
mpiexec -n 8 python3 mwe.py
but it crashes with
mpiexec -n 16 python3 mwe.py
and the following error:
[11]PETSC ERROR: ------------------------------------------------------------------------
[11]PETSC ERROR: Caught signal number 7 BUS: Bus Error, possibly illegal memory access
[11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[11]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[11]PETSC ERROR: to get more information on the crash.
[11]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 11
For a small problem, say
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [3, 3, 3])
I can use more processes than for the large problem: mpiexec with up to 22 processes works fine, but it fails for
root@43279990c8d7:~/shared/dolfinx# mpiexec -n 23 python3 mwe.py
with the error:
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[10]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[10]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[10]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[10]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[10]PETSC ERROR: to get more information on the crash.
[10]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10
I'm really new to MPI, and I think the suggestions in the error log are aimed at PETSc rather than at the mpiexec command? Could someone give me a hint about where things have gone wrong? So far I suspect either a Docker problem (though I put no resource constraints on Docker, so it should have access to all cores) or a PETSc problem (see Frequently Asked Questions (FAQ) — PETSc 3.15.3 documentation, but the matrix here should not be excessively large).
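If I understand correctly, the suggested -malloc_debug is a PETSc option rather than an mpiexec one, so I assume it could be enabled through the PETSC_OPTIONS environment variable, something like:
```
# assumption: PETSc picks up options from the PETSC_OPTIONS environment variable,
# which should turn on the malloc checking suggested in the error message
PETSC_OPTIONS="-malloc_debug" mpiexec -n 16 python3 mwe.py
```
(I'm not sure this is the right way to pass it, though.)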
The code runs on a workstation with 4 Intel Xeon 8280M CPUs, each with 28 cores and 56 threads.
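For reference, I start the container with something like the following (the dolfinx/dolfinx image with a bind mount for my shared folder; the exact command may differ slightly from my setup, but the point is that no resource flags are set):
```
# no --cpus or --memory limits, so the container should see all 112 cores
docker run -ti -v "$(pwd)":/root/shared -w /root/shared dolfinx/dolfinx
```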
MWE:
```
import numpy as np
from mpi4py import MPI

from dolfinx import BoxMesh, Function, FunctionSpace
from dolfinx.fem import LinearProblem
from dolfinx.io import XDMFFile
from ufl import FiniteElement, TestFunction, TrialFunction, dx, inner

# 3x3x3 mesh for the small case; switch to [100, 100, 100] for the large case
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])

elem_type = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
V = FunctionSpace(mesh, elem_type)

# right-hand side data
f = Function(V)
f.interpolate(lambda x: x[0]*x[0] + x[1]*x[1])

# L2 projection of f onto V, solved with a direct solver (MUMPS)
u, v = TrialFunction(V), TestFunction(V)
a = inner(u, v)*dx
L = inner(f, v)*dx
problem = LinearProblem(a, L, petsc_options={"ksp_type": "preonly",
                                             "pc_type": "lu",
                                             "pc_factor_mat_solver_type": "mumps"})
uh = problem.solve()
```
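For completeness, these are the exact runs inside the container, with the BoxMesh divisions in the MWE switched between [3, 3, 3] and [100, 100, 100] as described above:
```
mpiexec -n 8  python3 mwe.py   # 100x100x100 mesh: runs fine
mpiexec -n 16 python3 mwe.py   # 100x100x100 mesh: crashes with the BUS error
mpiexec -n 22 python3 mwe.py   # 3x3x3 mesh: runs fine
mpiexec -n 23 python3 mwe.py   # 3x3x3 mesh: crashes with the SEGV error
```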