PETSc error when running with MPI

Hello FEniCS community, I am getting PETSc errors when running Python files with mpiexec in the dolfinx/dolfinx Docker environment on a 112-core workstation.

For a problem with a high number of DOFs (a 100x100x100 mesh for this MWE's variational problem),
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [100, 100, 100])
my code runs fine with mpiexec -n 8 python3 mwe.py, but crashes for
mpiexec -n 16 python3 mwe.py, with the error:
[11]PETSC ERROR: ------------------------------------------------------------------------
[11]PETSC ERROR: Caught signal number 7 BUS: Bus Error, possibly illegal memory access
[11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[11]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[11]PETSC ERROR: to get more information on the crash.
[11]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 11

For a small problem, say
mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([1,1,1])], [3, 3, 3])
the process count can go higher than for the large problem: mpiexec with up to 22 processes works fine, but it fails at
root@43279990c8d7:~/shared/dolfinx# mpiexec -n 23 python3 mwe.py
with the error:
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[10]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[10]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[10]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[10]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[10]PETSC ERROR: to get more information on the crash.
[10]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10

I’m really new to MPI, and I think the suggestions in the error log are aimed at PETSc rather than at the MPI command? Could someone give me a hint on where things have gone wrong? So far I suspect it is either a Docker problem (but I put no constraints on Docker, so it should have access to all cores)
or a PETSc problem (see Frequently Asked Questions (FAQ) — PETSc 3.15.3 documentation), but here the issue should not be an excessively large matrix.

The code is run on a workstation with 4 Intel Xeon 8280M CPUs, each with 28 cores and 56 threads.
MWE:

```python
from mpi4py import MPI
import numpy as np

from dolfinx import (Function, FunctionSpace, BoxMesh)
from dolfinx.fem import LinearProblem
from dolfinx.io import XDMFFile
from ufl import FiniteElement, TestFunction, TrialFunction, dx, inner

mesh = BoxMesh(MPI.COMM_WORLD, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])
elem_type = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
V = FunctionSpace(mesh, elem_type)

# Right-hand side function to project
f = Function(V)
f.interpolate(lambda x: x[0]*x[0] + x[1]*x[1])

# L2 projection of f onto V
u, v = TrialFunction(V), TestFunction(V)
a = inner(u, v)*dx
L = inner(f, v)*dx

problem = LinearProblem(a, L, petsc_options={"ksp_type": "preonly", "pc_type": "lu",
                                             "pc_factor_mat_solver_type": "mumps"})
uh = problem.solve()
```
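
To gauge how large the problem actually is on each rank (relevant to the memory question below), something like the following could be appended to the MWE. This is only a sketch; it assumes the dolfinx index map attributes `size_local` and `size_global` on `V.dofmap.index_map`.

```python
# Sketch: report how many DOFs each rank owns after partitioning
# (assumes V.dofmap.index_map exposes size_local / size_global).
num_local_dofs = V.dofmap.index_map.size_local
num_global_dofs = V.dofmap.index_map.size_global
print(f"rank {MPI.COMM_WORLD.rank}: owns {num_local_dofs} of {num_global_dofs} DOFs")
```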

Alternatively, does FEniCS/FEniCSx support switching between single (float) and double machine precision, so as to save some memory?

I can reproduce this with dolfinx in Docker on my computer, and an issue has been posted at:

See the last question at Frequently Asked Questions - Mpich. It’s likely that you’re running out of memory in the container.
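
If that is the suspicion, a minimal sketch for checking it from inside the container — assuming a Linux image that exposes `/proc/meminfo` and that `ru_maxrss` is reported in kB — could be run on each rank:

```python
# Sketch: print available memory and this process's peak RSS per rank
# (assumes Linux /proc/meminfo and ru_maxrss reported in kB).
import resource
from mpi4py import MPI

with open("/proc/meminfo") as fh:
    meminfo = dict(line.split(":", 1) for line in fh)
avail_mb = int(meminfo["MemAvailable"].split()[0]) // 1024
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024
print(f"rank {MPI.COMM_WORLD.rank}: MemAvailable {avail_mb} MB, peak RSS {peak_rss_mb} MB")
```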

The failure for the small mesh is almost certainly a different issue; it’s likely that some ranks have no cells, which is not supported yet.
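
For the small mesh, one hedged way to confirm this — assuming `mesh.topology.index_map` is available — is to check whether any rank ends up owning zero cells:

```python
# Sketch: flag ranks that own no cells after partitioning
# (assumes mesh.topology.index_map(dim).size_local).
tdim = mesh.topology.dim
if mesh.topology.index_map(tdim).size_local == 0:
    print(f"rank {MPI.COMM_WORLD.rank} owns no cells")
```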

Thanks for your kind replies! I will check the container memory.

Garth’s point is true for the small-mesh failure. I can run with more cores by changing the MPI.COMM_WORLD parameter to MPI.COMM_SELF (Implementation — FEniCSx tutorial).
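
For reference, a sketch of that change in the MWE (each rank then builds and solves its own full serial copy of the problem, so no rank can end up without cells):

```python
# Sketch: give every rank its own serial copy of the mesh
mesh = BoxMesh(MPI.COMM_SELF, [np.array([0, 0, 0]), np.array([1, 1, 1])],
               [3, 3, 3])
```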