Strange problem when generating a mesh with DOLFINx in parallel

Hi all, I am trying to generate a mesh with DOLFINx in parallel. The MWE is as follows.

from mpi4py import MPI
import dolfinx
from dolfinx.cpp.mesh import CellType

mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [[0.0, 0.0, 0.0], [200, 200, 200]], [96, 96, 96], CellType.hexahedron)

It works normally when I use 14 or fewer cores, e.g. with the command mpirun -n 14 python3 test.py.
However, when I use 15 or more cores, e.g. with the command mpirun -n 15 python3 test.py, the following error appears:

Assertion failed in file src/mpi/comm/comm_rank.c at line 55: 0
Assertion failed in file src/mpi/comm/comm_rank.c at line 55: 0
Assertion failed in file src/mpi/comm/comm_rank.c at line 55: 0
/lib/x86_64-linux-gnu/libmpich.so.12(MPL_backtrace_show+0x39) [0x7f8925586069]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x29bfe8) [0x7f89254e2fe8]
/lib/x86_64-linux-gnu/libmpich.so.12(MPI_Comm_rank+0x218) [0x7f89253e3fb8]
/usr/local/petsc/linux-gnu-real-32/lib/libpetsc.so.3.16(PetscFPrintf+0x9e) [0x7f891e7ce92e]
/usr/local/petsc/linux-gnu-real-32/lib/libpetsc.so.3.16(PetscErrorPrintfDefault+0x9e) [0x7f891e89e71e]
/usr/local/petsc/linux-gnu-real-32/lib/libpetsc.so.3.16(PetscSignalHandlerDefault+0x149) [0x7f891e89f889]
/usr/local/petsc/linux-gnu-real-32/lib/libpetsc.so.3.16(+0x17fad7) [0x7f891e89fad7]
/lib/x86_64-linux-gnu/libc.so.6(+0x41040) [0x7f8926187040]
/lib/x86_64-linux-gnu/libc.so.6(+0x1831cc) [0x7f89262c91cc]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x28eb99) [0x7f89254d5b99]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x326c82) [0x7f892556dc82]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x327e97) [0x7f892556ee97]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x32552e) [0x7f892556c52e]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x31caf0) [0x7f8925563af0]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x1e2e7e) [0x7f8925429e7e]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x1e3240) [0x7f892542a240]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x27b7d1) [0x7f89254c27d1]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x217da2) [0x7f892545eda2]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x15893e) [0x7f892539f93e]
/lib/x86_64-linux-gnu/libmpich.so.12(PMPI_Alltoallv+0xaf1) [0x7f89253a04b1]
/usr/local/dolfinx-real/lib/libdolfinx.so.0.3(_ZN7dolfinx5graph5build10distributeEiRKNS0_13AdjacencyListIlEERKNS2_IiEE+0x89a) [0x7f89200aeeaa]
/usr/local/dolfinx-real/lib/libdolfinx.so.0.3(_ZN7dolfinx4mesh11create_meshEiRKNS_5graph13AdjacencyListIlEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSA_7uvectorIdSaIdEEELm2ELNSA_11layout_typeE1ENSA_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS2_IiEEiiiS5_SK_EE+0x15f) [0x7f8920127abf]
/usr/local/dolfinx-real/lib/libdolfinx.so.0.3(+0x10605d) [0x7f892008805d]
/usr/local/dolfinx-real/lib/libdolfinx.so.0.3(_ZN7dolfinx10generation7BoxMesh6createEiRKSt5arrayIS2_IdLm3EELm2EES2_ImLm3EENS_4mesh8CellTypeENS8_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEEiiiRKNSD_IlEESA_EE+0x85) [0x7f89200884d5]
/usr/local/dolfinx-real/lib/python3.8/dist-packages/dolfinx/cpp.cpython-39-x86_64-linux-gnu.so(+0x128e94) [0x7f892032ae94]
/usr/local/dolfinx-real/lib/python3.8/dist-packages/dolfinx/cpp.cpython-39-x86_64-linux-gnu.so(+0x4e443) [0x7f8920250443]
python3() [0x54350c]
python3(_PyObject_MakeTpCall+0x39b) [0x521d6b]
python3(_PyEval_EvalFrameDefault+0x5be8) [0x51b9f8]
python3() [0x514a75]
python3(_PyFunction_Vectorcall+0x342) [0x52d302]
python3(_PyEval_EvalFrameDefault+0x559c) [0x51b3ac]
internal ABORT - process 0

Could anyone tell me what is happening here?

Another, possibly related, question: I have a code that solves a nonlinear elasticity problem. It works perfectly on a 48x48x48 mesh, but on a 96x96x96 mesh the residual is -nan in the first Newton iteration. I have tried all the remedies mentioned on the forum, including avoiding division by zero, adding a small number inside ufl.sqrt(), using a random initial guess, and so on, but I still cannot get past this. Is it related to the problem above?
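
For reference, this is the kind of guard I tried around the square root (a minimal sketch on a small mesh with purely illustrative names; my actual energy density is more involved):

from mpi4py import MPI
import dolfinx
from dolfinx.cpp.mesh import CellType
import ufl

# small mesh and displacement field, only to define the symbols (illustrative)
mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], [4, 4, 4], CellType.hexahedron)
V = dolfinx.VectorFunctionSpace(mesh, ("CG", 1))
u = dolfinx.Function(V)

eps = 1.0e-10                      # small regularisation constant
F = ufl.Identity(3) + ufl.grad(u)  # deformation gradient
C = F.T * F
I1 = ufl.tr(C)
# keep the argument of the square root strictly positive
safe_sqrt = ufl.sqrt(ufl.max_value(I1, eps))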

Thanks!

I cannot reproduce the behaviour with v0.3.0 using Spack:

from mpi4py import MPI
import dolfinx
from dolfinx.cpp.mesh import CellType

print(MPI.COMM_WORLD.rank, dolfinx.__version__, dolfinx.common.git_commit_hash)
mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [[0.0, 0.0, 0.0], [200, 200, 200]], [96, 96, 96], CellType.hexahedron)
print(MPI.COMM_WORLD.rank, "Finalized")

or with the latest main:

0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612

This points to an issue with your MPI installation.
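
Since the backtrace ends in PMPI_Alltoallv, you could also test the MPI installation without DOLFINx at all. A rough sketch using only mpi4py and NumPy (variable-sized messages to exercise Alltoallv):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# each pair of ranks exchanges a small, rank-dependent number of doubles;
# the count formula is symmetric, so send and receive layouts match
counts = np.array([(rank + r) % 5 + 1 for r in range(size)], dtype=np.int32)
displs = np.concatenate(([0], np.cumsum(counts[:-1]))).astype(np.int32)
sendbuf = np.full(counts.sum(), rank, dtype=np.float64)
recvbuf = np.empty(counts.sum(), dtype=np.float64)

comm.Alltoallv([sendbuf, (counts, displs), MPI.DOUBLE],
               [recvbuf, (counts, displs), MPI.DOUBLE])
print(rank, "Alltoallv OK")

If running this with mpirun -n 15 also aborts, the problem is in the MPICH installation rather than in DOLFINx.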


Hi @dokken, I upgraded the Docker image to the latest version and the problem still appears.
Running with 14 cores:

root@8eeea2dcf9d4:/shared# mpirun -n 14 python3 test.py
6 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
6 Finalized
9 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
9 Finalized
0 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
0 Finalized
4 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
4 Finalized
11 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
11 Finalized
1 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
1 Finalized
2 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
2 Finalized
3 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
3 Finalized
7 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
7 Finalized
8 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
8 Finalized
10 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
10 Finalized
12 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
12 Finalized
13 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
13 Finalized
5 0.3.1.0 d2ecfa4a0c78a22f0c5aff6fc29ce1c90dc20612
5 Finalized

Running with 15 cores:

root@8eeea2dcf9d4:/shared# mpirun -n 15 python3 test.py

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14335 RUNNING AT 8eeea2dcf9d4
=   EXIT CODE: 7
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Could this be a memory issue?
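
To rule memory in or out, here is a rough sketch of how I could report the peak memory per rank after the mesh is created (on Linux, ru_maxrss is in kB), and then compare the 14-core numbers against the available RAM:

import resource
from mpi4py import MPI
import dolfinx
from dolfinx.cpp.mesh import CellType

mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [[0.0, 0.0, 0.0], [200, 200, 200]],
                       [96, 96, 96], CellType.hexahedron)

# peak resident set size of this rank, in MB (only printed by ranks that survive)
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(MPI.COMM_WORLD.rank, f"peak RSS: {peak_mb:.0f} MB")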