Instability of mesh reading on HPC

Hi, I installed dolfinx from conda-forge (with MPICH) on CentOS 8 on an HPC cluster on Nov 8, 2023. On this and previous installs I have had a problem with mesh reading when I use many processes spread over multiple nodes. The following code sometimes fails when going to >100 processes, with the probability of failure increasing as I add more processes.

from mpi4py import MPI
import dolfinx.io

# Read the gmsh-generated mesh (converted to XDMF) in parallel on COMM_WORLD
with dolfinx.io.XDMFFile(MPI.COMM_WORLD, './mesh/naca0012_AR2.xdmf', 'r') as xdmf:
    mesh = xdmf.read_mesh(name='Grid')

Failure occurs when I launch a job with SLURM, but also when I simply oversubscribe the number of processes from the shell (e.g. mpirun -np 350 python3 test.py). The mesh is a 3D mesh made with gmsh and converted to XDMF. Here is a Google Drive link to a mesh I use.
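For reference, the gmsh-to-XDMF conversion step looks roughly like the following (a minimal sketch assuming meshio is used for the conversion; the file names are illustrative):

import meshio

# Read the gmsh output, keep only the tetrahedral volume cells,
# and write an XDMF file that dolfinx can read in parallel.
msh = meshio.read("naca0012_AR2.msh")
tetra_cells = msh.get_cells_type("tetra")
meshio.write("mesh/naca0012_AR2.xdmf",
             meshio.Mesh(points=msh.points, cells={"tetra": tetra_cells}))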

Sometimes it runs fine (especially at lower process counts), sometimes it hangs until the job times out, and on a previous conda install I have gotten an error that looks like

Abort(479297679) on node 177 (rank 177 in comm 448): Fatal error in internal_Gatherv: Other MPI error, error stack:
internal_Gatherv(156)......................: MPI_Gatherv(sendbuf=0x7ffc1258d560, sendcount=5, MPI_CHAR, recvbuf=(nil), recvcounts=(nil), displs=(nil), MPI_CHAR, 0, comm=0x84000007) failed
MPID_Gatherv(920)..........................: 
MPIDI_Gatherv_intra_composition_alpha(1308): 
MPIDI_NM_mpi_gatherv(153)..................: 
MPIR_Gatherv_impl(1175)....................: 
MPIR_Gatherv_allcomm_auto(1120)............: 
MPIR_Gatherv_allcomm_linear(137)...........: 
MPIC_Ssend(252)............................: 
MPIC_Wait(64)..............................: 
MPIR_Wait_state(886).......................: 
MPID_Progress_wait(335)....................: 
MPIDI_progress_test(158)...................: 
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(1016168591) on node 178 (rank 178 in comm 448): Fatal error in internal_Gatherv: Other MPI error, error stack:
internal_Gatherv(156)......................: MPI_Gatherv(sendbuf=0x7ffde4ddac10, sendcount=5, MPI_CHAR, recvbuf=(nil), recvcounts=(nil), displs=(nil), MPI_CHAR, 0, comm=0x84000007) failed
MPID_Gatherv(920)..........................: 
MPIDI_Gatherv_intra_composition_alpha(1308): 
MPIDI_NM_mpi_gatherv(153)..................: 
MPIR_Gatherv_impl(1175)....................: 
MPIR_Gatherv_allcomm_auto(1120)............: 
MPIR_Gatherv_allcomm_linear(137)...........: 
MPIC_Ssend(252)............................: 
MPIC_Wait(64)..............................: 
MPIR_Wait_state(886).......................: 
MPID_Progress_wait(335)....................: 
MPIDI_progress_test(158)...................: 
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1377759 RUNNING AT n533
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@n535] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:2@n535] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@n535] main (proxy/pmip.c:127): demux engine error waiting for event

srun: error: n543: task 10: Exited with exit code 7
srun: launch/slurm: _step_signal: Terminating StepId=621360.0
[mpiexec@n533] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec@n533] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec@n533] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:197): launcher returned error waiting for completion
[mpiexec@n533] main (mpiexec/mpiexec.c:252): process manager error waiting for completion

How big is your mesh?

Please note that conda isn't really the best way of using DOLFINx if you want to go to large problems, as it has limitations on PETSc integer representation: 64-bit-indices · Issue #163 · conda-forge/petsc-feedstock · GitHub
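If you want to check which integer width your PETSc build uses, here is a minimal sketch with petsc4py (illustrative only, not part of the original reply):

import numpy as np
from petsc4py import PETSc

# A 32-bit PetscInt limits the global problem size; very large
# distributed problems generally need a 64-bit build.
print(np.dtype(PETSc.IntType).itemsize * 8, "bit PetscInt")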

The mesh in the shared file has ~85,700 nodes and ~515,500 tetrahedra. With this mesh, when it loads, the solutions converge, but the solution is obviously not smooth at all, so I would like to increase to ~350,000 nodes and ~2e6 tetrahedra. I'm solving an aeroelastic problem, which unfortunately requires a huge number of DoFs and therefore a lot of memory from multiple nodes.

Unfortunately, my lab will not allow me to install Docker, and spack has had a problem for me in that neither mpirun nor srun recognises processors from more than one node when I launch a SLURM job; it seems to be related to my installation (launching with multiple nodes outside of the spack environment is totally fine). My installation is done with

spack add py-pip py-cython
spack add py-fenics-dolfinx@main%gcc@10.2.0 ^openmpi@4.1.6 ^petsc@3.20.2 cflags="-O3" fflags="-O3" +hypre+mumps ^fenics-dolfinx+slepc
spack add py-slepc4py
spack concretize -f
spack install --dirty

If you know of something I should change, I would be happy to try installing via spack again. The output when I launch a job with spack's MPI is

There are not enough slots available in the system to satisfy the 60
slots that were requested by the application:

  python3

Either request fewer slots for your application, or make more slots
available for use.

I would suggest trying to use mpich instead of openmpi, as there seem to be issues with openmpi:

I also seem to recall that you should turn on pmi to work on clusters, i.e. ^openmpi@4.1.6+pmi (spack/var/spack/repos/builtin/packages/openmpi/package.py at develop · spack/spack · GitHub)

I'm actually not too familiar with spack. Before I take the time to reinstall everything, I'd like to make sure: would you change
^openmpi@4.1.6 to ^mpich@4.2.0, or to ^mpich@4.2.0+pmi? Or is pmi only for openmpi?

You can set pmi for either. See for instance: Spack Packages

Using ^mpich@4.2.0+pmi I get Error: invalid values for variant "pmi" in package "mpich": [True]. Do you know the correct syntax, or is pmi on by default for mpich, unlike openmpi?

Maybe try +slurm instead.

I tried installing with
spack add py-fenics-dolfinx@main%gcc@10.2.0 ^mpich+slurm ^petsc@3.20.2 cflags="-O3" fflags="-O3" +hypre+mumps ^fenics-dolfinx+slepc
but during the final install of dolfinx I am hit with errors such as

 >> 3111    /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/fenics-dolfinx-main-glit4atksvssb4s6iaslfw43yx77blxf/include/dolfinx/fem/assemble_matrix_impl.h:386:35: error: deduced initializer does not satisfy placeholder constraints

and

 >> 3134    /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/fenics-dolfinx-main-glit4atksvssb4s6iaslfw43yx77blxf/include/dolfinx/fem/FiniteElement.h:27:12: error: the value of 'std::is_invocable_v<std::function<void(const std::span<std::complex<float>, 18446744073709551615>&, const std::span<const unsigned int>&, int, int)>, const std::span<_Type, 18446744073709551615>&, const std::span<const unsigned int, 18446744073709551615>&, int, int>' is not usable in a constant expression

ending with

See build log for details:
  /tmp/hmmak/spack-stage/spack-stage-py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc/spack-build-out.txt

==> Error: py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc: ProcessError: Command exited with status 1:
    '/tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/python-3.11.7-mfgsn54gb42z23zvrb4z6uacebtexprr/bin/python3.11' '-m' 'pip' '-vvv' '--no-input' '--no-cache-dir' '--disable-pip-version-check' 'install' '--no-deps' '--ignore-installed' '--no-build-isolation' '--no-warn-script-location' '--no-index' '--prefix=/tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc' '.'
==> Error: py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc: Package was not installed
==> Updating view at /tmp_user/sator/hmmak/spack/var/spack/environments/rfx/.spack-env/view
==> Error: Installation request failed.  Refer to reported errors for failing package(s).

Moreover, running a SLURM job via sbatch in this environment (without dolfinx) directly results in the error

$ sbatch test_spack.sh
sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source

So I downgraded dolfinx from @main to @0.7.2 and that installed fine. But I still have this error

$ sbatch test_spack.sh
sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source

If you're running on a well-maintained cluster, it's recommended to use the system MPI library. See Package Settings (packages.yaml) — Spack 0.21.1 documentation on how to use system-installed packages/libraries.
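For illustration only, an external MPI entry in packages.yaml might look like the sketch below (the spec and prefix are placeholders that must match your cluster's MPI installation):

packages:
  openmpi:
    externals:
    - spec: openmpi@4.1.1
      prefix: /opt/tools/openmpi/4.1.1-gnu831-hpc
    buildable: false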


Thank you, directly using my cluster's MPI library seems to have done the trick: multi-node/multi-process runs now work with dolfinx@0.7.2. I will now test whether there continue to be instabilities (preliminary testing suggests that all runs well).

However, I'm still having trouble getting dolfinx@main to install; I'm getting the same error as above (even when adding spack add fenics-dolfinx+adios2 and using petsc@3.20.4, which I know shouldn't affect it, but I tried it anyway). I would like to use nanobind, as there were some issues with multiphenicsx that were fixed with the nanobind update.

Try with GCC 11 or later. It looks like the error message could be related to C++20 'concepts'; I don't know if concepts are fully supported in GCC 10.
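For reference, a rough sketch of how a newer compiler can be added with spack (illustrative commands, not from the original reply):

spack install gcc@11
spack compiler find $(spack location -i gcc@11)

after which the environment can be re-concretized with %gcc@11 in the dolfinx spec.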

I have a PR at Always pass `std::span` by value by garth-wells · Pull Request #3059 · FEniCS/dolfinx · GitHub that changes all std::span arguments to pass by value, which might make a difference with GCC 10.

The PR didn't seem to fix the installation error. The error just changed to

 >> 3029    /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-10.2.0/fenics-dolfinx-main-pzgtpq4wgxayprqms4ewydsm6tuvmdzn/include/dolfinx/fem/FiniteElement.h:27:12: error: the value of 'std::is_invocable_v<std::function<void(std::span<std::complex<double>, 18446744073709551615>, std::span<const unsigned int>, int, int)>, std::span<_Type, 18446744073709551615>, std::span<const unsigned int, 18446744073709551615>, int, int>' is not usable in a constant expression

with fewer const qualifiers.

Nonetheless, I installed gcc 11 via spack and dolfinx installed fine. There was one hiccup: when I tried to install multiphenicsx, I still had to explicitly load the right gcc, as otherwise there was this error with dolfinx

/tmp_user/sator/hmmak/spack/var/spack/environments/rlfx/.spack-env/view/include/dolfinx/la/utils.h:9:10: fatal error: concepts: No such file or directory
       #include <concepts>
                ^~~~~~~~~~
      compilation terminated.

where <concepts> seems not to exist in gcc 10. But once the compiler was loaded, multiphenicsx installed fine.

I will now test if the instability still occurs with this installation.

You may want to explicitly pass "CXX=your_compiler VERBOSE=1 pip install -v…" when installing multiphenicsx to see if that helps pip use the compiler you installed.


That worked, thank you!

I think the problem is persisting despite this. I'm getting errors with xdmf.read_mesh that output

[n347:52168] PSM2 returned unhandled/unknown connect error: Operation timed out
[n347:52168] PSM2 EP connect error (unknown connect error):
[n347:52168]  n256
[n347:52168] 
[212]PETSC ERROR: ------------------------------------------------------------------------
[212]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[212]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[212]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[212]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[212]PETSC ERROR: to get more information on the crash.
[212]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 212 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

My system's local MPI is openmpi 4.1.1, as opposed to previously, when mpich was installed with anaconda.

Moreover, this is perhaps related to the issue I raised on petsc4py while on anaconda, where there was also instability with a large number of processors (more testing suggests the culprit may actually be the number of nodes rather than the number of processors). In particular, with the new install I get a similar error from petsc4py.init() as from xdmf.read_mesh, as follows

[n347:43460] PSM2 returned unhandled/unknown connect error: Operation timed out
[n347:43460] PSM2 EP connect error (unknown connect error):
[n347:43460]  n136
[n347:43460] 
[n347:43460] *** Process received signal ***
[n347:43460] Signal: Segmentation fault (11)
[n347:43460] Signal code: Address not mapped (1)
[n347:43460] Failing at address: 0x38
[n347:43460] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x14fd1816cb20]
[n347:43460] [ 1] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_send+0x266)[0x14fcf73c4f16]
[n347:43460] [ 2] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/openmpi/mca_pml_cm.so(+0x3c3a)[0x14fcf75cbc3a]
[n347:43460] [ 3] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/libmpi.so.40(ompi_coll_base_barrier_intra_tree+0xe6)[0x14fd031b99a6]
[n347:43460] [ 4] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x14fd0316ed48]
[n347:43460] [ 5] /tmp_user/sator/hmmak/spack/var/spack/environments/rlfx/.spack-env/view/lib/python3.11/site-packages/mpi4py/MPI.cpython-311-x86_64-linux-gnu.so(+0x58ac1)[0x14fd03487ac1]
[n347:43460] [ 6] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x16ac3e)[0x14fd186f7c3e]
[n347:43460] [ 7] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x34)[0x14fd186eb804]
[n347:43460] [ 8] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x3e23)[0x14fd1868e953]
[n347:43460] [ 9] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [10] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x252080)[0x14fd187df080]
[n347:43460] [11] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x1add82)[0x14fd1873ad82]
[n347:43460] [12] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x707f)[0x14fd18691baf]
[n347:43460] [13] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x256279)[0x14fd187e3279]
[n347:43460] [14] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x15ea4b)[0x14fd186eba4b]
[n347:43460] [15] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_CallMethodObjArgs+0xf0)[0x14fd186ebc30]
[n347:43460] [16] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyImport_ImportModuleLevelObject+0x4c1)[0x14fd18810ea1]
[n347:43460] [17] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0xac0e)[0x14fd1869573e]
[n347:43460] [18] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [19] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x252080)[0x14fd187df080]
[n347:43460] [20] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x1add82)[0x14fd1873ad82]
[n347:43460] [21] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x707f)[0x14fd18691baf]
[n347:43460] [22] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x256279)[0x14fd187e3279]
[n347:43460] [23] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x15ea4b)[0x14fd186eba4b]
[n347:43460] [24] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_CallMethodObjArgs+0xf0)[0x14fd186ebc30]
[n347:43460] [25] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyImport_ImportModuleLevelObject+0x4c1)[0x14fd18810ea1]
[n347:43460] [26] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0xac0e)[0x14fd1869573e]
[n347:43460] [27] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [28] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x29ee2d)[0x14fd1882be2d]
[n347:43460] [29] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyRun_SimpleFileObject+0x14f)[0x14fd1882d5af]
[n347:43460] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 43460 on node n347 exited on signal 11 (Segmentation fault).
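A minimal reproducer for this crash (assumed, since the original script is not shown) is just the MPI import plus PETSc initialisation:

from mpi4py import MPI
import petsc4py

# Initialise PETSc explicitly before importing PETSc; the crash reported
# above already appears at this stage when running across many nodes.
petsc4py.init()
from petsc4py import PETSc

MPI.COMM_WORLD.Barrier()
if MPI.COMM_WORLD.rank == 0:
    print("PETSc initialised on", MPI.COMM_WORLD.size, "ranks", flush=True)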

I also had a problem with the spack install on a specific mesh that worked in anaconda (with the same node/processor combination): it would completely stall in multiphenicsx.fem.petsc.create_vector_block(Fform, restriction=restriction). No output message at all; it would just run forever.