Hi, I installed dolfinx from conda-forge (with MPICH) on a CentOS 8 HPC cluster on Nov 8, 2023. On this and previous installs I have had a problem with mesh reading when I use many processes spread over multiple nodes. The following code sometimes fails when going to >100 processes, with the probability of failure increasing as I add more processes.
from mpi4py import MPI
import dolfinx.io
with dolfinx.io.XDMFFile(MPI.COMM_WORLD, './mesh/naca0012_AR2.xdmf', 'r') as xdmf:
    mesh = xdmf.read_mesh(name='Grid')
Failure occurs when I launch a job with SLURM, but also when I simply oversubscribe the number of processes from the shell (e.g. mpirun -np 350 python3 test.py). The mesh is a 3D mesh made using gmsh and converted to XDMF. Here is a google drive link to a mesh I use.
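For reference, below is the slightly extended version of the script I use to check the read (a minimal sketch, assuming the same file path and 'Grid' name as above): it just prints the number of locally owned cells per rank, so I can see whether any ranks hang or come back empty.

from mpi4py import MPI
import dolfinx.io

comm = MPI.COMM_WORLD
# Same read as above; the path and grid name are specific to my setup.
with dolfinx.io.XDMFFile(comm, './mesh/naca0012_AR2.xdmf', 'r') as xdmf:
    mesh = xdmf.read_mesh(name='Grid')

# Report how many cells each rank owns after partitioning, to spot ranks
# that hang or end up with no cells.
tdim = mesh.topology.dim
n_local = mesh.topology.index_map(tdim).size_local
print(f'rank {comm.rank}/{comm.size}: {n_local} local cells', flush=True)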
Sometimes it runs fine (especially at lower process counts), sometimes it hangs until the job times out, and on a previous conda install I have gotten an error that looks like
Abort(479297679) on node 177 (rank 177 in comm 448): Fatal error in internal_Gatherv: Other MPI error, error stack:
internal_Gatherv(156)......................: MPI_Gatherv(sendbuf=0x7ffc1258d560, sendcount=5, MPI_CHAR, recvbuf=(nil), recvcounts=(nil), displs=(nil), MPI_CHAR, 0, comm=0x84000007) failed
MPID_Gatherv(920)..........................:
MPIDI_Gatherv_intra_composition_alpha(1308):
MPIDI_NM_mpi_gatherv(153)..................:
MPIR_Gatherv_impl(1175)....................:
MPIR_Gatherv_allcomm_auto(1120)............:
MPIR_Gatherv_allcomm_linear(137)...........:
MPIC_Ssend(252)............................:
MPIC_Wait(64)..............................:
MPIR_Wait_state(886).......................:
MPID_Progress_wait(335)....................:
MPIDI_progress_test(158)...................:
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(1016168591) on node 178 (rank 178 in comm 448): Fatal error in internal_Gatherv: Other MPI error, error stack:
internal_Gatherv(156)......................: MPI_Gatherv(sendbuf=0x7ffde4ddac10, sendcount=5, MPI_CHAR, recvbuf=(nil), recvcounts=(nil), displs=(nil), MPI_CHAR, 0, comm=0x84000007) failed
MPID_Gatherv(920)..........................:
MPIDI_Gatherv_intra_composition_alpha(1308):
MPIDI_NM_mpi_gatherv(153)..................:
MPIR_Gatherv_impl(1175)....................:
MPIR_Gatherv_allcomm_auto(1120)............:
MPIR_Gatherv_allcomm_linear(137)...........:
MPIC_Ssend(252)............................:
MPIC_Wait(64)..............................:
MPIR_Wait_state(886).......................:
MPID_Progress_wait(335)....................:
MPIDI_progress_test(158)...................:
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1377759 RUNNING AT n533
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@n535] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:2@n535] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@n535] main (proxy/pmip.c:127): demux engine error waiting for event
srun: error: n543: task 10: Exited with exit code 7
srun: launch/slurm: _step_signal: Terminating StepId=621360.0
[mpiexec@n533] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec@n533] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec@n533] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:197): launcher returned error waiting for completion
[mpiexec@n533] main (mpiexec/mpiexec.c:252): process manager error waiting for completion
The mesh in the shared file has ~85,700 nodes and ~515,500 tetrahedra. With this mesh, when it loads, the solutions converge, but the result is obviously not smooth at all, so I would like to refine to ~350,000 nodes and ~2e6 tetrahedra. I'm solving an aeroelastic problem, which unfortunately requires a huge number of DoFs and therefore a lot of memory spread over multiple nodes.
Unfortunately, my lab will not allow me to install Docker, and with spack I have had a problem where neither mpirun nor srun recognises processors from more than one node when I launch a SLURM job; this seems to be due to my installation, since launching on multiple nodes outside the spack environment is totally fine. My installation is done with
If you know of something I should change, I would be happy to try installing via spack again. The output when I launch a job with spack's MPI is
There are not enough slots available in the system to satisfy the 60
slots that were requested by the application:
python3
Either request fewer slots for your application, or make more slots
available for use.
I'm actually not too familiar with spack. Before I take the time to reinstall everything, I'd like to make sure: would you change ^openmpi@4.1.6 to ^mpich@4.2.0 or to ^mpich@4.2.0+pmi? Or is pmi only for openmpi?
Using ^mpich@4.2.0+pmi I get Error: invalid values for variant "pmi" in package "mpich": [True]. Do you know the correct syntax, or is pmi on by default for mpich, unlike openmpi?
I tried installing with spack add py-fenics-dolfinx@main%gcc@10.2.0 ^mpich+slurm ^petsc@3.20.2 cflags="-O3" fflags="-O3" +hypre+mumps ^fenics-dolfinx+slepc
but during the final install of dolfinx I am hit with errors such as
>> 3111 /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/fenics-dolfinx-main-glit4atksvssb4s6iaslfw43yx77blxf/include/dolfinx/fem/assemble_matrix_impl.h:386:35: error: deduced initializer does not satisfy placeholder constraints
and
>> 3134 /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/fenics-dolfinx-main-glit4atksvssb4s6iaslfw43yx77blxf/include/dolfinx/fem/FiniteElement.h:27:12: error: the value of 'std::is_invocable_v<std::function<void(const std::span<std::complex<float>, 18446744073709551615>&, const std::span<const unsigned int>&, int, int)>, const std::span<_Type, 18446744073709551615>&, const std::span<const unsigned int, 18446744073709551615>&, int, int>' is not usable in a constant expression
ending with
See build log for details:
/tmp/hmmak/spack-stage/spack-stage-py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc/spack-build-out.txt
==> Error: py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc: ProcessError: Command exited with status 1:
'/tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/python-3.11.7-mfgsn54gb42z23zvrb4z6uacebtexprr/bin/python3.11' '-m' 'pip' '-vvv' '--no-input' '--no-cache-dir' '--disable-pip-version-check' 'install' '--no-deps' '--ignore-installed' '--no-build-isolation' '--no-warn-script-location' '--no-index' '--prefix=/tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc' '.'
==> Error: py-fenics-dolfinx-main-5n5ldrh6jzdg2hhdldg2b6ueaebd2ggc: Package was not installed
==> Updating view at /tmp_user/sator/hmmak/spack/var/spack/environments/rfx/.spack-env/view
==> Error: Installation request failed. Refer to reported errors for failing package(s).
Moreover, running a SLURM job via sbatch in this environment (without dolfinx) fails immediately with
$ sbatch test_spack.sh
sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source
Thank you, directly using my cluster's MPI library seems to have done the trick: multi-node/multi-process runs now work with dolfinx@0.7.2. I will now test whether the instabilities persist (preliminary testing suggests that all runs well).
However, I'm still having trouble getting dolfinx@main to install; I'm getting the same error as above (even when adding spack add fenics-dolfinx+adios2 and using petsc@3.20.4, which I know shouldn't affect it, but I tried anyway). I would like to use nanobind, as there were some issues with multiphenicsx that were fixed by the nanobind update.
The PR didn't seem to fix the installation error. It just changed to
>> 3029 /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-10.2.0/fenics-dolfinx-main-pzgtpq4wgxayprqms4ewydsm6tuvmdzn/include/dolfinx/fem/FiniteElement.h:27:12: error: the value of 'std::is_invocable_v<std::function<void(std::span<std::complex<double>, 18446744073709551615>, std::span<const unsigned int>, int, int)>, std::span<_Type, 18446744073709551615>, std::span<const unsigned int, 18446744073709551615>, int, int>' is not usable in a constant expression
with fewer const qualifiers.
Nonetheless, I installed gcc 11 via spack and dolfinx installed fine. There was one hiccup: when I tried to install multiphenicsx, I still had to explicitly load the spack-installed gcc 11, since with gcc 10 I got this error from the dolfinx headers
/tmp_user/sator/hmmak/spack/var/spack/environments/rlfx/.spack-env/view/include/dolfinx/la/utils.h:9:10: fatal error: concepts: No such file or directory
#include <concepts>
^~~~~~~~~~
compilation terminated.
where the <concepts> header does not seem to be available with gcc 10. But once gcc 11 was loaded, multiphenicsx installed fine.
I will now test if the instability still occurs with this installation.
You may want to explicitly pass "CXX=your_compiler VERBOSE=1 pip install -v …" when installing multiphenicsx to see if that helps pip use the compiler you installed.
I think the problem is persisting despite this. I'm getting errors from xdmf.read_mesh that output
[n347:52168] PSM2 returned unhandled/unknown connect error: Operation timed out
[n347:52168] PSM2 EP connect error (unknown connect error):
[n347:52168] n256
[n347:52168]
[212]PETSC ERROR: ------------------------------------------------------------------------
[212]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[212]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[212]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[212]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[212]PETSC ERROR: to get more information on the crash.
[212]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 212 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
My system's local MPI is OpenMPI 4.1.1, as opposed to the previous setup where MPICH was installed with anaconda.
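As a sanity check that the spack build really is linked against the cluster's OpenMPI, here is the small snippet I use (a minimal sketch; MPI.Get_library_version() is standard mpi4py and just reports the underlying implementation string):

from mpi4py import MPI

# Print which MPI implementation mpi4py was built against, to confirm it is
# the cluster's OpenMPI 4.1.1 rather than a conda-provided MPICH.
if MPI.COMM_WORLD.rank == 0:
    print(MPI.Get_library_version())
    print('MPI standard version:', MPI.Get_version())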
Moreover, this is perhaps related to the issue I raised on petsc4py while on anaconda, where there was also instability with a large number of processes (from further testing, perhaps not actually the number of processes but rather the number of nodes). Especially now with the new install, I get a similar error from petsc4py.init() as from xdmf.read_mesh, as follows
[n347:43460] PSM2 returned unhandled/unknown connect error: Operation timed out
[n347:43460] PSM2 EP connect error (unknown connect error):
[n347:43460] n136
[n347:43460]
[n347:43460] *** Process received signal ***
[n347:43460] Signal: Segmentation fault (11)
[n347:43460] Signal code: Address not mapped (1)
[n347:43460] Failing at address: 0x38
[n347:43460] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x14fd1816cb20]
[n347:43460] [ 1] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_send+0x266)[0x14fcf73c4f16]
[n347:43460] [ 2] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/openmpi/mca_pml_cm.so(+0x3c3a)[0x14fcf75cbc3a]
[n347:43460] [ 3] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/libmpi.so.40(ompi_coll_base_barrier_intra_tree+0xe6)[0x14fd031b99a6]
[n347:43460] [ 4] /opt/tools/openmpi/4.1.1-gnu831-hpc/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x14fd0316ed48]
[n347:43460] [ 5] /tmp_user/sator/hmmak/spack/var/spack/environments/rlfx/.spack-env/view/lib/python3.11/site-packages/mpi4py/MPI.cpython-311-x86_64-linux-gnu.so(+0x58ac1)[0x14fd03487ac1]
[n347:43460] [ 6] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x16ac3e)[0x14fd186f7c3e]
[n347:43460] [ 7] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x34)[0x14fd186eb804]
[n347:43460] [ 8] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x3e23)[0x14fd1868e953]
[n347:43460] [ 9] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [10] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x252080)[0x14fd187df080]
[n347:43460] [11] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x1add82)[0x14fd1873ad82]
[n347:43460] [12] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x707f)[0x14fd18691baf]
[n347:43460] [13] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x256279)[0x14fd187e3279]
[n347:43460] [14] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x15ea4b)[0x14fd186eba4b]
[n347:43460] [15] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_CallMethodObjArgs+0xf0)[0x14fd186ebc30]
[n347:43460] [16] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyImport_ImportModuleLevelObject+0x4c1)[0x14fd18810ea1]
[n347:43460] [17] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0xac0e)[0x14fd1869573e]
[n347:43460] [18] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [19] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x252080)[0x14fd187df080]
[n347:43460] [20] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x1add82)[0x14fd1873ad82]
[n347:43460] [21] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x707f)[0x14fd18691baf]
[n347:43460] [22] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x256279)[0x14fd187e3279]
[n347:43460] [23] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x15ea4b)[0x14fd186eba4b]
[n347:43460] [24] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyObject_CallMethodObjArgs+0xf0)[0x14fd186ebc30]
[n347:43460] [25] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyImport_ImportModuleLevelObject+0x4c1)[0x14fd18810ea1]
[n347:43460] [26] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0xac0e)[0x14fd1869573e]
[n347:43460] [27] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x217)[0x14fd187e3127]
[n347:43460] [28] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(+0x29ee2d)[0x14fd1882be2d]
[n347:43460] [29] /tmp_user/sator/hmmak/spack/opt/spack/linux-centos8-broadwell/gcc-11.4.0/python-3.11.7-sl3mqtufyi7sh5osh26yqyuhkjnsjze4/lib/libpython3.11.so.1.0(_PyRun_SimpleFileObject+0x14f)[0x14fd1882d5af]
[n347:43460] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 43460 on node n347 exited on signal 11 (Segmentation fault).
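Since the backtrace above dies inside MPI_Barrier, I have been trying to isolate whether this is an MPI/PSM2 transport problem rather than anything specific to dolfinx or petsc4py. Below is the minimal test I run (a sketch with no dolfinx or petsc4py at all, launched with the same node/process layout that triggers the errors):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# A bare barrier plus a tiny allreduce across all ranks; if this also fails
# with PSM2 connect errors, the problem sits in the MPI layer rather than
# in dolfinx or petsc4py.
comm.Barrier()
total = comm.allreduce(1, op=MPI.SUM)
if comm.rank == 0:
    print(f'barrier and allreduce completed across {total} ranks', flush=True)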
I also had a problem with the spack install on a specific mesh that worked under anaconda (with the same node/process combination): it would stall completely in multiphenicsx.fem.petsc.create_vector_block(Fform, restriction=restriction). No output message at all; it would just run forever.