Hi,
I am having trouble running dolfinx in parallel on an HPC cluster. To begin, I pull the latest image from Docker Hub:
singularity pull --name dolfinx.sif docker://dolfinx/dolfinx:latest
then try to run the following MWE:
from dolfinx.io import XDMFFile
from dolfinx import fem
from dolfinx.mesh import create_unit_cube, CellType
from mpi4py import MPI
comm = MPI.COMM_WORLD
mesh = create_unit_cube(comm, 50, 50, 50, CellType.tetrahedron)
V = fem.FunctionSpace(mesh, ("CG", 1))
u = fem.Function(V, name="u")
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
    f.write_mesh(mesh)
    f.write_function(u, 0.)
It runs without any issues in serial (after appending the necessary paths to $PYTHONPATH). However, when I try to run it in parallel using Slurm, specifically with the following batch file, it fails to create the HDF5 file.
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=22
#SBATCH --job-name=mwe
#SBATCH --partition=meca
cd $SLURM_SUBMIT_DIR
echo $SLURM_SUBMIT_DIR
export JOBID=`echo $SLURM_JOB_ID | cut -d"." -f1`
export OMP_NUM_THREADS=1
module load openmpi/4.0.5-intel-18.0
module load singularity/3.2.0
export PYTHONPATH="/usr/local/dolfinx-real/lib/python3.10/dist-packages:/usr/local/lib:$PYTHONPATH"
echo $PYTHONPATH
mpiexec -n 22 singularity exec -B $SLURM_SUBMIT_DIR:/home/bshrima2/dolfinxSimul $SLURM_SUBMIT_DIR/dolfinx.sif python3 -u /home/bshrima2/dolfinxSimul/mwe.py &> $SLURM_SUBMIT_DIR/mwe_${JOBID}.oe
# singularity exec -B $SLURM_SUBMIT_DIR:/home/bshrima2/dolfinxSimul $SLURM_SUBMIT_DIR/dolfinx.sif python3 -u /home/bshrima2/dolfinxSimul/mwe.py &> $SLURM_SUBMIT_DIR/mwe_${JOBID}.oe
which throws
HDF5 error
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
#000: H5F.c line 532 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3282 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3248 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 63 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1898 in H5F_open(): unable to lock the file
major: File accessibility
minor: Unable to lock file
#005: H5FD.c line 1625 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Unable to lock file
#006: H5FDsec2.c line 1002 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: Virtual File Layer
minor: Unable to lock file
Traceback (most recent call last):
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
#000: H5F.c line 532 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3282 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3248 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 63 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1898 in H5F_open(): unable to lock the file
major: File accessibility
minor: Unable to lock file
#005: H5FD.c line 1625 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Unable to lock file
#006: H5FDsec2.c line 1002 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: Virtual File Layer
minor: Unable to lock file
File "/home/bshrima2/dolfinxSimul/mwe.py", line 11, in <module>
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
RuntimeError: Failed to create HDF5 file.
Traceback (most recent call last):
File "/home/bshrima2/dolfinxSimul/mwe.py", line 11, in <module>
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
RuntimeError: Failed to create HDF5 file.
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
#000: H5F.c line 532 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3282 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3248 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 63 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1898 in H5F_open(): unable to lock the file
major: File accessibility
minor: Unable to lock file
#005: H5FD.c line 1625 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Unable to lock file
#006: H5FDsec2.c line 1002 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: Virtual File Layer
minor: Unable to lock file
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
#000: H5F.c line 532 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3282 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3248 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 63 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1898 in H5F_open(): unable to lock the file
major: File accessibility
minor: Unable to lock file
#005: H5FD.c line 1625 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Unable to lock file
#006: H5FDsec2.c line 1002 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: Virtual File Layer
minor: Unable to lock file
Traceback (most recent call last):
File "/home/bshrima2/dolfinxSimul/mwe.py", line 11, in <module>
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
RuntimeError: Failed to create HDF5 file.
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
#000: H5F.c line 532 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3282 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3248 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 63 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1898 in H5F_open(): unable to lock the file
major: File accessibility
minor: Unable to lock file
#005: H5FD.c line 1625 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Unable to lock file
#006: H5FDsec2.c line 1002 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: Virtual File Layer
minor: Unable to lock file
Traceback (most recent call last):
File "/home/bshrima2/dolfinxSimul/mwe.py", line 11, in <module>
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
RuntimeError: Failed to create HDF5 file.
Traceback (most recent call last):
File "/home/bshrima2/dolfinxSimul/mwe.py", line 11, in <module>
with XDMFFile(mesh.comm, "test.xdmf", "w") as f:
RuntimeError: Failed to create HDF5 file.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55593,1],6]
Exit code: 1
--------------------------------------------------------------------------
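One thing I noticed in the output above: every HDF5-DIAG block is labelled "MPI-process 0", even though the job was launched with 22 processes. Below is a rough sketch for tallying the reported ranks in the .oe file. The helper name `count_reported_ranks` and the script name in the comment are my own, hypothetical choices, and the regular expression only assumes the header format shown in the log. If rank 0 is the only one that ever appears, my guess is that the processes inside the container do not see each other as a single MPI job, but I have not confirmed this.

```python
import re
import sys
from collections import Counter

# Matches the header line HDF5 prints at the top of each diagnostic block, e.g.
# "HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:"
RANK_RE = re.compile(r"HDF5-DIAG: .*MPI-process (\d+):")


def count_reported_ranks(log_text):
    """Map each MPI rank named in an HDF5-DIAG header to its number of blocks."""
    return Counter(int(m.group(1)) for m in RANK_RE.finditer(log_text))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 count_ranks.py mwe_<JOBID>.oe
    with open(sys.argv[1]) as fh:
        counts = count_reported_ranks(fh.read())
    for rank, n in sorted(counts.items()):
        print(f"rank {rank}: {n} HDF5-DIAG block(s)")
```

With a working parallel launch I would expect failures, if any, to be reported under several different ranks rather than only rank 0.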
Any clues?
Thanks,
Bhavesh