I have a working FEniCS 2019.2.dev0 + TFEL/MFront code that runs well on a single workstation: with meshes of 100-400k elements it completes without any runtime errors. However, I am currently doing a mesh convergence study that requires meshes of 1.5+ million elements, so I am trying to run the code on the university cluster.
The cluster has a small $HOME directory (about 10 GB); most of the allocated space is in $WORKDIR.
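Because of this, the script below redirects the Singularity and dijitso caches into $WORKDIR; a quick check that the cache tree really lives on the work filesystem (paths as in my script) is:

# size of the redirected cache tree on the work filesystem
du -sh /mnt/beegfs/workdir/user/Singularity/TEMP/.cache
# free space on the work filesystem
df -h /mnt/beegfs/workdir/user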
I launch my code with the following sbatch script:
#!/bin/bash
## BEGIN SBATCH directives
#SBATCH --job-name=VP
#SBATCH --output=STD_Output_07.txt
#SBATCH --error=STD_Error_07.txt
#SBATCH --account=LMS
#
#SBATCH --ntasks=400
#SBATCH --hint=nomultithread
##SBATCH --mem-per-cpu=8G
#SBATCH --mem=0
#SBATCH --partition=cpu_dist
##SBATCH --nodefile=/mnt/beegfs/workdir/user/Singularity/MeshConvergenceTest/Nodelist.txt
#SBATCH --time=24:00:00
## END SBATCH directives
module purge
module load singularity/3.4.1
module load gcc/10.2.0
module load openmpi/4.1.0
export HOME=/mnt/beegfs/workdir/user
export SINGULARITY_CACHEDIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity
export SINGULARITY_TMPDIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity_temp
export SINGULARITY_HOME=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity_temp
export DIJITSO_CACHE_DIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/dijitso
mpirun -n $SLURM_NTASKS \
    singularity exec -H /mnt/beegfs/workdir/user \
    -B $PWD,/mnt/beegfs/workdir/user/Singularity/ /mnt/beegfs/workdir/user/Singularity/fenics-mfront.simg bash -c \
    'source /home/fenics/.local/codes/mgis/master/install/env.sh; python3 Viscoplasticity.py'
When I do, the program either freezes during the initialization/assembly phase or crashes with the following output in the error file:
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: node051
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: node051
Local device: mlx5_0
--------------------------------------------------------------------------
[node048:166149] 12 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node048:166149] 12 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 29 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 29 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 139 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 139 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 117 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 117 more processes have sent help message help-mpi-btl-openib.txt / error in device init
mlx5: node056: got completion with error:
mlx5: node056: got completion with error:
mlx5: node056: got completion with error:
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000000cf 00000000 00000000 00000000
00000000 00008914 10013754 0e5c4fd3
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000060 00000000 00000000 00000000
00000000 00008914 1000f4e0 0faf6cd3
00000000 00000000 00000000 00000000
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000066 00000000 00000000 00000000
00000000 00008914 1000f5dd 0fd125d3
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000094 00000000 00000000 00000000
00000000 00008914 10012dbd 0f2f45d3
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000004c 00000000 00000000 00000000
00000000 00008914 1000f14e 0eb8d1d3
00000000 00000000 00000000 00000000
...
...
...
bash: line 1: 134135 Aborted python3 Viscoplasticity.py
SIGABRT: abort
PC=0x47cdab m=0 sigcode=0
goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x20586, 0x6, 0x0, 0x0, 0xc00008e000, 0xc00008e000)
/usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc000169e70 sp=0xc000169e68 pc=0x47cdab
syscall.Kill(0x20586, 0x6, 0x0, 0x0)
/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc000169eb8 sp=0xc000169e70 pc=0x479bcb
github.com/sylabs/singularity/internal/app/starter.Master.func2()
internal/app/starter/master_linux.go:152 +0x61 fp=0xc000169f00 sp=0xc000169eb8 pc=0x7928f1
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc000169f28 sp=0xc000169f00 pc=0x790f4f
main.main()
cmd/starter/main_linux.go:102 +0x5f fp=0xc000169f60 sp=0xc000169f28 pc=0x972bbf
runtime.main()
/usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc000169fe0 sp=0xc000169f60 pc=0x433b4e
runtime.goexit()
/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000169fe8 sp=0xc000169fe0 pc=0x45f7c1
goroutine 6 [syscall]:
os/signal.signal_recv(0xb9da80)
/usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
/usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41
goroutine 8 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0002c4ff0)
internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x20866, 0xc00000f2a0)
internal/app/starter/master_linux.go:151 +0x44c
main.startup()
cmd/starter/main_linux.go:75 +0x53e
created by main.main
cmd/starter/main_linux.go:98 +0x35
rax 0x0
rbx 0x0
rcx 0xffffffffffffffff
rdx 0x0
rdi 0x20586
rsi 0x6
rbp 0xc000169ea8
rsp 0xc000169e68
r8 0x0
r9 0x0
r10 0x0
r11 0x202
r12 0xf3
r13 0x0
r14 0xb83e88
r15 0x0
rip 0x47cdab
rflags 0x202
cs 0x33
fs 0x0
gs 0x0
bash: line 1: 134121 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134137 Killed python3 Viscoplasticity.py
bash: line 1: 134129 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134110 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134104 Aborted python3 Viscoplasticity.py
[... second SIGABRT traceback omitted; identical to the one above apart from PIDs and addresses ...]
bash: line 1: 134143 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134139 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134142 Segmentation fault python3 Viscoplasticity.py
bash: line 1: 134140 Segmentation fault python3 Viscoplasticity.py
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[8191,1],300]
Exit code: 139
--------------------------------------------------------------------------
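For reference, this is how I read the override mentioned in the first warning (I have not verified that re-enabling the openib BTL, rather than letting UCX drive the InfiniBand ports, is the right choice on this fabric):

# allow the legacy openib BTL to use the InfiniBand ports, as the warning suggests
mpirun --mca btl_openib_allow_ib true -n $SLURM_NTASKS ...
# or, alternatively, force the UCX PML and disable the openib BTL entirely
mpirun --mca pml ucx --mca btl ^openib -n $SLURM_NTASKS ...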
On the cluster, smaller meshes run without any issues; it is only the larger meshes that crash as shown above.
Could this be a problem with the Singularity container, or is something else going on?
Has anyone encountered such a problem before, and how did you solve it?
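To help narrow it down, would a bare MPI hello-world through the exact same launch line be a sensible first check, to separate container/interconnect problems from my FEniCS code? Something like this (assuming mpi4py is available in the image, which I believe it is since FEniCS depends on it):

# same launch line as above, but running a trivial MPI program instead of the solver
mpirun -n $SLURM_NTASKS \
    singularity exec -H /mnt/beegfs/workdir/user \
    -B $PWD,/mnt/beegfs/workdir/user/Singularity/ /mnt/beegfs/workdir/user/Singularity/fenics-mfront.simg \
    python3 -c 'from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.rank, "of", c.size, "on", MPI.Get_processor_name())'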
Thank you in advance!