Slurm run using Singularity with FEniCS and MFront

I have a working FEniCS 2019.2.dev0 + TFEL/MFront code that runs well on a single workstation. With meshes of 100-400k elements the program runs without any runtime errors, but since I am currently doing a mesh convergence study with meshes of 1.5+ million elements, I am trying to run it on the university cluster.

The cluster has a small $HOME directory (about 10 GB); most of the allocated storage is in $WORKDIR, which is why the script below redirects $HOME and the Singularity/dijitso caches there.

When I launch my code with the following sbatch script:

#!/bin/bash

## BEGIN SBATCH directives
#SBATCH --job-name=VP
#SBATCH --output=STD_Output_07.txt
#SBATCH --error=STD_Error_07.txt
#SBATCH --account=LMS
#
#SBATCH --ntasks=400
#SBATCH --hint=nomultithread 
##SBATCH --mem-per-cpu=8G
#SBATCH --mem=0
#SBATCH --partition=cpu_dist
##SBATCH --nodefile=/mnt/beegfs/workdir/user/Singularity/MeshConvergenceTest/Nodelist.txt
#SBATCH --time=24:00:00
## END SBATCH directives



module purge
module load singularity/3.4.1
module load gcc/10.2.0 
module load openmpi/4.1.0

export HOME=/mnt/beegfs/workdir/user
export SINGULARITY_CACHEDIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity
export SINGULARITY_TMPDIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity_temp
export SINGULARITY_HOME=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/singularity_temp
export DIJITSO_CACHE_DIR=/mnt/beegfs/workdir/user/Singularity/TEMP/.cache/dijitso


mpirun -n $SLURM_NTASKS \
	singularity exec -H /mnt/beegfs/workdir/user \
	-B $PWD,/mnt/beegfs/workdir/user/Singularity/ /mnt/beegfs/workdir/user/Singularity/fenics-mfront.simg bash -c \
	'source /home/fenics/.local/codes/mgis/master/install/env.sh; python3 Viscoplasticity.py'

the program either freezes during the initialization/assembly phase, or it crashes with the following output in the error file:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              node051
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node051
  Local device: mlx5_0
--------------------------------------------------------------------------
[node048:166149] 12 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node048:166149] 12 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 29 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 29 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 139 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 139 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node048:166149] 117 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[node048:166149] 117 more processes have sent help message help-mpi-btl-openib.txt / error in device init
mlx5: node056: got completion with error:
mlx5: node056: got completion with error:
mlx5: node056: got completion with error:
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000000cf 00000000 00000000 00000000
00000000 00008914 10013754 0e5c4fd3
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000060 00000000 00000000 00000000
00000000 00008914 1000f4e0 0faf6cd3
00000000 00000000 00000000 00000000
mlx5: node055: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000066 00000000 00000000 00000000
00000000 00008914 1000f5dd 0fd125d3
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000094 00000000 00000000 00000000
00000000 00008914 10012dbd 0f2f45d3
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000004c 00000000 00000000 00000000
00000000 00008914 1000f14e 0eb8d1d3
00000000 00000000 00000000 00000000
...
...
...

bash: line 1: 134135 Aborted                 python3 Viscoplasticity.py
SIGABRT: abort
PC=0x47cdab m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x20586, 0x6, 0x0, 0x0, 0xc00008e000, 0xc00008e000)
	/usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc000169e70 sp=0xc000169e68 pc=0x47cdab
syscall.Kill(0x20586, 0x6, 0x0, 0x0)
	/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc000169eb8 sp=0xc000169e70 pc=0x479bcb
github.com/sylabs/singularity/internal/app/starter.Master.func2()
	internal/app/starter/master_linux.go:152 +0x61 fp=0xc000169f00 sp=0xc000169eb8 pc=0x7928f1
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
	internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc000169f28 sp=0xc000169f00 pc=0x790f4f
main.main()
	cmd/starter/main_linux.go:102 +0x5f fp=0xc000169f60 sp=0xc000169f28 pc=0x972bbf
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc000169fe0 sp=0xc000169f60 pc=0x433b4e
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000169fe8 sp=0xc000169fe0 pc=0x45f7c1

goroutine 6 [syscall]:
os/signal.signal_recv(0xb9da80)
	/usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
	/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 8 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0002c4ff0)
	internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x20866, 0xc00000f2a0)
	internal/app/starter/master_linux.go:151 +0x44c
main.startup()
	cmd/starter/main_linux.go:75 +0x53e
created by main.main
	cmd/starter/main_linux.go:98 +0x35

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x20586
rsi    0x6
rbp    0xc000169ea8
rsp    0xc000169e68
r8     0x0
r9     0x0
r10    0x0
r11    0x202
r12    0xf3
r13    0x0
r14    0xb83e88
r15    0x0
rip    0x47cdab
rflags 0x202
cs     0x33
fs     0x0
gs     0x0
bash: line 1: 134121 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134137 Killed                  python3 Viscoplasticity.py
bash: line 1: 134129 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134110 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134104 Aborted                 python3 Viscoplasticity.py
SIGABRT: abort
PC=0x47cdab m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x20584, 0x6, 0x0, 0x0, 0xc000090000, 0xc000090000)
	/usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc0001a9e70 sp=0xc0001a9e68 pc=0x47cdab
syscall.Kill(0x20584, 0x6, 0x0, 0x0)
	/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc0001a9eb8 sp=0xc0001a9e70 pc=0x479bcb
github.com/sylabs/singularity/internal/app/starter.Master.func2()
	internal/app/starter/master_linux.go:152 +0x61 fp=0xc0001a9f00 sp=0xc0001a9eb8 pc=0x7928f1
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
	internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc0001a9f28 sp=0xc0001a9f00 pc=0x790f4f
main.main()
	cmd/starter/main_linux.go:102 +0x5f fp=0xc0001a9f60 sp=0xc0001a9f28 pc=0x972bbf
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc0001a9fe0 sp=0xc0001a9f60 pc=0x433b4e
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0001a9fe8 sp=0xc0001a9fe0 pc=0x45f7c1

goroutine 6 [syscall]:
os/signal.signal_recv(0xb9da80)
	/usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
	/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 8 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0002d8ff0)
	internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x20876, 0xc00000f240)
	internal/app/starter/master_linux.go:151 +0x44c
main.startup()
	cmd/starter/main_linux.go:75 +0x53e
created by main.main
	cmd/starter/main_linux.go:98 +0x35

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x20584
rsi    0x6
rbp    0xc0001a9ea8
rsp    0xc0001a9e68
r8     0x0
r9     0x0
r10    0x0
r11    0x202
r12    0xf3
r13    0x0
r14    0xb83e88
r15    0x0
rip    0x47cdab
rflags 0x202
cs     0x33
fs     0x0
gs     0x0
bash: line 1: 134143 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134139 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134142 Segmentation fault      python3 Viscoplasticity.py
bash: line 1: 134140 Segmentation fault      python3 Viscoplasticity.py
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[8191,1],300]
  Exit code:    139
--------------------------------------------------------------------------

On the cluster, smaller meshes run without any issues, but larger meshes crash.
Could this be a problem with the Singularity container, or is something else going on?
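
For reference, the Open MPI warning at the top of the log says the intent is to use UCX for these InfiniBand devices, or alternatively to set the btl_openib_allow_ib MCA parameter. I assume that would look roughly like the sketch below (same singularity exec line as in the script above), but I have not verified that either variant actually fixes the crash:

# Untested sketch, based only on the Open MPI warning text above.
# Variant 1: force the UCX PML, which the warning says is the intended path for these devices.
mpirun --mca pml ucx -n $SLURM_NTASKS \
	singularity exec -H /mnt/beegfs/workdir/user \
	-B $PWD,/mnt/beegfs/workdir/user/Singularity/ /mnt/beegfs/workdir/user/Singularity/fenics-mfront.simg bash -c \
	'source /home/fenics/.local/codes/mgis/master/install/env.sh; python3 Viscoplasticity.py'

# Variant 2: explicitly allow the openib BTL on InfiniBand ports, as the warning suggests.
# mpirun --mca btl_openib_allow_ib true -n $SLURM_NTASKS ... (rest of the command unchanged)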

Has anyone encountered this problem before, or does anyone know how to solve it?

Thank you in advance!