Hello everyone,
I am running FEniCSx (dolfinx) simulations on an HPC cluster and recently encountered issues related to JIT caching and NFS load. The cluster administrators observed a large number of NFS operations originating from the FEniCS cache directory in my HOME filesystem. The simulation jobs were repeatedly accessing the default cache location ~/.cache/fenics, which caused heavy load on the NFS filesystem.
Example log messages from the NFS server are:
NFSv4.1 OP_OPEN NFS4ERR_EXIST
File:/export/home_b/home_9/…/.cache/fenics
Software versions used are:
FEniCSx (dolfinx) version: 0.8
Python version: 3.10
The cluster setup details are:
- HPC cluster using SLURM
- HOME directory located on NFS
- Parallel workspace available at /data/…
- Node-local memory available at /dev/shm
- Conda environment with FEniCSx installed
I tried to redirect the cache by setting the environment variable XDG_CACHE_HOME inside the SLURM job script:
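```
# Per-user, per-job cache location on node-local shared memory
export FENICS_CACHE_BASE="/dev/shm/$USER/fenics_${SLURM_JOB_ID}"
mkdir -p "$FENICS_CACHE_BASE"
# Point the XDG cache root at the node-local directory
export XDG_CACHE_HOME="$FENICS_CACHE_BASE/.cache"
mkdir -p "$XDG_CACHE_HOME"
```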
This should write the cache files to node-local storage on the compute node. However, I observe that the jobs still access the cache directory in HOME (~/.cache/fenics), which appears to trigger the NFS errors reported by the cluster administrators.
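To rule out the variable simply not reaching the job, a diagnostic along the following lines could be added to the job script after the exports. This is only a minimal sketch: it assumes GNU coreutils `stat` (Linux), the names `cache_fs_type` and `cache_root` are made up for illustration, and it reports only what the shell sees, not which directory dolfinx resolves internally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the filesystem type backing a directory.
# Assumes GNU coreutils `stat`; typical values are "tmpfs" or "nfs".
cache_fs_type() {
    stat -f -c %T "$1"
}

# Cache root as seen by the shell; falls back to ~/.cache when
# XDG_CACHE_HOME is unset (the XDG default).
cache_root="${XDG_CACHE_HOME:-$HOME/.cache}"

# Report the variable and the filesystem the cache root lives on.
echo "XDG_CACHE_HOME=${XDG_CACHE_HOME:-<unset>}"
if [ -d "$cache_root" ]; then
    echo "cache root: $cache_root ($(cache_fs_type "$cache_root"))"
else
    echo "cache root: $cache_root (does not exist yet)"
fi
```

If the reported type is still `nfs`, or the variable shows as unset on the compute node, the export is not taking effect in the environment where Python runs.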
Hence, I would like to know whether setting XDG_CACHE_HOME is the correct and recommended way to redirect the FEniCS JIT cache. Is it expected that FEniCS may still access ~/.cache/fenics even after the cache location has been redirected?
Also, would it be better to install the conda environment in the parallel filesystem workspace (/data/…) instead of HOME to avoid NFS activity?
My goal is to configure the FEniCS cache in a way that avoids NFS load and follows HPC best practices. Any recommendations or examples would be greatly appreciated.
Thank you very much for your help!