Caching in FEniCSx and NFS errors

Hello everyone,

I am running FEniCSx (dolfinx) simulations on an HPC cluster and recently encountered issues related to JIT caching and NFS load. The cluster administrators observed a large number of NFS operations originating from the FEniCS cache directory in my HOME filesystem. The simulation jobs were repeatedly accessing the default cache location, ~/.cache/fenics, which put heavy load on the NFS filesystem.

Example log messages from the NFS server:

NFSv4.1 OP_OPEN NFS4ERR_EXIST
File:/export/home_b/home_9/…/.cache/fenics

Software versions:
FEniCSx (dolfinx) version: 0.8
Python version: 3.10

The cluster setup details are:

  • HPC cluster using SLURM
  • HOME directory located on NFS
  • Parallel workspace available at /data/…
  • Node-local memory available at /dev/shm
  • Using conda environment with fenicsx

I tried to redirect the cache by setting the environment variable XDG_CACHE_HOME inside the SLURM job script:

```
# Per-job cache base on node-local storage
export FENICS_CACHE_BASE="/dev/shm/$USER/fenics_${SLURM_JOB_ID}"
mkdir -p "$FENICS_CACHE_BASE"

# Point the XDG cache root at the per-job directory
export XDG_CACHE_HOME="$FENICS_CACHE_BASE/.cache"
mkdir -p "$XDG_CACHE_HOME"
```
This should write the cache files to the compute node's local storage. However, I observed that the jobs were still accessing the cache directory in HOME (~/.cache/fenics), which appears to trigger the NFS errors reported by the cluster administrators.
Hence, I would like to know whether setting XDG_CACHE_HOME is the correct and recommended way to redirect the FEniCS JIT cache. Is it expected that FEniCS may still access ~/.cache/fenics even after the cache location has been redirected?
Also, would it be better to install the conda environment in the parallel filesystem workspace (/data/…) instead of HOME to avoid NFS activity?
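
One way to narrow this down is to check, from inside the job, which cache directory the Python process would actually resolve. The snippet below is only a minimal stdlib sketch: the helper name `resolved_fenics_cache` is hypothetical, and the lookup rule (use XDG_CACHE_HOME if set, otherwise ~/.cache, then append "fenics") is an assumption based on the default location mentioned above, not a call into dolfinx itself.

```python
import os
from pathlib import Path


def resolved_fenics_cache(env=None):
    """Return the cache directory implied by the environment.

    Assumed rule (not read from dolfinx): $XDG_CACHE_HOME/fenics if the
    variable is set and non-empty, otherwise ~/.cache/fenics.
    """
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "fenics"


if __name__ == "__main__":
    cache = resolved_fenics_cache()
    print("resolved cache dir:", cache)
    # True means the cache would still land somewhere below HOME
    print("located under HOME:", Path.home() in cache.parents)
```

Running this inside the SLURM job, launched the same way as the simulation, should show whether the exported variable is visible to the Python processes at all.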

My goal is to configure the FEniCS cache in a way that avoids NFS load and follows HPC best practices. Any recommendations or examples would be greatly appreciated.

Thank you very much for your help!

Ideally, the JIT files in the cache should be generated once, and only once, for any expression or weak form in your system. The problem you're describing suggests you're constructing weak forms repeatedly in the course of your calculation, for instance by redefining the form inside a time loop. The fix would be to move the definition of the form outside the loop.

The problem can also happen if your form involves a value that changes between steps, for instance a time-varying coefficient written as a plain Python number and updated at every time step: each new value changes the form and triggers a fresh compilation. In this case, the strategy would be to define the coefficient as a Constant object (dolfinx.fem.Constant, which builds on ufl.Constant) and update its value at each step instead.
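
To illustrate why this matters for the cache, here is a toy model in plain Python. It is not the dolfinx or UFL API: `jit_cache`, `compile_form`, and this `Constant` class are hypothetical stand-ins, and the only assumption is that compiled kernels are keyed by a signature of the form, with a symbolic constant's value left out of that signature.

```python
import hashlib

jit_cache = {}  # toy stand-in for the on-disk JIT cache: signature -> kernel name


class Constant:
    """Toy symbolic constant: its value is deliberately absent from repr()."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "Constant"


def compile_form(form_text):
    """Pretend to JIT-compile a form; only unseen signatures add a cache entry."""
    signature = hashlib.sha256(form_text.encode()).hexdigest()
    if signature not in jit_cache:
        jit_cache[signature] = f"kernel_{len(jit_cache)}"
    return jit_cache[signature]


def run_with_literal(n_steps):
    """Anti-pattern: the coefficient value is baked into the form at every step."""
    jit_cache.clear()
    for step in range(n_steps):
        k = 1.0 + 0.1 * step
        compile_form(f"{k!r} * inner(grad(u), grad(v)) * dx")
    return len(jit_cache)


def run_with_constant(n_steps):
    """Recommended: compile once outside the loop, update only the value inside."""
    jit_cache.clear()
    k = Constant(1.0)
    compile_form(f"{k!r} * inner(grad(u), grad(v)) * dx")
    for step in range(n_steps):
        k.value = 1.0 + 0.1 * step  # signature unchanged, nothing recompiled
    return len(jit_cache)


if __name__ == "__main__":
    print("cache entries, literal coefficient:", run_with_literal(5))
    print("cache entries, Constant coefficient:", run_with_constant(5))
```

In the toy, the literal version adds one cache entry per time step, while the Constant version adds a single entry regardless of the number of steps. In dolfinx itself, the corresponding pattern would be to build the form once before the loop with a dolfinx.fem.Constant coefficient and assign to its `value` inside the loop.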

To learn about best practices for generating forms, as Drew describes, see for instance:

Thanks for the suggestion!