Here is an update I received from someone helping on the cluster side:
It looks like we have narrowed the problem down to a large number of memory mappings over NFS (which is how our filesystems are mounted). We suspected the I/O on the ~/.cache directory, since its contents were changing while the code was running. However, reconfiguring things so that this directory lives in memory (rather than on the filesystem) still hasn’t solved the problem.
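In case it helps anyone check the same thing on their side, here is a rough sketch of how one might count the memory mappings a process holds under a given directory. It assumes a Linux system where /proc/<pid>/maps is readable; the function names `count_mapped_files` and `count_for_pid` are my own and not part of any tool mentioned above.

```python
from collections import Counter
from pathlib import Path


def count_mapped_files(maps_text: str, prefix: str) -> Counter:
    """Count memory mappings per backing file for paths under `prefix`.

    `maps_text` is the text of a Linux /proc/<pid>/maps file.
    """
    counts = Counter()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if path.startswith(prefix):
            counts[path] += 1
    return counts


def count_for_pid(pid, prefix: str) -> Counter:
    """Read /proc/<pid>/maps (Linux only) and count mappings under `prefix`."""
    text = Path(f"/proc/{pid}/maps").read_text()
    return count_mapped_files(text, prefix)
```

For example, `count_for_pid("self", str(Path.home() / ".cache")).most_common(10)` would list the ten most frequently mapped files under ~/.cache for the current process. If moving the cache to memory really took effect, one would expect mappings under the old NFS path to disappear.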
Perhaps someone here knows where to go with this.
ETA: I found a previous question that may be related: "My codes work in serial, but not in parallel due to issues with cache files". There isn’t a clear resolution there, but we can try limiting the run to a single CPU and see if that fixes the problem.
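If the single-CPU test does point at workers colliding on shared cache files, one thing that might be worth trying is giving each worker its own cache directory. The sketch below is only an idea under a big assumption: that the software in question honors the `XDG_CACHE_HOME` environment variable, which I have not confirmed. The helper name `private_cache_env` is hypothetical.

```python
import os
import tempfile


def private_cache_env(base_dir=None):
    """Return a copy of the environment pointing at a fresh cache directory.

    Assumption: the program being launched respects XDG_CACHE_HOME.
    `base_dir` would ideally be node-local storage rather than NFS.
    """
    cache_dir = tempfile.mkdtemp(prefix="cache-", dir=base_dir)
    env = dict(os.environ)
    env["XDG_CACHE_HOME"] = cache_dir
    return env
```

Each worker would then be started with something like `subprocess.run(cmd, env=private_cache_env())`, where `cmd` is whatever command launches the job, so that no two processes write to the same cache files.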