Very slow solve time on a kubernetes JupyterHub

Hello,

I am developing a Jupyter-notebook-based exercise for a remote lab course that uses FEniCS/dolfin and some associated packages. Unfortunately, it runs impossibly slowly on the university’s datahub. The issue could lie with the installation on the server, the configuration of the server, or the solver options selected.

Are there any known issues with installing FEniCS from conda-forge on a JupyterHub?

When I run the notebook on my local machine (MacBook Pro: 1 processor, 2.3 GHz, 8 GB RAM), it finishes in about 3 minutes. When I run the same script on my university’s Kubernetes-configured JupyterHub, the program gets stuck at the linear solve: each Newton iteration takes ~10 minutes on the JupyterHub, whereas it takes only seconds on my local machine.

I am installing FEniCS from conda-forge, using the same yml file on both my local machine and the JupyterHub.

I have tried increasing the CPU and memory available on the JupyterHub, but that did not fix the problem. A simple problem (ft01_poisson.py) solves successfully to completion. Both systems produce similar logs during the solve; below is an example from the first solve step. The linear solver appears to be UMFPACK.

Should I try changing the linear solver or solver parameters?

Solving nonlinear variational problem.
  Newton iteration 0: r (abs) = 4.607e-03 (tol = 1.000e-10) r (rel) = 1.000e+00 (tol = 1.000e-09)
  Solving linear system of size 7633 x 7633 (PETSc LU solver, umfpack).
  PETSc Krylov solver starting to solve 7633 x 7633 system.
  Newton iteration 1: r (abs) = 1.527e-03 (tol = 1.000e-10) r (rel) = 3.314e-01 (tol = 1.000e-09)
  Solving linear system of size 7633 x 7633 (PETSc LU solver, umfpack).
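If switching the linear solver turns out to be worth trying, this is roughly how I would do it with the legacy FEniCS (dolfin) API. This is a sketch, not my actual notebook: F, u, bcs, and J stand in for the residual form, solution function, boundary conditions, and Jacobian already defined in the exercise.

```python
# Sketch (legacy dolfin): pass solver options through solve().
# F, u, bcs, J are assumed to exist from the variational problem setup.
from dolfin import solve

solve(F == 0, u, bcs, J=J,
      solver_parameters={
          "newton_solver": {
              # Try a different direct solver, e.g. "mumps",
              # or an iterative one such as "gmres".
              "linear_solver": "mumps",
              "maximum_iterations": 25,
              "relative_tolerance": 1e-9,
          }
      })
```

Available method names can be listed with dolfin's `list_linear_solver_methods()`, so the choice above is just an example.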

Thank you for any suggestions or guidance.

Here is an update I received from someone helping on the cluster side:

It looks like we narrowed the problem down to a high number of memory mappings over NFS (which is how our filesystem mounting works). We figured it was due to the I/O in the ~/.cache directory, since it was mutating while the code ran, but changing the configuration to move this directory into memory (rather than onto the filesystem) still hasn’t solved the problem.
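Since legacy FEniCS JIT-compiles forms and caches the results under ~/.cache, one thing we can try is pointing those caches at node-local storage instead of the NFS-mounted home directory before starting the kernel. A sketch, assuming the dijitso/instant cache environment variables are honored by the installed FEniCS version:

```shell
# Redirect the generic XDG cache and the FEniCS JIT caches to local /tmp
# so repeated small writes do not go over NFS. The exact variables that
# apply depend on the FEniCS version; these target dijitso and instant.
export XDG_CACHE_HOME=/tmp/$USER/cache
export DIJITSO_CACHE_DIR=/tmp/$USER/dijitso
export INSTANT_CACHE_DIR=/tmp/$USER/instant
mkdir -p "$XDG_CACHE_HOME" "$DIJITSO_CACHE_DIR" "$INSTANT_CACHE_DIR"
```

These would need to be set in the environment the notebook kernel inherits (e.g. the single-user server's startup script), not inside the notebook after import.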

Perhaps someone on here might know where to go with this.

ETA: I found a previous question that may be related: “My codes work in serial, but not in parallel due to issues with cache files.” There isn’t a clear resolution there, but we can try limiting the run to a single CPU and see whether that fixes the problem.
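One way to test the single-CPU idea without changing the pod spec is to cap the thread counts of the numerical libraries. Oversubscribed BLAS/OpenMP threads inside a CPU-limited Kubernetes pod are a common cause of solves running orders of magnitude slower than on a laptop. A sketch, to be set before the kernel starts:

```shell
# Force single-threaded execution of the common threaded math libraries
# so the pod's CPU quota is not oversubscribed.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```

If the solve speeds up dramatically with these set, thread oversubscription under the pod's CPU limit is the likely culprit rather than the filesystem.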

I was able to change the linear algebra backend to ‘Eigen’, and the problem solved successfully on the JupyterHub cluster. I never found the root cause of the original issue; unfortunately, removing hyperthreading didn’t do the trick.
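In case it helps anyone who hits the same wall: in legacy DOLFIN, the backend switch looks roughly like this (a sketch; it must run before any matrices or vectors are assembled):

```python
# Sketch (legacy dolfin): switch the global linear algebra backend
# from the default PETSc to Eigen, which sidesteps the PETSc LU/umfpack
# path that was stalling on the cluster.
from dolfin import parameters

parameters["linear_algebra_backend"] = "Eigen"
```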