I have a rather heavy nonlinear PDE problem that I'd like to solve in parallel. I can get a decent speedup from a parallel direct linear solver by setting `ksp.getPC().setFactorSolverType("mumps")`, but then the assembly starts to become the bottleneck. I'd like to use the MPI support of FEniCSx, but although it seems to work, it gives no speedup (it only slows the calculation down).
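For reference, this is roughly how I set it up (a minimal sketch, not my full code; the `NewtonSolver` import path and the `problem` object passed in are assumptions based on recent dolfinx versions):

```python
from dolfinx.nls.petsc import NewtonSolver


def make_mumps_newton_solver(comm, problem):
    """Newton solver whose inner linear solve uses the MUMPS direct solver."""
    solver = NewtonSolver(comm, problem)
    ksp = solver.krylov_solver
    ksp.setType("preonly")           # pure direct solve, no Krylov iterations
    pc = ksp.getPC()
    pc.setType("lu")
    pc.setFactorSolverType("mumps")  # parallel direct factorization via MUMPS
    return solver
```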
I also tested the hyperelasticity example from the tutorial (Hyperelasticity — FEniCSx tutorial), since it's nonlinear and takes some time to run, with `mpirun -n 2` or `mpirun -n 4`, but it only slows the calculation down quite dramatically.
Are other people able to get a speedup with mpirun? Are there examples or demos that are supposed to show a speedup when run with multiple processes?
I know the hyperelasticity example could be a bit small for getting a lot of speedup with multiprocessing. It actually slows down by a factor of more than 2 when I run with two processes.
However, my actual simulation code speeds up many times when I set ksp.getPC().setFactorSolverType("mumps"). I can see from the system monitor that it starts using all my cores during the linear solver phase. I would expect to see a similar speedup when using MPI and the regular PETSc solver…
It would be convenient if there were a “parallelization example” with reference timings. That could serve as a standard testing code for the MPI environment.
I could consider updating the distro if there were a known issue, but probably not as a “did you try turning it off and on again” type of solution. Conda environments usually seem quite insulated from the system environment, and I haven't had distro-related issues in the past.
As seen in another recent post, setting the number of OpenMP threads might help you:
It would also help to see a breakdown of timings: how does the time spent on assembly compare to the time spent solving the linear system inside the nonlinear Newton solver?
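For example, something along these lines gives such a breakdown (a sketch assuming a dolfinx 0.7-style timing API; the timer labels are arbitrary):

```python
# Wrap the phases of interest in dolfinx Timers, then list the totals.
from mpi4py import MPI
from dolfinx import common

with common.Timer("Custom: assembly"):
    pass  # assemble the Jacobian/residual here

with common.Timer("Custom: linear solve"):
    pass  # the KSP solve goes here

# Wall-clock summary of all recorded timers, across ranks
common.list_timings(MPI.COMM_WORLD, [common.TimingType.wall])
```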
For linear equations, we have extensive performance profiling, with plots available at https://fenics.github.io/performance-test-results/, where one can check the scalability of assembly, solve, etc. over time and from one to sixteen nodes (1 node = multiple processes).
Thanks, that helped (`OMP_NUM_THREADS=1`)! Now the hyperelasticity example speeds up with more processes, actually by more than it should! My timings (with plotting output disabled):
| np | run time |
| --- | --- |
| 1 | 129.370 s |
| 2 | 24.966 s |
| 4 | 18.260 s |
(run with `OMP_NUM_THREADS=1 mpirun -n $np python hyperelasticity_example.py`)
Somehow the single-process run is very slow compared to two processes? I can't think of a logical explanation.
For anyone facing the same issue, a convenient way to permanently disable the extra threads with conda is to run `conda env config vars set OMP_NUM_THREADS=1` in the FEniCS environment.
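In full (the environment name `fenicsx-env` is just an example):

```bash
conda activate fenicsx-env                      # your FEniCSx environment
conda env config vars set OMP_NUM_THREADS=1     # stored with the environment
conda deactivate && conda activate fenicsx-env  # re-activate so it takes effect
```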
Thanks for the link to the reference benchmarks… I'd say, though, that they could be more useful if they also showed comparisons between different numbers of processes on a single node. As it stands, they just tell me that adding nodes only slows down the calculations.
They perform weak scaling, i.e. they keep the number of degrees of freedom per process constant.
This means that going from 1 to 2 nodes doubles the total number of degrees of freedom. Optimal scaling would then be a constant runtime across all runs, but that rarely happens, as networking and partitioning don't always yield the optimal layout.
Ref: Scaling - HPC Wiki
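Concretely, the usual definition (see the HPC Wiki page above) is the weak-scaling efficiency

$$\eta_\text{weak}(N) = \frac{t_1}{t_N},$$

where $t_N$ is the runtime on $N$ nodes with a fixed problem size per process; $\eta_\text{weak} \approx 1$ means ideal weak scaling.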
I’ve shown strong scaling in the various posts referred to in:
and there are several papers on the subject:
Thanks for the clarification! It seems the assembly part does well in the benchmark, but the linear solve can be a bottleneck depending on the problem type (elasticity vs. Poisson).
Any comment on why I might be getting such low performance with a single process in the hyperelasticity example compared to several processes? This is not a question I've seen arise before…
OK, the reason for the bad performance with one process seems to be the default PETSc LU solver. When I switch to MUMPS, the performance relative to 2 processes is as expected. With 2 or more processes, the default solver performs similarly to MUMPS. Can anyone else reproduce this?
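For reference, this is how I switch the Newton solver's inner KSP over to MUMPS via the PETSc options database (a sketch; `solver` is the dolfinx `NewtonSolver` from the tutorial):

```python
from petsc4py import PETSc


def use_mumps(solver):
    """Point an existing dolfinx NewtonSolver's inner KSP at MUMPS."""
    ksp = solver.krylov_solver
    opts = PETSc.Options()
    prefix = ksp.getOptionsPrefix()  # the Newton solver's KSP has its own options prefix
    opts[f"{prefix}ksp_type"] = "preonly"
    opts[f"{prefix}pc_type"] = "lu"
    opts[f"{prefix}pc_factor_mat_solver_type"] = "mumps"
    ksp.setFromOptions()
```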
A small note: MUMPS uses OpenMP threads on my system when `OMP_NUM_THREADS` is not 1. This gives up to a 2x speedup of the linear solve in my problem (when running without mpirun), so there is some benefit to it.
Hmm, OK, that explains it. The PETSc LU solver is the default in NewtonSolver, though, so its performance is not irrelevant. It seems many, many times slower than MUMPS in my problem: MUMPS takes ~20 seconds, but the PETSc solver took so long that I got tired of waiting and never even got a timing.
I might suggest changing the default in NewtonSolver to “mumps”, or removing any default KSP settings altogether if the responsibility should be left to the user.
Setting MUMPS as the default is a bit more problematic, as it is not guaranteed that PETSc is installed with MUMPS, meaning we would need a list of “prioritized solvers”: “mumps” → “superlu”/“superlu_dist” → “petsc” (or something along those lines).
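A rough illustration of that fallback idea (not actual dolfinx code; it assumes a petsc4py recent enough to expose `PETSc.Sys.hasExternalPackage`):

```python
from petsc4py import PETSc


def pick_factor_solver(candidates=("mumps", "superlu_dist", "superlu")):
    """Return the first available direct-solver package, falling back to PETSc's own LU."""
    for name in candidates:
        if PETSc.Sys.hasExternalPackage(name):
            return name
    return "petsc"
```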
Leaving it as the PETSc default has proven problematic, as many users don't realize that they should configure their solver.
I think the PETSc default is “gmres” + “icc” in serial and “gmres” + “bjacobi” in parallel, which will probably exhibit similar differences in scaling.
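If in doubt, one can inspect the Newton solver's inner KSP directly (a quick sketch, where `solver` is a dolfinx `NewtonSolver`):

```python
def report_linear_solver(solver):
    """Print which KSP type and preconditioner the Newton solver actually uses."""
    ksp = solver.krylov_solver
    print(f"KSP: {ksp.getType()}, PC: {ksp.getPC().getType()}")
```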