Does parallelization help MUMPS?

I’m working with hyperelasticity in DOLFIN. Is there any benefit to using the parallelization tools that commonly appear in FEniCS examples if I use MUMPS as the linear solver? Isn’t MUMPS supposed to be parallel already?

MUMPS is a direct solver that supports MPI parallelisation. So if you select MUMPS as the factorization backend for the direct solver in DOLFIN/DOLFINx and run on several MPI processes, it can yield speedups. How much depends on the size of the problem: for small systems, the overhead of communicating between processes can cost more than solving the system itself. In that case you might see no speedup at all, or a speedup when going from 1 to 2 processes but none when going from 2 to 4.
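As a minimal sketch of how the backend is usually selected in legacy DOLFIN for a nonlinear problem such as hyperelasticity: the solver options are a nested dictionary, shown here as plain Python so it runs without DOLFIN installed. The forms `F` and `J`, the function `u`, the boundary conditions `bcs` and the script name are placeholders for your own problem, and this assumes your PETSc build includes MUMPS.

```python
# Solver options for a nonlinear (e.g. hyperelastic) problem in legacy DOLFIN.
# Only the linear solver used inside each Newton iteration is changed here.
solver_parameters = {
    "newton_solver": {
        "linear_solver": "mumps",  # direct LU factorization via MUMPS
    },
}

# Inside a DOLFIN script the dictionary would be passed along these lines:
#   solve(F == 0, u, bcs, J=J, solver_parameters=solver_parameters)
# F, u, bcs and J are placeholders for your residual form, solution,
# boundary conditions and Jacobian (not defined in this sketch).
#
# The script is then started on several MPI processes, for example:
#   mpirun -n 4 python3 hyperelasticity.py   # hypothetical file name
print(solver_parameters)
```

The same script runs unchanged with `-n 1`, `-n 2`, `-n 4`, which makes it easy to check whether extra processes actually help for your problem size.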

Just subjective, but I’ve achieved decent parallel performance, as measured by strong scaling, with MUMPS going up to about 10M DoF for 2D problems. Specifically, this was Stokes coupled with the heat equation.
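For anyone who wants to measure strong scaling themselves, here is a small sketch of the bookkeeping: keep the problem size fixed, time the solve for several process counts, then compute speedup and parallel efficiency relative to the smallest run. The timings below are made-up numbers for illustration only, not measurements from the runs mentioned above.

```python
def strong_scaling(timings):
    """Map process count -> (speedup, efficiency) relative to the smallest count.

    timings: dict of {number_of_processes: wall_clock_seconds}
             for one fixed problem size.
    """
    base_procs = min(timings)
    base_time = timings[base_procs]
    result = {}
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        # Ideal speedup is procs / base_procs, so efficiency is the ratio to that.
        efficiency = speedup * base_procs / procs
        result[procs] = (speedup, efficiency)
    return result


# Hypothetical wall-clock times in seconds (illustrative only).
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}
table = strong_scaling(timings)
for procs, (speedup, efficiency) in table.items():
    print(f"{procs:3d} procs: speedup {speedup:5.2f}, efficiency {efficiency:5.2f}")
```

An efficiency that stays close to 1 as the process count grows indicates good strong scaling; a steady drop means the extra processes are mostly paying for communication.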

I’ve not been able to demonstrate good strong scaling for 3D problems. Perhaps others have suggestions.

See also: “How to choose the optimal solver for a PDE problem?” (#2 by nate)

I can offer some qualitative observations based on a homebrew PC cluster.

For 3D problems, I have noted that if the DoFs fit in the memory of a single computer (i.e. no network communication between compute nodes, only communication between cores of the same machine), I see speedup up to 4 cores using MPI. Beyond that, the speedup shows diminishing returns.

For problems that need to distribute memory over many nodes, i.e. where network communication is needed, it is best to use maybe 2 or 4 processors per node. Don’t use too many nodes for small problems. There is a balance between the speed gained by using more cores and the communication required between the nodes, which slows things down. Using bigger “chunks” reduces the communication (fewer SSH background sessions exchanging info over your network). If your simulation is broken up into many small chunks, the communication overhead begins to kill your processing performance. Experiment a bit. 2-3 cores on each node should be sufficient, but performance may well be quite problem dependent.
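The balance described above can be illustrated with a deliberately crude toy model. It is not how MUMPS or MPI actually behaves: the assumptions are that compute time divides perfectly among all cores, that each extra core on a node adds a small fixed communication cost, and that each extra node adds a larger one. The function names and the cost constants are made up for this sketch.

```python
from itertools import product


def modelled_time(work, nodes, cores_per_node, intra_cost=1.0, inter_cost=5.0):
    """Toy estimate: perfectly divided compute plus linear communication penalties.

    intra_cost and inter_cost are arbitrary illustrative constants, with
    network (inter-node) traffic assumed more expensive than in-node traffic.
    """
    compute = work / (nodes * cores_per_node)
    communication = intra_cost * (cores_per_node - 1) + inter_cost * (nodes - 1)
    return compute + communication


def best_layout(work, node_options=(1, 2, 4, 8), core_options=(1, 2, 4, 8)):
    """Return the (nodes, cores_per_node) pair with the lowest modelled time."""
    return min(product(node_options, core_options),
               key=lambda layout: modelled_time(work, *layout))


# Sweep a small, a medium and a large amount of work (arbitrary units).
for work in (8.0, 100.0, 2000.0):
    nodes, cores = best_layout(work)
    print(f"work={work:7.1f}: best layout in this toy model is "
          f"{nodes} node(s) x {cores} core(s), "
          f"time {modelled_time(work, nodes, cores):.2f}")
```

In this model a small problem is fastest on a single node with few cores, while only a large problem benefits from spreading over many nodes. Real behaviour depends on the network, the matrix structure and the solver, so timing your own runs is still the only reliable guide.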