MPI acceleration with FEniCSx

As seen in another recent post, setting the number of OpenMP threads might help you.
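As a minimal sketch of what that could look like, assuming the slowdown comes from each MPI process spawning its own pool of OpenMP threads: `OMP_NUM_THREADS` is the standard OpenMP environment variable, while the helper name `limit_openmp_threads` and the value `1` are only illustrative choices.

```python
import os


def limit_openmp_threads(num_threads: int = 1) -> None:
    """Set the standard OpenMP thread-count variable for this process.

    Hypothetical helper: call it before importing numpy, petsc4py,
    dolfinx or any other library that may start OpenMP threads,
    since the variable is typically read when the runtime initialises.
    """
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


# Assumption: one thread per MPI rank, so ranks do not compete for cores.
limit_openmp_threads(1)
print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

The same effect can be had from the shell by running `export OMP_NUM_THREADS=1` before launching the job with `mpirun`.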

It would also help to see a breakdown of timings: how does the time spent on assembly compare to the time spent solving the linear system inside the nonlinear Newton solver?
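One way to get such a breakdown is sketched below, using only the Python standard library. The functions `assemble_system` and `solve_linear_system` are hypothetical placeholders (they just sleep) standing in for your own assembly and linear solve calls, and the loop is a stand-in for the Newton iteration, not the actual solver.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per labelled phase, in seconds.
timings = defaultdict(float)


@contextmanager
def timed(label):
    """Add the wall-clock time of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def assemble_system():
    # Placeholder: replace with your Jacobian/residual assembly.
    time.sleep(0.01)


def solve_linear_system():
    # Placeholder: replace with your linear solve.
    time.sleep(0.02)


def newton_like_loop(max_iterations=3):
    # Stand-in for the Newton iteration: assemble, then solve, each step.
    for _ in range(max_iterations):
        with timed("assembly"):
            assemble_system()
        with timed("linear solve"):
            solve_linear_system()


newton_like_loop()
total = sum(timings.values())
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s ({100 * seconds / total:.0f}% of total)")
```

In an MPI run each process measures only its own time, so it is worth comparing the numbers across ranks (for example the maximum per phase) and across process counts.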

For linear problems, we have extensive profiling, with plots available at the link below. There one can check how assembly, solve, etc. scale over time and from one to sixteen nodes (1 node = multiple processes):
https://fenics.github.io/performance-test-results/