FEniCS + MPI on docker inefficient?

Hey,
I’m not sure whether I’m using FEniCS + MPI + docker in the wrong way, whether my hardware is inappropriate, or whether the setup simply doesn’t make sense:

I have started FEniCS in docker on a standard desktop 4-core i7 as described here:
https://fenics.readthedocs.io/projects/containers/en/latest/introduction.html#running-fenics-in-docker

I am running the example “demo/cpp/documented/poisson/cpp/main.cpp”, where I have changed the mesh size to “UnitSquareMesh::create({{512, 512}})” and added high-resolution timers around “solve(a == L, u, bc);” for profiling.
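For reference, a roughly equivalent setup in the legacy Python interface looks like this (a minimal sketch rather than my exact C++ code; the right-hand side and boundary condition are simplified from the documented demo):

from dolfin import *
import time

# Enlarged mesh, matching the change in the C++ demo
mesh = UnitSquareMesh(512, 512)
V = FunctionSpace(mesh, "Lagrange", 1)

# Simplified Dirichlet condition (the documented demo restricts it to x = 0 and x = 1)
bc = DirichletBC(V, Constant(0.0), "on_boundary")

u = TrialFunction(V)
v = TestFunction(V)
f = Expression("10*exp(-(pow(x[0]-0.5, 2) + pow(x[1]-0.5, 2)) / 0.02)", degree=2)
a = inner(grad(u), grad(v))*dx
L = f*v*dx

# Time only the solve, as in the C++ version
u = Function(V)
t0 = time.perf_counter()
solve(a == L, u, bc)
t1 = time.perf_counter()
print("Solve time: %.2f s" % (t1 - t0))  # printed once per MPI rank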

First I run the serial “./demo_poisson”, then “mpirun -n 4 ./demo_poisson”.
The serial solve takes about 2.3 seconds, the MPI run 2.5 seconds per process.

Shouldn’t the parallel run be at least 2-3 times faster than the serial run? I observed similar timings with a native conda installation of FEniCS on the same machine, so docker does not seem to be the problem. I know that some problems become memory-bandwidth-limited when several processes share the same RAM. If that is the most likely cause here, can you recommend a demo where a significant speedup is expected?

Thanks
Don

Hopefully someone else will provide you with the “short story”. But here are two references to provide detailed context.

Using containers for scalable and performant scientific computing

Scalable solvers for finite element problems using FEniCS


Thank you for the links. However, I don’t find an explanation in there.
Yes, docker is not optimal in HPC environments, but I was asking about a single quad-core CPU, and even docker shows almost perfect efficiency in the HPGMG-FE benchmark in the first paper. What I can’t find is a discussion of why the runtime gets worse when going from 24 to 192 processors in the FEniCS benchmark (Fig. 3 of the first paper).

This seems to be a problem similar to what I observed on a single CPU, but why? Is this a FEniCS implementation problem?

Maybe this could be the reason for the bad performance?

bmf, setting OMP_NUM_THREADS=1 is a very good point, but it didn’t change anything in my tests. I have tested a few other demos, and all of them run slower or at most ~10% faster with 4 MPI processes on the same quad-core CPU compared to the serial run.

(This has nothing to do with docker. The behaviour is the same if I run a native conda installation.)

I’d appreciate it if somebody could provide a simple FEniCS benchmark that shows good scalability on a single desktop CPU. I know that I might have configured FEniCS incorrectly or that I am using the wrong solver settings. But if none of the demos benefit from MPI, there seems to be a flaw somewhere in FEniCS.

Thanks
Don

As you are using the solve command with default options, you cannot expect good performance for larger problems, since it uses a direct solver by default.

You should change the settings to use an iterative solver, possibly with multigrid as a preconditioner; see for instance:
https://bitbucket.org/fenics-project/dolfin/src/f0a90b8fffc958ed612484cd3465e59994b695f9/demo/documented/singular-poisson/cpp/main.cpp?at=master#lines-97
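In the Python interface, a sketch of the same idea (CG with an algebraic multigrid preconditioner) could look like the following; it assumes a, L, u, bc are defined as in the Poisson demo and that your PETSc build includes hypre, so that the preconditioner name "hypre_amg" is available:

# Variant 1: pass solver options directly to solve()
solve(a == L, u, bc,
      solver_parameters={"linear_solver": "cg",
                         "preconditioner": "hypre_amg"})

# Variant 2: explicit solver object, closer to the C++ demo linked above
A, b = assemble_system(a, L, bc)
solver = PETScKrylovSolver("cg", "hypre_amg")
solver.set_operator(A)
u = Function(V)
solver.solve(u.vector(), b)

Both the Krylov method and the preconditioner are handled by PETSc, so the names that actually work depend on how PETSc was configured.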

Thanks, but I have already tested this one as well. The wall time for the instruction “solver.solve(*u.vector(), b);” is 2.0 seconds for the serial run and 3.8 seconds for the parallel run with “mpirun -n 4” (mesh size increased to 512x512 in both cases).
Can you please try yourself and report your timing results?

I ran the Python version of this demo on a computer with 50 processors and 250 GB of RAM. These are the results for a (512, 512) mesh:

fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 1 python3 demo_singular-poisson.py
Time: 1.55
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 2 python3 demo_singular-poisson.py
Time: 1.53
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 3 python3 demo_singular-poisson.py
Time: 1.10
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 4 python3 demo_singular-poisson.py
Time: 0.77
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 5 python3 demo_singular-poisson.py
Time: 0.64
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 6 python3 demo_singular-poisson.py
Time: 0.60
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 7 python3 demo_singular-poisson.py
Time: 0.57
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 8 python3 demo_singular-poisson.py
Time: 0.41
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 9 python3 demo_singular-poisson.py
Time: 0.43

Another thing to note is that the solve time is not very large, with quite a lot of variability if you run the code multiple times. (I used the quay.io/fenicsproject/dev:latest docker image.)

fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 1 python3 demo_singular-poisson.py
Time: 1.55
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 1 python3 demo_singular-poisson.py
Time: 1.59
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 1 python3 demo_singular-poisson.py
Time: 1.58
fenics@1316788ba4f3:/home/shared/dolfin/python/demo/documented/singular-poisson$ mpirun -n 1 python3 demo_singular-poisson.py
Time: 1.62

A lot of what you observe on your own system may be other processes using your cores and interfering with the performance of FEniCS.

Thank you for the detailed numbers. It’s good to see that timings improve on multi-processor machines, but it makes the numbers harder to compare to my case. I am pretty sure that other processes and timing variability have no significant effect on my timings.

Your results show slightly better scalability than mine, but a speedup of about 2 when going from 1 to 4 processes is not really impressive. I’m not sure whether this is specific to the singular-poisson example or to FEniCS. But PETSc in general shows close to 100% efficiency (>98% or >99.8%, I am not sure) on Poisson examples with ~1 second of runtime; at least that is what I remember from our classes.

So, I’d like to see some benchmark with a speedup of at least 3.6 from 1 to 4 processes. Otherwise I’ll need to look for other FEM packages.

Thanks
Don

FEniCS uses PETSc under the hood for solving linear systems. As your problems are still relatively small (fewer than 10^6 DOFs) and have a solution time of about 1 second, I don’t really see a problem with respect to scaling.

As you can see in the turbomachinery paper that Nate referenced above, FEniCS has been used for HPC problems with over 200 million DOFs; see Figure 9, page 15.

Similar to your numbers:
384 procs: ~ 150 s
4x384 procs: ~ 70 s
Not usable.
Thanks, I’m off.

I like that you don’t even look at the other results showing even better scaling, but it’s your choice. Good luck finding software suitable for your requirements. I would suggest deal.II, Firedrake, DUNE or FreeFem++ off the top of my head. There is a big jungle out there and I have probably missed 50 other large packages.

If I remember correctly, solve without a solver string specified will use the default linear solver. In FEniCS this seems to be lu, which is a serial direct solver.
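If it helps to check what your installation actually provides, the legacy Python API has helpers for listing the available solvers and preconditioners (a quick sketch; the output depends on how PETSc was built):

from dolfin import *

list_linear_solver_methods()            # e.g. "default", "lu", "cg", "gmres", ...
list_lu_solver_methods()                # which direct solvers are available
list_krylov_solver_preconditioners()    # e.g. "hypre_amg", "ilu", "jacobi", ...
print(parameters["linear_algebra_backend"])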