No speedup when running the demo code in parallel with MPI

Dear community,

I am trying to run the incompressible Navier-Stokes demo code from Bitbucket in parallel with MPI.

However, according to the timing summary, the computational time does not improve as the number of processors increases. It seems that the code is being executed several times independently, once per processor. The timing summary is shown in the figure below.

FEniCS is installed via Docker and is the latest stable version. My computer has 8 cores in total. Did I misuse the MPI command, or do I need to add extra code to parallelize the demo?
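For reference, I launch the demo roughly like this (the script name is just what I saved the demo as, so it may differ on your side):

mpirun -np 4 python3 demo_navier-stokes.py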

Could you please guide me to some resources or tips which would help me solve this problem?

Thank you,
Best regards

Try running the following:

mpirun -np 2 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.rank)"

You should see both ranks printed (possibly in either order):

0
1

If you instead see 0 printed twice, mpi4py is not using the same MPI installation as mpirun, and each process runs its own independent copy of the script, which would match the behaviour you describe.

Thank you so much for your reply. I tried this command and got the same output as yours:

mpirun -np 2 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.rank)"
0
1

The reason the code does not speed up is that the problem is very small (roughly 1000 DOFs in the velocity space and 100 in the pressure space). Running code in parallel pays off for large problems, where the mesh is partitioned and distributed over multiple processes. For a problem as small as this one, the communication overhead eats up whatever is gained from the partitioning.
This can be illustrated by refining the mesh:

# Refine the demo's mesh twice so the problem is large enough to benefit from partitioning
for _ in range(2):
    mesh = refine(mesh)

which will yield the following output:

fenics@3d1c51f37d8c:/root/shared/navier-stokes$ time sudo  mpirun -n 1 python3 demo_navier-stokes.py 
15170 1937

real    0m21.718s
user    1m20.987s
sys     4m13.590s
fenics@3d1c51f37d8c:/root/shared/navier-stokes$ time sudo  mpirun -n 2 python3 demo_navier-stokes.py 
real    0m10.502s
user    0m20.699s
sys     0m1.968s
fenics@3d1c51f37d8c:/root/shared/navier-stokes$ time sudo  mpirun -n 4 python3 demo_navier-stokes.py 
real    0m7.383s
user    0m28.660s
sys     0m2.509s

As you can observe here, going from one to two processes gives a significant speedup. Going from two to four processes still reduces the runtime, but does not halve it, because the number of DOFs per process becomes small enough that communication again takes a larger share of the time.
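If you want to check how the mesh is actually split between the processes, a minimal sketch along these lines can help. It assumes the legacy dolfin Python interface, and UnitSquareMesh is only a stand-in for the demo's actual mesh, so swap in the mesh from the demo script:

# Minimal sketch: report how many cells each MPI rank owns after partitioning.
# UnitSquareMesh is a stand-in for the demo's mesh; legacy dolfin API assumed.
from mpi4py import MPI
from dolfin import UnitSquareMesh, refine

comm = MPI.COMM_WORLD
mesh = UnitSquareMesh(16, 16)
for _ in range(2):
    mesh = refine(mesh)   # same refinement as above
print("rank", comm.rank, "of", comm.size, "owns", mesh.num_cells(), "local cells")

Running it with, e.g., mpirun -np 2 python3 check_partition.py (the file name is arbitrary) should show each rank owning roughly half of the cells, which is what allows the assembly and the linear solves to scale.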


Thank you very much for your help! With your refinement suggestion, I can now observe the speedup in parallel.