Slower runtime in parallel

Good afternoon,

I have written everything correctly for parallel running, but beyond 2 processes the parallel runs get much slower with each additional process, as shown by this mesh-generation example:

#IMPORTS
import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI

start=MPI.Wtime()

L, W, H=0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

#MESH CREATION
points=[np.array([0, 0, 0]), np.array([L, W, H])]
domain=mesh.create_box( 
    MPI.COMM_WORLD, 
    points,
    [NX,NY,NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet
    )
dim=domain.topology.dim

print("MESH CREATED", domain.comm.rank, ", time : ", MPI.Wtime()-start)

Running with WSL2 inside a Docker container, I get:

root@3540b4ce1991:~# /bin/python3 /root/paralleltest.py
MESH CREATED 0 , time :  0.11253279999982624
root@3540b4ce1991:~# mpiexec -n 2 python3 paralleltest.py
MESH CREATED 0 , time :  0.08136239999998907
MESH CREATED 1 , time :  0.08132950000003802
root@3540b4ce1991:~# mpiexec -n 4 python3 paralleltest.py
MESH CREATED 0 , time :  6.622138599999971
MESH CREATED 1 , time :  6.612982199999806
MESH CREATED 2 , time :  6.639677599999914
MESH CREATED 3 , time :  6.638597200000049
root@3540b4ce1991:~# mpiexec -n 8 python3 paralleltest.py
MESH CREATED 0 , time :  13.124793300000192
MESH CREATED 2 , time :  13.114502099999982
MESH CREATED 4 , time :  13.141399099999944
MESH CREATED 3 , time :  13.12165970000001
MESH CREATED 6 , time :  13.127462200000082
MESH CREATED 7 , time :  13.13018420000003
MESH CREATED 5 , time :  13.164611700000023
MESH CREATED 1 , time :  13.155241200000091

What is the cause of such a slowdown? Is it all the communication?
Thanks in advance for the help.

I cannot reproduce your problem:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0 , time :  0.08017910199987455
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0 , time :  0.0637212950000503
MESH CREATED 1 , time :  0.06375895700011824
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0 , time :  0.06122965600025054
MESH CREATED 1 , time :  0.06122667699992235
MESH CREATED 2 , time :  0.061187144000086846
MESH CREATED 3 , time :  0.06122557299977416
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0 , time :  0.08799018899981093
MESH CREATED 1 , time :  0.08810000099992976
MESH CREATED 2 , time :  0.08809564100010903
MESH CREATED 3 , time :  0.08810852800024804
MESH CREATED 4 , time :  0.08635853400028282
MESH CREATED 5 , time :  0.088105990999793
MESH CREATED 6 , time :  0.08810199199979252
MESH CREATED 7 , time :  0.08635979000018779

Let us for a moment consider how many cells you have on each process, by slightly modifying your script (and its timing):

#IMPORTS
import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI
import time


L, W, H=0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

#MESH CREATION
points=[np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain=mesh.create_box( 
    MPI.COMM_WORLD, 
    points,
    [NX,NY,NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet
    )
end = time.perf_counter()
dim=domain.topology.dim
imap = domain.topology.index_map(dim)  # index map describing how cells are distributed
num_cells = imap.size_local            # cells owned by this process
ghost_cells = imap.num_ghosts          # cells ghosted from neighbouring processes
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

returning

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 27360 num ghost cells 0 time : 0.07934825399934198
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 13452 num ghost cells 456 time : 0.06430566199924215
MESH CREATED 1, num owned cells 13908 num ghost cells 456 time : 0.06433292200017604
root@82df035a5e65:~/shared# mpirun -n 3 python3 mwe.py 
MESH CREATED 0, num owned cells 9120 num ghost cells 468 time : 0.0728011629980756
MESH CREATED 1, num owned cells 9120 num ghost cells 456 time : 0.07279534000190324
MESH CREATED 2, num owned cells 9120 num ghost cells 924 time : 0.07272747499882826
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 6804 num ghost cells 600 time : 0.07016314100110321
MESH CREATED 1, num owned cells 6792 num ghost cells 624 time : 0.07190052900114097
MESH CREATED 2, num owned cells 6804 num ghost cells 588 time : 0.07190053100202931
MESH CREATED 3, num owned cells 6960 num ghost cells 588 time : 0.07016271399697871
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0, num owned cells 3492 num ghost cells 420 time : 0.09275519299990265
MESH CREATED 1, num owned cells 3492 num ghost cells 660 time : 0.092620289997285
MESH CREATED 2, num owned cells 3348 num ghost cells 396 time : 0.09459212000001571
MESH CREATED 3, num owned cells 3348 num ghost cells 624 time : 0.09452851700189058
MESH CREATED 4, num owned cells 3384 num ghost cells 624 time : 0.09457868400204461
MESH CREATED 5, num owned cells 3456 num ghost cells 408 time : 0.09458514400103013
MESH CREATED 6, num owned cells 3360 num ghost cells 408 time : 0.09459226900071371
MESH CREATED 7, num owned cells 3480 num ghost cells 636 time : 0.09458201999950688

As you can see, when you use 8 processes you have only around 3400 owned cells per process, plus up to about 660 ghost cells. This means a lot of parallel communication relative to very little local work (a rough way to quantify this is sketched below).
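For illustration, here is a minimal sketch reusing imap and domain from the modified script above; it simply prints how many ghost cells each rank holds per owned cell, which gives a rough feel for the communication overhead:

# Rough indicator of communication overhead: the higher the ghost/owned
# ratio, the more of the local work consists of exchanging data with neighbours.
ratio = imap.num_ghosts / max(imap.size_local, 1)
print(f"rank {domain.comm.rank}: ghost/owned cell ratio = {ratio:.2f}")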
However, let us increase the number of cells in your mesh by changing NX, NY, NZ to 80, 80, 80; then you get the following timings:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 512000 num ghost cells 0 time : 1.9082258399976126
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 251605 num ghost cells 6402 time : 1.1913549609998881
MESH CREATED 1, num owned cells 260395 num ghost cells 6402 time : 1.1914042250027705
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 127086 num ghost cells 6322 time : 0.7972464410013345
MESH CREATED 1, num owned cells 129142 num ghost cells 6726 time : 0.7989725849984097
MESH CREATED 3, num owned cells 128956 num ghost cells 6539 time : 0.7990476900013164
MESH CREATED 2, num owned cells 126816 num ghost cells 6467 time : 0.7973435339990829
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 4, num owned cells 63760 num ghost cells 4754 time : 0.6922488029995293
MESH CREATED 0, num owned cells 64000 num ghost cells 4840 time : 0.6923211989997071
MESH CREATED 7, num owned cells 64240 num ghost cells 4846 time : 0.6940702939973562
MESH CREATED 2, num owned cells 62721 num ghost cells 4775 time : 0.6941660070006037
MESH CREATED 1, num owned cells 64000 num ghost cells 4760 time : 0.6940927770010603
MESH CREATED 3, num owned cells 65279 num ghost cells 4825 time : 0.6942647059986484
MESH CREATED 6, num owned cells 64240 num ghost cells 4846 time : 0.6942589370009955
MESH CREATED 5, num owned cells 63760 num ghost cells 4754 time : 0.6941067049992853

As you can see, the runtime is drastically reduced when going from 1 to 2 processes, and it keeps dropping up to 8 processes.
DOLFINx is built to be scalable for large computations, and we check our performance with 500 000 dofs per process. See the details of the implementation at https://github.com/FEniCS/performance-test and the scaling results at https://fenics.github.io/performance-test-results/
There you can see that the mesh creation time scales quite well on 1, 2, 4, 8 and 16 nodes of an HPC cluster.
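If you want to check how close your own problem is to that target, a small sketch along these lines prints the dofs each rank owns. The choice of a scalar P1 Lagrange space is just an example, and depending on your DOLFINx version the constructor may be fem.FunctionSpace or fem.functionspace:

from dolfinx import fem

# Example only: a scalar P1 space on the `domain` created above.
V = fem.FunctionSpace(domain, ("Lagrange", 1))
dof_map = V.dofmap.index_map
print(f"rank {domain.comm.rank}: owned dofs {dof_map.size_local}, "
      f"ghost dofs {dof_map.num_ghosts}")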


Hi Festv, I ran into the same situation as you. Have you been able to figure this out? Thanks!
/dongjun

Without a minimal reproducible example, I’m not sure it is worthwhile reopening this thread.
As I showed above, you need a reasonable number of cells per process for parallel speedup to be expected.

Expanding on @dokken’s response: parallel scaling is not automatic. There is always some part of the code which cannot be parallelised (every process runs that same code). So the total runtime t can be modelled as

t = s + \frac{p}{N}

(related to Amdahl’s law), where s is the time spent running the common serial code (which all processes must run), p is the total time spent running parallel code, and N is the number of processes. So the runtime of the parallel portion, p/N, can be reduced by adding more processes, but the total runtime can never be less than s.

If you have a small number of dofs, then p, the time required to process those dofs (in parallel), is relatively small. In that case s is a larger proportion of the total runtime, and, as @dokken said, you won’t get much (or any) speed-up for such small jobs.

If there is a large number of dofs, then p becomes much larger relative to s, so you can get a better speedup by using more processors.
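As a purely illustrative calculation (the values of s and p below are invented, not measured), you can see how quickly the serial part starts to dominate:

# Invented numbers, for illustration only.
s = 0.05  # time spent in serial code that every process must run (seconds)
p = 0.05  # total time of the parallelisable work (seconds)

for N in (1, 2, 4, 8, 16):
    t = s + p / N  # the simple model above
    print(f"N = {N:2d}: t = {t:.4f} s, speedup = {(s + p) / t:.2f}x")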

More discussion of the theory of speedup at
https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling/
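As a concrete, hypothetical illustration of weak scaling using the same API as the scripts above, one can grow the mesh with the number of ranks so that each rank keeps roughly the same number of cells; the base resolution of 40 is an arbitrary choice:

import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI

comm = MPI.COMM_WORLD
base = 40                                   # cells per direction on 1 rank (arbitrary)
n = round(base * comm.size ** (1.0 / 3.0))  # keep cells per rank roughly constant

domain = mesh.create_box(
    comm,
    [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])],
    [n, n, n],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
imap = domain.topology.index_map(domain.topology.dim)
print(f"rank {comm.rank}: owned cells {imap.size_local}")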


Thanks for your swift reply. After reinstalling using spack, the problem is now solved. Before, I installed using conda. Not sure why …

Conda is not optimized for your system: it ships pre-built binaries, which makes it fast to install but potentially slower to run.