Slower runtime in parallel

Good afternoon,

I have written everything correctly for parallel running, but beyond 2 processes the parallel runs get much slower with each additional process, as shown by this mesh-generation example:

#IMPORTS
import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI

start=MPI.Wtime()

L, W, H=0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

#MESH CREATION
points=[np.array([0, 0, 0]), np.array([L, W, H])]
domain=mesh.create_box( 
    MPI.COMM_WORLD, 
    points,
    [NX,NY,NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet
    )
dim=domain.topology.dim

print("MESH CREATED", domain.comm.rank, ", time : ", MPI.Wtime()-start)

Running with WSL2 inside a Docker container, I get:

root@3540b4ce1991:~# /bin/python3 /root/paralleltest.py
MESH CREATED 0 , time :  0.11253279999982624
root@3540b4ce1991:~# mpiexec -n 2 python3 paralleltest.py
MESH CREATED 0 , time :  0.08136239999998907
MESH CREATED 1 , time :  0.08132950000003802
root@3540b4ce1991:~# mpiexec -n 4 python3 paralleltest.py
MESH CREATED 0 , time :  6.622138599999971
MESH CREATED 1 , time :  6.612982199999806
MESH CREATED 2 , time :  6.639677599999914
MESH CREATED 3 , time :  6.638597200000049
root@3540b4ce1991:~# mpiexec -n 8 python3 paralleltest.py
MESH CREATED 0 , time :  13.124793300000192
MESH CREATED 2 , time :  13.114502099999982
MESH CREATED 4 , time :  13.141399099999944
MESH CREATED 3 , time :  13.12165970000001
MESH CREATED 6 , time :  13.127462200000082
MESH CREATED 7 , time :  13.13018420000003
MESH CREATED 5 , time :  13.164611700000023
MESH CREATED 1 , time :  13.155241200000091

What is the cause of such a slowdown? Is it all the communication?
Thanks in advance for the help.

I cannot reproduce your problem:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0 , time :  0.08017910199987455
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0 , time :  0.0637212950000503
MESH CREATED 1 , time :  0.06375895700011824
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0 , time :  0.06122965600025054
MESH CREATED 1 , time :  0.06122667699992235
MESH CREATED 2 , time :  0.061187144000086846
MESH CREATED 3 , time :  0.06122557299977416
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0 , time :  0.08799018899981093
MESH CREATED 1 , time :  0.08810000099992976
MESH CREATED 2 , time :  0.08809564100010903
MESH CREATED 3 , time :  0.08810852800024804
MESH CREATED 4 , time :  0.08635853400028282
MESH CREATED 5 , time :  0.088105990999793
MESH CREATED 6 , time :  0.08810199199979252
MESH CREATED 7 , time :  0.08635979000018779

Let us for a moment consider how many cells you have on each process, by slightly modifying your script (and its timing):

#IMPORTS
import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI
import time


L, W, H=0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

#MESH CREATION
points=[np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain=mesh.create_box( 
    MPI.COMM_WORLD, 
    points,
    [NX,NY,NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet
    )
end = time.perf_counter()
dim=domain.topology.dim
imap = domain.topology.index_map(dim)  # index map describing how cells are distributed
num_cells = imap.size_local            # cells owned by this process
ghost_cells = imap.num_ghosts          # cells ghosted from neighbouring processes
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

returning

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 27360 num ghost cells 0 time : 0.07934825399934198
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 13452 num ghost cells 456 time : 0.06430566199924215
MESH CREATED 1, num owned cells 13908 num ghost cells 456 time : 0.06433292200017604
root@82df035a5e65:~/shared# mpirun -n 3 python3 mwe.py 
MESH CREATED 0, num owned cells 9120 num ghost cells 468 time : 0.0728011629980756
MESH CREATED 1, num owned cells 9120 num ghost cells 456 time : 0.07279534000190324
MESH CREATED 2, num owned cells 9120 num ghost cells 924 time : 0.07272747499882826
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 6804 num ghost cells 600 time : 0.07016314100110321
MESH CREATED 1, num owned cells 6792 num ghost cells 624 time : 0.07190052900114097
MESH CREATED 2, num owned cells 6804 num ghost cells 588 time : 0.07190053100202931
MESH CREATED 3, num owned cells 6960 num ghost cells 588 time : 0.07016271399697871
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0, num owned cells 3492 num ghost cells 420 time : 0.09275519299990265
MESH CREATED 1, num owned cells 3492 num ghost cells 660 time : 0.092620289997285
MESH CREATED 2, num owned cells 3348 num ghost cells 396 time : 0.09459212000001571
MESH CREATED 3, num owned cells 3348 num ghost cells 624 time : 0.09452851700189058
MESH CREATED 4, num owned cells 3384 num ghost cells 624 time : 0.09457868400204461
MESH CREATED 5, num owned cells 3456 num ghost cells 408 time : 0.09458514400103013
MESH CREATED 6, num owned cells 3360 num ghost cells 408 time : 0.09459226900071371
MESH CREATED 7, num owned cells 3480 num ghost cells 636 time : 0.09458201999950688

As you can see, when you use 8 processes you have only around 3400 owned cells per process, plus up to about 660 ghost cells. This means a lot of parallel communication relative to very little local work (a rough way to quantify this is sketched below).
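For illustration, here is a minimal sketch reusing imap and domain from the modified script above; it simply prints how many ghost cells each rank holds per owned cell, which gives a rough feel for the communication overhead:

# Rough indicator of communication overhead: the higher the ghost/owned
# ratio, the more of the local work consists of exchanging data with neighbours.
ratio = imap.num_ghosts / max(imap.size_local, 1)
print(f"rank {domain.comm.rank}: ghost/owned cell ratio = {ratio:.2f}")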
However, let us increase the number of cells in your mesh by changing NX, NY, NZ to 80, 80, 80; then you get the following timings:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 512000 num ghost cells 0 time : 1.9082258399976126
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 251605 num ghost cells 6402 time : 1.1913549609998881
MESH CREATED 1, num owned cells 260395 num ghost cells 6402 time : 1.1914042250027705
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 127086 num ghost cells 6322 time : 0.7972464410013345
MESH CREATED 1, num owned cells 129142 num ghost cells 6726 time : 0.7989725849984097
MESH CREATED 3, num owned cells 128956 num ghost cells 6539 time : 0.7990476900013164
MESH CREATED 2, num owned cells 126816 num ghost cells 6467 time : 0.7973435339990829
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 4, num owned cells 63760 num ghost cells 4754 time : 0.6922488029995293
MESH CREATED 0, num owned cells 64000 num ghost cells 4840 time : 0.6923211989997071
MESH CREATED 7, num owned cells 64240 num ghost cells 4846 time : 0.6940702939973562
MESH CREATED 2, num owned cells 62721 num ghost cells 4775 time : 0.6941660070006037
MESH CREATED 1, num owned cells 64000 num ghost cells 4760 time : 0.6940927770010603
MESH CREATED 3, num owned cells 65279 num ghost cells 4825 time : 0.6942647059986484
MESH CREATED 6, num owned cells 64240 num ghost cells 4846 time : 0.6942589370009955
MESH CREATED 5, num owned cells 63760 num ghost cells 4754 time : 0.6941067049992853

As you can see, the runtime is drastically reduced when going from 1 to 2 processes, and it keeps dropping up to 8 processes.
DOLFINx is built to be scalable for large computations, and we check our performance with 500 000 dofs per process. See the details of the implementation at https://github.com/FEniCS/performance-test and the scaling results at https://fenics.github.io/performance-test-results/
There you can see that the mesh creation time scales quite well on 1, 2, 4, 8 and 16 nodes of an HPC cluster.
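If you want to check how close your own problem is to that target, a small sketch along these lines prints the dofs each rank owns. The choice of a scalar P1 Lagrange space is just an example, and depending on your DOLFINx version the constructor may be fem.FunctionSpace or fem.functionspace:

from dolfinx import fem

# Example only: a scalar P1 space on the `domain` created above.
V = fem.FunctionSpace(domain, ("Lagrange", 1))
dof_map = V.dofmap.index_map
print(f"rank {domain.comm.rank}: owned dofs {dof_map.size_local}, "
      f"ghost dofs {dof_map.num_ghosts}")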


Hi Festv, I ran into the same situation as you. Have you been able to figure this out? Thanks!
/dongjun

Without a minimal reproducible example, I’m not sure it is worthwhile reopening this thread.
As I showed above, you need a reasonable number of cells per process for parallel speedup to be expected.

Expanding on @dokken’s response: parallel scaling is not automatic. There is always some part of the code which cannot be parallelised (every process runs that same code). So the total runtime t can be modelled as

t = s + \frac{p}{N}

(related to Amdahl’s law), where s is the time spent running the common serial code (which all processes must run), p is the total time spent running parallel code, and N is the number of processes. So the runtime of the parallel portion, p/N, can be reduced by adding more processes, but the total runtime can never be less than s.

If you have a small number of dofs, then p, the time required to process those dofs (in parallel), is relatively small. In that case s is a larger proportion of the total runtime, and, as @dokken said, you won’t get much (or any) speed-up for such small jobs.

If there is a large number of dofs, then p becomes much larger relative to s, so you can get a better speedup by using more processors.
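As a purely illustrative calculation (the values of s and p below are invented, not measured), you can see how quickly the serial part starts to dominate:

# Invented numbers, for illustration only.
s = 0.05  # time spent in serial code that every process must run (seconds)
p = 0.05  # total time of the parallelisable work (seconds)

for N in (1, 2, 4, 8, 16):
    t = s + p / N  # the simple model above
    print(f"N = {N:2d}: t = {t:.4f} s, speedup = {(s + p) / t:.2f}x")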

More discussion of the theory of speedup at
https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling/
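As a concrete, hypothetical illustration of weak scaling using the same API as the scripts above, one can grow the mesh with the number of ranks so that each rank keeps roughly the same number of cells; the base resolution of 40 is an arbitrary choice:

import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI

comm = MPI.COMM_WORLD
base = 40                                   # cells per direction on 1 rank (arbitrary)
n = round(base * comm.size ** (1.0 / 3.0))  # keep cells per rank roughly constant

domain = mesh.create_box(
    comm,
    [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])],
    [n, n, n],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
imap = domain.topology.index_map(domain.topology.dim)
print(f"rank {comm.rank}: owned cells {imap.size_local}")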


Thanks for your swift reply. After reinstalling using spack, the problem is now solved. Before, I installed using conda. Not sure why …

Conda is not optimized for your system: it ships pre-built binaries, which makes it fast to install but potentially slower to run.