Parallel slower runtime

I cannot reproduce your problem:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0 , time :  0.08017910199987455
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0 , time :  0.0637212950000503
MESH CREATED 1 , time :  0.06375895700011824
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0 , time :  0.06122965600025054
MESH CREATED 1 , time :  0.06122667699992235
MESH CREATED 2 , time :  0.061187144000086846
MESH CREATED 3 , time :  0.06122557299977416
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0 , time :  0.08799018899981093
MESH CREATED 1 , time :  0.08810000099992976
MESH CREATED 2 , time :  0.08809564100010903
MESH CREATED 3 , time :  0.08810852800024804
MESH CREATED 4 , time :  0.08635853400028282
MESH CREATED 5 , time :  0.088105990999793
MESH CREATED 6 , time :  0.08810199199979252
MESH CREATED 7 , time :  0.08635979000018779

Let us for a moment consider how many cells you have on each process, by slightly modifying your script (and its timings):

# IMPORTS
import time

import numpy as np
from mpi4py import MPI

from dolfinx import cpp, mesh

L, W, H = 0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

# MESH CREATION
points = [np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain = mesh.create_box(
    MPI.COMM_WORLD,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
end = time.perf_counter()

# Count owned and ghost cells through the cell index map
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

which returns:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 27360 num ghost cells 0 time : 0.07934825399934198
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 13452 num ghost cells 456 time : 0.06430566199924215
MESH CREATED 1, num owned cells 13908 num ghost cells 456 time : 0.06433292200017604
root@82df035a5e65:~/shared# mpirun -n 3 python3 mwe.py 
MESH CREATED 0, num owned cells 9120 num ghost cells 468 time : 0.0728011629980756
MESH CREATED 1, num owned cells 9120 num ghost cells 456 time : 0.07279534000190324
MESH CREATED 2, num owned cells 9120 num ghost cells 924 time : 0.07272747499882826
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 6804 num ghost cells 600 time : 0.07016314100110321
MESH CREATED 1, num owned cells 6792 num ghost cells 624 time : 0.07190052900114097
MESH CREATED 2, num owned cells 6804 num ghost cells 588 time : 0.07190053100202931
MESH CREATED 3, num owned cells 6960 num ghost cells 588 time : 0.07016271399697871
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0, num owned cells 3492 num ghost cells 420 time : 0.09275519299990265
MESH CREATED 1, num owned cells 3492 num ghost cells 660 time : 0.092620289997285
MESH CREATED 2, num owned cells 3348 num ghost cells 396 time : 0.09459212000001571
MESH CREATED 3, num owned cells 3348 num ghost cells 624 time : 0.09452851700189058
MESH CREATED 4, num owned cells 3384 num ghost cells 624 time : 0.09457868400204461
MESH CREATED 5, num owned cells 3456 num ghost cells 408 time : 0.09458514400103013
MESH CREATED 6, num owned cells 3360 num ghost cells 408 time : 0.09459226900071371
MESH CREATED 7, num owned cells 3480 num ghost cells 636 time : 0.09458201999950688

As you can see, when you use 8 processes each process owns only about 3,400 cells, plus roughly 400 to 660 ghost cells on top of that. This means a lot of parallel communication relative to the amount of local work.
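One way to quantify this is to print the ghost-to-owned ratio on each rank together with the global totals. The snippet below is a minimal sketch, not part of the original script; it assumes it is appended to the script above after the mesh has been created, and only uses the index map and the communicator already available there:

# Minimal sketch: quantify the communication overhead per rank
# (assumption: appended to the script above, after the mesh is created)
ratio = ghost_cells / max(num_cells, 1)
total_owned = domain.comm.allreduce(num_cells, op=MPI.SUM)
total_ghosts = domain.comm.allreduce(ghost_cells, op=MPI.SUM)
print(f"rank {domain.comm.rank}: ghost/owned = {ratio:.2f}, global owned = {total_owned}, global ghosts = {total_ghosts}")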
If we instead increase the number of cells in your mesh, by changing NX, NY, NZ to 80, 80, 80, you get the following timings:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 512000 num ghost cells 0 time : 1.9082258399976126
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 251605 num ghost cells 6402 time : 1.1913549609998881
MESH CREATED 1, num owned cells 260395 num ghost cells 6402 time : 1.1914042250027705
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 127086 num ghost cells 6322 time : 0.7972464410013345
MESH CREATED 1, num owned cells 129142 num ghost cells 6726 time : 0.7989725849984097
MESH CREATED 3, num owned cells 128956 num ghost cells 6539 time : 0.7990476900013164
MESH CREATED 2, num owned cells 126816 num ghost cells 6467 time : 0.7973435339990829
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 4, num owned cells 63760 num ghost cells 4754 time : 0.6922488029995293
MESH CREATED 0, num owned cells 64000 num ghost cells 4840 time : 0.6923211989997071
MESH CREATED 7, num owned cells 64240 num ghost cells 4846 time : 0.6940702939973562
MESH CREATED 2, num owned cells 62721 num ghost cells 4775 time : 0.6941660070006037
MESH CREATED 1, num owned cells 64000 num ghost cells 4760 time : 0.6940927770010603
MESH CREATED 3, num owned cells 65279 num ghost cells 4825 time : 0.6942647059986484
MESH CREATED 6, num owned cells 64240 num ghost cells 4846 time : 0.6942589370009955
MESH CREATED 5, num owned cells 63760 num ghost cells 4754 time : 0.6941067049992853

As you can see, the runtime is drastically reduced when going from 1 to 2 processes, and keeps dropping as more processes are added.
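As a rough sanity check (not part of the original post), one can turn the wall times above into speedup and parallel efficiency figures; the numbers below are simply the times printed above, rounded:

# Rough speedup/efficiency estimate from the measured wall times above
times = {1: 1.91, 2: 1.19, 4: 0.80, 8: 0.69}  # seconds, rounded from the output above
for n, t in times.items():
    speedup = times[1] / t
    print(f"{n} processes: speedup {speedup:.1f}x, efficiency {speedup / n:.0%}")

The efficiency drops as the mesh per process shrinks, which is exactly the effect you see with the smaller 60x38x12 mesh.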
DOLFINx is built to be scalable for large computations, and we check our performance with 500,000 degrees of freedom per process. See the implementation details at https://github.com/FEniCS/performance-test (Mini App for FEniCS-X performance testing) and the scaling results at https://fenics.github.io/performance-test-results/.
There you can see that the mesh creation time scales quite well on 1, 2, 4, 8 and 16 nodes of an HPC cluster.
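Finally, a small practical note on timing parallel runs: time.perf_counter() is measured independently on each rank, so the per-process numbers can differ slightly. It is often more informative to synchronize the ranks before the timed region and report the maximum time over all ranks. Below is a minimal sketch of that, assuming the same imports and mesh parameters as in the script above:

# Time mesh creation with a barrier first, then report the slowest rank
comm = MPI.COMM_WORLD
comm.Barrier()  # make sure all ranks start the timed region together
start = time.perf_counter()
domain = mesh.create_box(
    comm,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
elapsed = time.perf_counter() - start
max_elapsed = comm.allreduce(elapsed, op=MPI.MAX)  # slowest rank dominates the run
if comm.rank == 0:
    print(f"Mesh creation (max over {comm.size} ranks): {max_elapsed:.4f} s")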
