Parallel slower runtime

I cannot reproduce your problem:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0 , time :  0.08017910199987455
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0 , time :  0.0637212950000503
MESH CREATED 1 , time :  0.06375895700011824
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0 , time :  0.06122965600025054
MESH CREATED 1 , time :  0.06122667699992235
MESH CREATED 2 , time :  0.061187144000086846
MESH CREATED 3 , time :  0.06122557299977416
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0 , time :  0.08799018899981093
MESH CREATED 1 , time :  0.08810000099992976
MESH CREATED 2 , time :  0.08809564100010903
MESH CREATED 3 , time :  0.08810852800024804
MESH CREATED 4 , time :  0.08635853400028282
MESH CREATED 5 , time :  0.088105990999793
MESH CREATED 6 , time :  0.08810199199979252
MESH CREATED 7 , time :  0.08635979000018779

Let us for a moment consider how many cells you have on each process, by slightly modifying your script (and its timings):

# IMPORTS
import time

import numpy as np
from mpi4py import MPI

from dolfinx import cpp, mesh

L, W, H = 0.303, 0.19, 0.064
NX, NY, NZ = 60, 38, 12

# MESH CREATION
points = [np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain = mesh.create_box(
    MPI.COMM_WORLD,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
end = time.perf_counter()

# Count owned and ghost cells through the cell index map
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

which returns:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 27360 num ghost cells 0 time : 0.07934825399934198
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 13452 num ghost cells 456 time : 0.06430566199924215
MESH CREATED 1, num owned cells 13908 num ghost cells 456 time : 0.06433292200017604
root@82df035a5e65:~/shared# mpirun -n 3 python3 mwe.py 
MESH CREATED 0, num owned cells 9120 num ghost cells 468 time : 0.0728011629980756
MESH CREATED 1, num owned cells 9120 num ghost cells 456 time : 0.07279534000190324
MESH CREATED 2, num owned cells 9120 num ghost cells 924 time : 0.07272747499882826
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 6804 num ghost cells 600 time : 0.07016314100110321
MESH CREATED 1, num owned cells 6792 num ghost cells 624 time : 0.07190052900114097
MESH CREATED 2, num owned cells 6804 num ghost cells 588 time : 0.07190053100202931
MESH CREATED 3, num owned cells 6960 num ghost cells 588 time : 0.07016271399697871
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 0, num owned cells 3492 num ghost cells 420 time : 0.09275519299990265
MESH CREATED 1, num owned cells 3492 num ghost cells 660 time : 0.092620289997285
MESH CREATED 2, num owned cells 3348 num ghost cells 396 time : 0.09459212000001571
MESH CREATED 3, num owned cells 3348 num ghost cells 624 time : 0.09452851700189058
MESH CREATED 4, num owned cells 3384 num ghost cells 624 time : 0.09457868400204461
MESH CREATED 5, num owned cells 3456 num ghost cells 408 time : 0.09458514400103013
MESH CREATED 6, num owned cells 3360 num ghost cells 408 time : 0.09459226900071371
MESH CREATED 7, num owned cells 3480 num ghost cells 636 time : 0.09458201999950688

As you can see, when you use 8 processes each process owns only about 3,400 cells, plus roughly 400 to 660 ghost cells on top of that. This means a lot of parallel communication relative to the amount of local work.
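One way to quantify this is to print the ghost-to-owned ratio on each rank together with the global totals. The snippet below is a minimal sketch, not part of the original script; it assumes it is appended to the script above after the mesh has been created, and only uses the index map and the communicator already available there:

# Minimal sketch: quantify the communication overhead per rank
# (assumption: appended to the script above, after the mesh is created)
ratio = ghost_cells / max(num_cells, 1)
total_owned = domain.comm.allreduce(num_cells, op=MPI.SUM)
total_ghosts = domain.comm.allreduce(ghost_cells, op=MPI.SUM)
print(f"rank {domain.comm.rank}: ghost/owned = {ratio:.2f}, global owned = {total_owned}, global ghosts = {total_ghosts}")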
If we instead increase the number of cells in your mesh, by changing NX, NY, NZ to 80, 80, 80, you get the following timings:

root@82df035a5e65:~/shared# python3 mwe.py 
MESH CREATED 0, num owned cells 512000 num ghost cells 0 time : 1.9082258399976126
root@82df035a5e65:~/shared# mpirun -n 2 python3 mwe.py 
MESH CREATED 0, num owned cells 251605 num ghost cells 6402 time : 1.1913549609998881
MESH CREATED 1, num owned cells 260395 num ghost cells 6402 time : 1.1914042250027705
root@82df035a5e65:~/shared# mpirun -n 4 python3 mwe.py 
MESH CREATED 0, num owned cells 127086 num ghost cells 6322 time : 0.7972464410013345
MESH CREATED 1, num owned cells 129142 num ghost cells 6726 time : 0.7989725849984097
MESH CREATED 3, num owned cells 128956 num ghost cells 6539 time : 0.7990476900013164
MESH CREATED 2, num owned cells 126816 num ghost cells 6467 time : 0.7973435339990829
root@82df035a5e65:~/shared# mpirun -n 8 python3 mwe.py 
MESH CREATED 4, num owned cells 63760 num ghost cells 4754 time : 0.6922488029995293
MESH CREATED 0, num owned cells 64000 num ghost cells 4840 time : 0.6923211989997071
MESH CREATED 7, num owned cells 64240 num ghost cells 4846 time : 0.6940702939973562
MESH CREATED 2, num owned cells 62721 num ghost cells 4775 time : 0.6941660070006037
MESH CREATED 1, num owned cells 64000 num ghost cells 4760 time : 0.6940927770010603
MESH CREATED 3, num owned cells 65279 num ghost cells 4825 time : 0.6942647059986484
MESH CREATED 6, num owned cells 64240 num ghost cells 4846 time : 0.6942589370009955
MESH CREATED 5, num owned cells 63760 num ghost cells 4754 time : 0.6941067049992853

As you can see, the runtime is drastically reduced when going from 1 to 2 processes, and keeps dropping as more processes are added.
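As a rough sanity check (not part of the original post), one can turn the wall times above into speedup and parallel efficiency figures; the numbers below are simply the times printed above, rounded:

# Rough speedup/efficiency estimate from the measured wall times above
times = {1: 1.91, 2: 1.19, 4: 0.80, 8: 0.69}  # seconds, rounded from the output above
for n, t in times.items():
    speedup = times[1] / t
    print(f"{n} processes: speedup {speedup:.1f}x, efficiency {speedup / n:.0%}")

The efficiency drops as the mesh per process shrinks, which is exactly the effect you see with the smaller 60x38x12 mesh.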
DOLFINx is built to be scalable for large computations, and we check our performance with 500,000 degrees of freedom per process. See the implementation details at https://github.com/FEniCS/performance-test (Mini App for FEniCS-X performance testing) and the scaling results at https://fenics.github.io/performance-test-results/.
There you can see that the mesh creation time scales quite well on 1, 2, 4, 8 and 16 nodes of an HPC cluster.
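Finally, a small practical note on timing parallel runs: time.perf_counter() is measured independently on each rank, so the per-process numbers can differ slightly. It is often more informative to synchronize the ranks before the timed region and report the maximum time over all ranks. Below is a minimal sketch of that, assuming the same imports and mesh parameters as in the script above:

# Time mesh creation with a barrier first, then report the slowest rank
comm = MPI.COMM_WORLD
comm.Barrier()  # make sure all ranks start the timed region together
start = time.perf_counter()
domain = mesh.create_box(
    comm,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
elapsed = time.perf_counter() - start
max_elapsed = comm.allreduce(elapsed, op=MPI.MAX)  # slowest rank dominates the run
if comm.rank == 0:
    print(f"Mesh creation (max over {comm.size} ranks): {max_elapsed:.4f} s")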
