Parallel slower runtime, MPI Anaconda

Continuing the discussion from Parallel slower runtime:

Hello everyone,
I am running some code using dolfinx and am trying to figure out how it scales.
I am using the code from Parallel slower runtime - #2 by dokken
in a freshly built Anaconda environment, set up as proposed at https://fenicsproject.org/download/

When doing just that, my code gets significantly slower the more MPI processes I use…
Here is the adapted code:

import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI
import time


L, W, H = 10, 10, 10
NX, NY, NZ = 100, 100, 100

# MESH CREATION
points = [np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain = mesh.create_box(
    MPI.COMM_WORLD,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

And these are the timing results:

mpirun -n 1 python parallelTest.py 
MESH CREATED 0, num owned cells 1000000 num ghost cells 0 time : 4.300701394677162

mpirun -n 2 python parallelTest.py 
MESH CREATED 0, num owned cells 500000 num ghost cells 10000 time : 39.404316327068955
MESH CREATED 1, num owned cells 500000 num ghost cells 10000 time : 39.40367920091376

mpirun -n 4 python parallelTest.py 
MESH CREATED 3, num owned cells 251000 num ghost cells 10170 time : 75.6433406448923
MESH CREATED 2, num owned cells 250539 num ghost cells 10257 time : 75.64872138295323
MESH CREATED 0, num owned cells 249461 num ghost cells 10040 time : 75.64881594991311
MESH CREATED 1, num owned cells 249000 num ghost cells 10130 time : 75.64861655794084

I am a bit unsure how to solve that problem and would be super grateful for advice!

Doing some debugging, it seems like the issue is the mesh partitioner ParMETIS:

import numpy as np
from mpi4py import MPI

from dolfinx import mesh, log, common
import time
log.set_log_level(log.LogLevel.INFO)

N = 50
MPI.COMM_WORLD.Barrier()
start = time.perf_counter()
domain = mesh.create_unit_cube(
    MPI.COMM_WORLD,
    N, N, N,
    ghost_mode=mesh.GhostMode.none,
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")


common.list_timings(domain.comm, [common.TimingType.wall])

Serial timings:

[MPI_MAX] Summary of timings                                                |  reps  wall avg  wall tot
-------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  0.985317  0.985317
Build dofmap data                                                           |     1  0.021822  0.021822
Compute dof reordering map                                                  |     1  0.002875  0.002875
Compute local part of mesh dual graph                                       |     1  0.329196  0.329196
Compute local-to-local map                                                  |     1  0.012753  0.012753
Compute-local-to-global links for global/local adjacency list               |     1  0.003820  0.003820
Distribute row-wise data (scalable)                                         |     1  0.001776  0.001776
GPS: create_level_structure                                                 |     2  0.007475  0.014949
Gibbs-Poole-Stockmeyer ordering                                             |     1  0.056955  0.056955
Init dofmap from element dofmap                                             |     1  0.016402  0.016402
Topology: create                                                            |     1  0.452937  0.452937
Topology: determine shared index ownership                                  |     1  0.000697  0.000697
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1  0.039235  0.039235

Parallel timings:

Build BoxMesh                                                               |     1  19.446675  19.446675
Build dofmap data                                                           |     1   0.012467   0.012467
Compute dof reordering map                                                  |     1   0.001648   0.001648
Compute graph partition (ParMETIS)                                          |     1  18.675038  18.675038
Compute local part of mesh dual graph                                       |     2   0.190864   0.381727
Compute local-to-local map                                                  |     1   0.006359   0.006359
Compute non-local part of mesh dual graph                                   |     1   0.012398   0.012398
Compute-local-to-global links for global/local adjacency list               |     1   0.002430   0.002430
Distribute AdjacencyList nodes to destination ranks                         |     1   0.052734   0.052734
Distribute row-wise data (scalable)                                         |     1   0.003325   0.003325
GPS: create_level_structure                                                 |     2   0.002957   0.008260
Gibbs-Poole-Stockmeyer ordering                                             |     1   0.022690   0.022690
Init dofmap from element dofmap                                             |     1   0.008993   0.008993
ParMETIS: call ParMETIS_V3_PartKway                                         |     1  18.672538  18.672538
Topology: create                                                            |     1   0.252157   0.252157
Topology: determine shared index ownership                                  |     1   0.010045   0.010045
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1   0.019589   0.019589

There seems to be a similar issue with SCOTCH:

import numpy as np
from mpi4py import MPI

from dolfinx import mesh, log, common
import time

from dolfinx.graph import partitioner_parmetis, partitioner_scotch
log.set_log_level(log.LogLevel.INFO)
N = 50
MPI.COMM_WORLD.Barrier()
start = time.perf_counter()
domain = mesh.create_unit_cube(
    MPI.COMM_WORLD,
    N, N, N,
    ghost_mode=mesh.GhostMode.none,
    partitioner=mesh.create_cell_partitioner(partitioner_scotch()),
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")


common.list_timings(domain.comm, [common.TimingType.wall])

Could you try to create a new environment with:

conda install -c conda-forge fenics-dolfinx openmpi pyvista

as this “resolves” the issue on my end, i.e. there is an issue with mesh partitioning (SCOTCH/ParMETIS) and MPICH on conda.
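
A minimal sketch (not part of the original suggestion) of how to check, from inside the new environment, which MPI library mpi4py is actually linked against, assuming mpi4py imports there; this confirms whether the switch from MPICH to Open MPI really took effect:

# Sketch: print the MPI library mpi4py was built against.
# Running this serially is enough for the check.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. "Open MPI v4.1.6, ..."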

Hi,
Thank you for your reply and your effort!
As you proposed, I created the new environment, used your first code, and compared the timings with 1 and 4 processes.
Serial Case:

MESH CREATED 0, num owned cells 750000 num ghost cells 0 time : 1.8237453778274357

[MPI_MAX] Summary of timings                                                |  reps  wall avg  wall tot
-------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  1.821910  1.821910
Build dofmap data                                                           |     1  0.054585  0.054585
Compute dof reordering map                                                  |     1  0.005850  0.005850
Compute local part of mesh dual graph                                       |     1  0.624141  0.624141
Compute local-to-local map                                                  |     1  0.018537  0.018537
Compute-local-to-global links for global/local adjacency list               |     1  0.007299  0.007299
Distribute row-wise data (scalable)                                         |     1  0.006210  0.006210
GPS: create_level_structure                                                 |     2  0.019797  0.039595
Gibbs-Poole-Stockmeyer ordering                                             |     1  0.121966  0.121966
Init dofmap from element dofmap                                             |     1  0.043169  0.043169
Topology: create                                                            |     1  0.696700  0.696700
Topology: determine shared index ownership                                  |     1  0.002529  0.002529
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1  0.097153  0.097153

Parallel Case:

MESH CREATED 0, num owned cells 188656 num ghost cells 0 time : 27.415021778084338
MESH CREATED 1, num owned cells 184046 num ghost cells 0 time : 27.415049467235804
MESH CREATED 3, num owned cells 187687 num ghost cells 0 time : 27.415032648015767
MESH CREATED 2, num owned cells 189611 num ghost cells 0 time : 27.41522487672046

[MPI_MAX] Summary of timings                                                |  reps   wall avg   wall tot
---------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  27.413550  27.413550
Build dofmap data                                                           |     1   0.013588   0.013588
Compute dof reordering map                                                  |     1   0.001441   0.001441
Compute graph partition (ParMETIS)                                          |     1  26.818224  26.818224
Compute local part of mesh dual graph                                       |     2   0.148587   0.297173
Compute local-to-local map                                                  |     1   0.003977   0.003977
Compute non-local part of mesh dual graph                                   |     1   0.038450   0.038450
Compute-local-to-global links for global/local adjacency list               |     1   0.001844   0.001844
Distribute AdjacencyList nodes to destination ranks                         |     1   0.084880   0.084880
Distribute row-wise data (scalable)                                         |     1   0.004141   0.004141
GPS: create_level_structure                                                 |     2   0.002291   0.013194
Gibbs-Poole-Stockmeyer ordering                                             |     1   0.024049   0.024049
Init dofmap from element dofmap                                             |     1   0.010204   0.010204
ParMETIS: call ParMETIS_V3_PartKway                                         |     1  26.816365  26.816365
Topology: create                                                            |     1   0.182377   0.182377
Topology: determine shared index ownership                                  |     1   0.037255   0.037255
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1   0.013835   0.013835 

So the problem seems to be with the mesh partitioning, and it somehow still persists…

There is something I noticed:
When using the Docker container, the speedup is as expected.
The dolfinx version is the same, but the petsc4py version is different (conda: 3.20.3, Docker: 3.20.0).
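
(A minimal sketch of the kind of version comparison meant here; run it once in the conda environment and once in the Docker container:)

# Sketch: print component versions to compare the conda and Docker setups
import dolfinx
import petsc4py

print("dolfinx :", dolfinx.__version__)
print("petsc4py:", petsc4py.__version__)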

Do you have an idea what I could do?
Thanks a lot

Did you change to Open MPI?
What is the output of mpirun -h?
For me it is:

mpirun (Open MPI) 4.1.6

which runs with no issue for me (using conda on Ubuntu 22.04).
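
As an additional sanity check (a sketch, not from the original exchange): if the mpirun on the PATH belongs to a different MPI than the one mpi4py is linked against, each rank typically ends up alone in its own COMM_WORLD. Launched with mpirun -n 2, something like this should report a size of 2 on both ranks:

# Sketch: verify that the mpirun launcher and the MPI library agree
# (save e.g. as check_comm.py and run with: mpirun -n 2 python check_comm.py)
from mpi4py import MPI

comm = MPI.COMM_WORLD
vendor, version = MPI.get_vendor()  # e.g. ('Open MPI', (4, 1, 6))
print(f"rank {comm.rank} of {comm.size}, vendor: {vendor} {version}")

If both processes report "of 1", the launcher and the library do not match.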

Weird! I am using Ubuntu 20.04 and the MPI version is

mpirun (Open MPI) 4.1.6

so the same as you…

Since you’re running on Ubuntu, consider updating your installation to a more recent Ubuntu release, then see how the parallel run compares using the native Ubuntu packages (from the FEniCS PPA) rather than the conda build.

Thanks a lot!
As everything seems to run well when I am in the Docker container, I will probably stick to that for the moment.
I can’t really update to 22.04 right now as it is a shared workstation, but if that fixes it once I do upgrade, I will post an update here :slight_smile:
