Parallel slower runtime, MPI Anaconda

Continuing the discussion from Parallel slower runtime:

Hello everyone,
I am running some code using dolfinx and am trying to figure out how it scales.
I am using the code from Parallel slower runtime - #2 by dokken
in a freshly built Anaconda environment, set up as proposed at https://fenicsproject.org/download/

When doing just that, my code gets significantly slower the more MPI processes I use…
Here is the adapted code:

import numpy as np
from dolfinx import cpp, mesh
from mpi4py import MPI
import time


L, W, H = 10, 10, 10
NX, NY, NZ = 100, 100, 100

# MESH CREATION
points = [np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain = mesh.create_box(
    MPI.COMM_WORLD,
    points,
    [NX, NY, NZ],
    cell_type=cpp.mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")

And these are the timing results:

mpirun -n 1 python parallelTest.py 
MESH CREATED 0, num owned cells 1000000 num ghost cells 0 time : 4.300701394677162

mpirun -n 2 python parallelTest.py 
MESH CREATED 0, num owned cells 500000 num ghost cells 10000 time : 39.404316327068955
MESH CREATED 1, num owned cells 500000 num ghost cells 10000 time : 39.40367920091376

mpirun -n 4 python parallelTest.py 
MESH CREATED 3, num owned cells 251000 num ghost cells 10170 time : 75.6433406448923
MESH CREATED 2, num owned cells 250539 num ghost cells 10257 time : 75.64872138295323
MESH CREATED 0, num owned cells 249461 num ghost cells 10040 time : 75.64881594991311
MESH CREATED 1, num owned cells 249000 num ghost cells 10130 time : 75.64861655794084

I am a bit unsure how to solve that problem and would be super grateful for advice!

Doing some debugging, it seems like the issue is the mesh partitioner ParMETIS:

import numpy as np
from mpi4py import MPI

from dolfinx import mesh, log, common
import time
log.set_log_level(log.LogLevel.INFO)

N = 50
MPI.COMM_WORLD.Barrier()
start = time.perf_counter()
domain = mesh.create_unit_cube(
    MPI.COMM_WORLD,
    N, N, N,
    ghost_mode=mesh.GhostMode.none,
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")


common.list_timings(domain.comm, [common.TimingType.wall])

Serial timings:

[MPI_MAX] Summary of timings                                                |  reps  wall avg  wall tot
-------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  0.985317  0.985317
Build dofmap data                                                           |     1  0.021822  0.021822
Compute dof reordering map                                                  |     1  0.002875  0.002875
Compute local part of mesh dual graph                                       |     1  0.329196  0.329196
Compute local-to-local map                                                  |     1  0.012753  0.012753
Compute-local-to-global links for global/local adjacency list               |     1  0.003820  0.003820
Distribute row-wise data (scalable)                                         |     1  0.001776  0.001776
GPS: create_level_structure                                                 |     2  0.007475  0.014949
Gibbs-Poole-Stockmeyer ordering                                             |     1  0.056955  0.056955
Init dofmap from element dofmap                                             |     1  0.016402  0.016402
Topology: create                                                            |     1  0.452937  0.452937
Topology: determine shared index ownership                                  |     1  0.000697  0.000697
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1  0.039235  0.039235

Parallel timings:

Build BoxMesh                                                               |     1  19.446675  19.446675
Build dofmap data                                                           |     1   0.012467   0.012467
Compute dof reordering map                                                  |     1   0.001648   0.001648
Compute graph partition (ParMETIS)                                          |     1  18.675038  18.675038
Compute local part of mesh dual graph                                       |     2   0.190864   0.381727
Compute local-to-local map                                                  |     1   0.006359   0.006359
Compute non-local part of mesh dual graph                                   |     1   0.012398   0.012398
Compute-local-to-global links for global/local adjacency list               |     1   0.002430   0.002430
Distribute AdjacencyList nodes to destination ranks                         |     1   0.052734   0.052734
Distribute row-wise data (scalable)                                         |     1   0.003325   0.003325
GPS: create_level_structure                                                 |     2   0.002957   0.008260
Gibbs-Poole-Stockmeyer ordering                                             |     1   0.022690   0.022690
Init dofmap from element dofmap                                             |     1   0.008993   0.008993
ParMETIS: call ParMETIS_V3_PartKway                                         |     1  18.672538  18.672538
Topology: create                                                            |     1   0.252157   0.252157
Topology: determine shared index ownership                                  |     1   0.010045   0.010045
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1   0.019589   0.019589

There seems to be a similar issue with SCOTCH:

import numpy as np
from mpi4py import MPI

from dolfinx import mesh, log, common
import time

from dolfinx.graph import partitioner_parmetis, partitioner_scotch
log.set_log_level(log.LogLevel.INFO)
N = 50
MPI.COMM_WORLD.Barrier()
start = time.perf_counter()
domain = mesh.create_unit_cube(
    MPI.COMM_WORLD,
    N, N, N,
    ghost_mode=mesh.GhostMode.none,
    partitioner=mesh.create_cell_partitioner(partitioner_scotch()),
)
end = time.perf_counter()
dim = domain.topology.dim
imap = domain.topology.index_map(dim)
num_cells = imap.size_local
ghost_cells = imap.num_ghosts
print(f"MESH CREATED {domain.comm.rank}, num owned cells {num_cells} num ghost cells {ghost_cells} time : {end-start}")


common.list_timings(domain.comm, [common.TimingType.wall])

Could you try to create a new environment with:

conda install -c conda-forge fenics-dolfinx openmpi pyvista

as this “resolves” the issue on my end, i.e. there is an issue with mesh partitioning (SCOTCH/ParMETIS) and MPICH on conda.
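
A minimal sketch (not part of the original suggestion) of how to check, from inside the new environment, which MPI library mpi4py is actually linked against, assuming mpi4py imports there; this confirms whether the switch from MPICH to Open MPI really took effect:

# Sketch: print the MPI library mpi4py was built against.
# Running this serially is enough for the check.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. "Open MPI v4.1.6, ..."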

Hi,
Thank you for your reply and your effort!
As you proposed, I created the new environment, used your first code, and compared the timings with 1 and 4 processes.
Serial Case:

MESH CREATED 0, num owned cells 750000 num ghost cells 0 time : 1.8237453778274357

[MPI_MAX] Summary of timings                                                |  reps  wall avg  wall tot
-------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  1.821910  1.821910
Build dofmap data                                                           |     1  0.054585  0.054585
Compute dof reordering map                                                  |     1  0.005850  0.005850
Compute local part of mesh dual graph                                       |     1  0.624141  0.624141
Compute local-to-local map                                                  |     1  0.018537  0.018537
Compute-local-to-global links for global/local adjacency list               |     1  0.007299  0.007299
Distribute row-wise data (scalable)                                         |     1  0.006210  0.006210
GPS: create_level_structure                                                 |     2  0.019797  0.039595
Gibbs-Poole-Stockmeyer ordering                                             |     1  0.121966  0.121966
Init dofmap from element dofmap                                             |     1  0.043169  0.043169
Topology: create                                                            |     1  0.696700  0.696700
Topology: determine shared index ownership                                  |     1  0.002529  0.002529
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1  0.097153  0.097153

Parallel Case:

MESH CREATED 0, num owned cells 188656 num ghost cells 0 time : 27.415021778084338
MESH CREATED 1, num owned cells 184046 num ghost cells 0 time : 27.415049467235804
MESH CREATED 3, num owned cells 187687 num ghost cells 0 time : 27.415032648015767
MESH CREATED 2, num owned cells 189611 num ghost cells 0 time : 27.41522487672046

[MPI_MAX] Summary of timings                                                |  reps   wall avg   wall tot
---------------------------------------------------------------------------------------------------------
Build BoxMesh                                                               |     1  27.413550  27.413550
Build dofmap data                                                           |     1   0.013588   0.013588
Compute dof reordering map                                                  |     1   0.001441   0.001441
Compute graph partition (ParMETIS)                                          |     1  26.818224  26.818224
Compute local part of mesh dual graph                                       |     2   0.148587   0.297173
Compute local-to-local map                                                  |     1   0.003977   0.003977
Compute non-local part of mesh dual graph                                   |     1   0.038450   0.038450
Compute-local-to-global links for global/local adjacency list               |     1   0.001844   0.001844
Distribute AdjacencyList nodes to destination ranks                         |     1   0.084880   0.084880
Distribute row-wise data (scalable)                                         |     1   0.004141   0.004141
GPS: create_level_structure                                                 |     2   0.002291   0.013194
Gibbs-Poole-Stockmeyer ordering                                             |     1   0.024049   0.024049
Init dofmap from element dofmap                                             |     1   0.010204   0.010204
ParMETIS: call ParMETIS_V3_PartKway                                         |     1  26.816365  26.816365
Topology: create                                                            |     1   0.182377   0.182377
Topology: determine shared index ownership                                  |     1   0.037255   0.037255
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1   0.013835   0.013835 

So the problem seems to be with the mesh partitioning, and it somehow still persists…

There is something I noticed:
When using the Docker container, the speedup is as expected.
The dolfinx version is the same, but the petsc4py version is different (conda: 3.20.3, Docker: 3.20.0).
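
(A minimal sketch of the kind of version comparison meant here; run it once in the conda environment and once in the Docker container:)

# Sketch: print component versions to compare the conda and Docker setups
import dolfinx
import petsc4py

print("dolfinx :", dolfinx.__version__)
print("petsc4py:", petsc4py.__version__)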

Do you have an idea what I could do?
Thanks a lot

Did you change to Open MPI?
What is the output of mpirun -h?
For me it is:

mpirun (Open MPI) 4.1.6

which runs with no issue for me (using conda on Ubuntu 22.04).
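
As an additional sanity check (a sketch, not from the original exchange): if the mpirun on the PATH belongs to a different MPI than the one mpi4py is linked against, each rank typically ends up alone in its own COMM_WORLD. Launched with mpirun -n 2, something like this should report a size of 2 on both ranks:

# Sketch: verify that the mpirun launcher and the MPI library agree
# (save e.g. as check_comm.py and run with: mpirun -n 2 python check_comm.py)
from mpi4py import MPI

comm = MPI.COMM_WORLD
vendor, version = MPI.get_vendor()  # e.g. ('Open MPI', (4, 1, 6))
print(f"rank {comm.rank} of {comm.size}, vendor: {vendor} {version}")

If both processes report "of 1", the launcher and the library do not match.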

Weird! I am using Ubuntu 20.04 and the MPI version is

mpirun (Open MPI) 4.1.6

so the same as you…

Since you’re running on Ubuntu, consider updating your installation to a more recent Ubuntu release, then see how the parallel run compares using the native Ubuntu packages (from the FEniCS PPA) rather than the conda build.

Thanks a lot!
As everything seems to run well when I am in the Docker container, I will probably stick to that for the moment.
I can’t really update to 22.04 right now as it is a shared workstation, but if that fixes it once I do upgrade, I will post an update here :slight_smile:
