# MPI gather-then-sum vs. allreduce for results calculation in DOLFINx

Hello,

I have a general question about MPI communication with DOLFINx and integrated results. In the tutorial on custom Newton solvers, the L2 errors are computed with `mesh.comm.allreduce`, like this: `L2_error.append(np.sqrt(mesh.comm.allreduce(dolfinx.fem.assemble_scalar(error), op=MPI.SUM)))`, while the flow-past-a-cylinder tutorial computes lift and drag with `mesh.comm.gather` followed by a manual sum on the root rank:

```python
...
drag_coeff = mesh.comm.gather(assemble_scalar(drag), root=0)
lift_coeff = mesh.comm.gather(assemble_scalar(lift), root=0)
...  # some code omitted
if mesh.comm.rank == 0:
    C_D[i] = sum(drag_coeff)
    C_L[i] = sum(lift_coeff)
```

Which method is better? If it depends, which one is preferable if I want to, for example, save `C_D` and `C_L` to a text file like this:

```python
# write out data
if mesh.comm.rank == 0:
    dataOut = np.stack((C_D, C_L), axis=-1)
    with open("datafile.txt", "w") as txt_file:
        for line in dataOut:
            txt_file.write(" ".join(map(str, line)) + "\n")
```

Thank you!

If you are using a lot of ranks and only need the result on one process for writing it out, `reduce` with a root, or `gather` with a root followed by a sum, is the most sensible choice, as it uses less memory.

It all depends on the MPI implementation (and the number of processes).
You can experiment with the relative performance of the three approaches using:

```python
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD

if comm.rank == 0:
    a = 5
else:
    a = 10

comm.Barrier()
start = time.perf_counter()
t = comm.reduce(a, op=MPI.SUM, root=0)
comm.Barrier()
end = time.perf_counter()
print(f"Reduce to one rank {comm.rank=}, {t=} {end-start:.2e}")

comm.Barrier()
start = time.perf_counter()
s = comm.allreduce(a, op=MPI.SUM)
comm.Barrier()
end = time.perf_counter()

print(f"Allreduce {comm.rank=}, {s=} {end-start:.2e}")

comm.Barrier()
start = time.perf_counter()
v = comm.gather(a, root=0)
if comm.rank == 0:
    v = sum(v)
comm.Barrier()
end = time.perf_counter()

print(f"Gather+sum {comm.rank=}, {v=} {end-start:.2e}")
```

Because `allreduce` puts the same value on all processes, it is less likely to cause subtle bugs, and therefore I would prefer it.

Gather on root followed by a sum is going to scale better, but the difference is probably only noticeable on very large runs (thousands of processes, perhaps more).
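As a side note on the file-writing part of your question: since `C_D` and `C_L` only need to exist on rank 0 for the write-out, either collective works there, and `numpy.savetxt` can replace the manual join loop. A sketch with made-up data standing in for your coefficient arrays:

```python
import numpy as np

# Made-up stand-ins for the C_D and C_L arrays from the question.
C_D = np.array([3.14, 3.15, 3.16])
C_L = np.array([0.98, 0.99, 1.00])

# Two columns, one row per time step -- equivalent to the
# " ".join(...) loop, but savetxt handles the float formatting.
dataOut = np.stack((C_D, C_L), axis=-1)
np.savetxt("datafile.txt", dataOut, delimiter=" ")
```

On a multi-rank run you would guard this with `if mesh.comm.rank == 0:` exactly as in your snippet.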
