MPI gather-then-sum vs. allreduce for results calculation in DOLFINx

Hello,

I have a general question about MPI communication with DOLFINx and integrated results. In the tutorial on custom Newton solvers, the L2 errors are calculated with mesh.comm.allreduce, like this:

L2_error.append(np.sqrt(mesh.comm.allreduce(dolfinx.fem.assemble_scalar(error), op=MPI.SUM)))

while the flow-past-a-cylinder tutorial calculates lift and drag with mesh.comm.gather plus a manual sum afterwards, like this:

...
drag_coeff = mesh.comm.gather(assemble_scalar(drag), root=0)
lift_coeff = mesh.comm.gather(assemble_scalar(lift), root=0)
... # some code omitted
if mesh.comm.rank == 0:
    C_D[i] = sum(drag_coeff)
    C_L[i] = sum(lift_coeff)

Which method is better? If it depends, which one is preferable if, for example, I want to save C_D and C_L to a text file like this:

# write out data
if mesh.comm.rank == 0:
    dataOut = np.stack((C_D, C_L), axis=-1)
    with open("datafile.txt", "w") as txt_file:
        for line in dataOut:
            txt_file.write(" ".join(line) + "\n") 

Thank you!

If you are using a lot of ranks and you are only interested in writing data out on one process, then reduce with a root, or gather with a root followed by a sum, is most sensible (as it uses less memory).
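
For the C_D/C_L case above, a minimal sketch of reduce with a root could look like this (assuming C_D and C_L are preallocated NumPy arrays and drag and lift are the assembled forms from the tutorial):

drag_sum = mesh.comm.reduce(assemble_scalar(drag), op=MPI.SUM, root=0)
lift_sum = mesh.comm.reduce(assemble_scalar(lift), op=MPI.SUM, root=0)
if mesh.comm.rank == 0:  # reduce returns None on every other rank
    C_D[i] = drag_sum
    C_L[i] = lift_sum

Only rank 0 then holds the accumulated arrays, which matches the if mesh.comm.rank == 0: write-out block in your question.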

It all depends on the MPI implementation (and the number of processes).
You can play around with the performance of this with:


from mpi4py import MPI
import time

comm = MPI.COMM_WORLD

# Give each rank a different value so the reductions combine real data
if comm.rank == 0:
    a = 5
else:
    a = 10

# Time reduce to a single root rank
comm.Barrier()
start = time.perf_counter()
t = comm.reduce(a, op=MPI.SUM, root=0)  # t is None on all ranks but 0
comm.Barrier()
end = time.perf_counter()
print(f"Reduce to one rank {comm.rank=}, {t=} {end-start:.2e}")

# Time allreduce (result available on every rank)
comm.Barrier()
start = time.perf_counter()
s = comm.allreduce(a, op=MPI.SUM)
comm.Barrier()
end = time.perf_counter()

print(f"Allreduce {comm.rank=}, {s=} {end-start:.2e}")

# Time gather to root followed by a manual sum on rank 0
comm.Barrier()
start = time.perf_counter()
v = comm.gather(a, root=0)  # v is a list on rank 0, None elsewhere
if comm.rank == 0:
    v = sum(v)
comm.Barrier()
end = time.perf_counter()

print(f"Gather+sum {comm.rank=}, {v=} {end-start:.2e}")


Because allreduce puts the same value on all processes, it is less likely to cause subtle bugs, so I would prefer it.
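
To see the kind of subtle bug allreduce avoids: with gather (or reduce) the result only exists on the root rank and is None everywhere else, so any rank-dependent use of it (e.g. in a stopping criterion) makes the ranks diverge. A small sketch:

from mpi4py import MPI
comm = MPI.COMM_WORLD

total = comm.gather(1.0, root=0)  # list of contributions on rank 0, None elsewhere
if comm.rank == 0:
    total = sum(total)
# Any later test like `if total < tol:` would crash or branch differently
# on non-root ranks, since total is None there.
total_all = comm.allreduce(1.0, op=MPI.SUM)  # identical value on every rank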

Gather to root followed by a sum is going to be more scalable, but the difference is probably only noticeable on very large runs (thousands of processes, perhaps more).
