Parallel much slower when creating high-order functionspace

Hello everyone,

When creating a toy high-order Lagrange function space, I find that my code gets significantly slower as the number of MPI processes increases. The same thing does not happen for a low-order function space. Here is my script:

import numpy as np
from dolfinx import mesh, fem
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD

L, W, H = 10, 10, 10        # box dimensions
NX, NY, NZ = 100, 100, 100  # number of cells in each direction

points = [np.array([0, 0, 0]), np.array([L, W, H])]

start = time.perf_counter()
domain = mesh.create_box(
    comm,
    points,
    [NX, NY, NZ],
    cell_type=mesh.CellType.hexahedron,
    ghost_mode=mesh.GhostMode.shared_facet,
)
end_0 = time.perf_counter()
V = fem.functionspace(domain, ("Lagrange", 1))
end_1 = time.perf_counter()

owned_dof_num = V.dofmap.index_map.size_local
ghost_dof_num = V.dofmap.index_map.num_ghosts

starts = comm.gather(start, root=0)
end_0s = comm.gather(end_0, root=0)
end_1s = comm.gather(end_1, root=0)
owned_dof_nums = comm.gather(owned_dof_num, root=0)
ghost_dof_nums = comm.gather(ghost_dof_num, root=0)

if comm.rank == 0:
    print(f"average # of owned dofs {sum(owned_dof_nums)/comm.size} average # of ghost dofs {sum(ghost_dof_nums)/comm.size} average mesh time {(sum(end_0s)-sum(starts))/comm.size} average functionspace time {(sum(end_1s)-sum(end_0s))/comm.size}")

When the order is 1, the results are:

$ mpirun -np 1 python try.py 
average owned dofs 1030301.0 average ghost dofs 0.0 average meshing time 3.988608295097947 average functionspace time 0.12935276422649622
$ mpirun -np 2 python try.py 
average owned dofs 515150.5 average ghost dofs 15301.5 average meshing time 3.443682523444295 average functionspace time 0.07836870476603508
$ mpirun -np 4 python try.py 
average owned dofs 257575.25 average ghost dofs 17590.5 average meshing time 1.7969944467768073 average functionspace time 0.04848852753639221
$ mpirun -np 8 python try.py 
average owned dofs 128787.625 average ghost dofs 13583.125 average meshing time 0.9688909612596035 average functionspace time 0.03316149767488241

Everything scales well. However, when the Lagrange order is set to 4 (I also reduce the mesh resolution to keep the number of dofs at the same order of magnitude and avoid memory effects):

NX, NY, NZ = 30, 30, 30
...
V = fem.functionspace(domain, ("Lagrange", 4))

The results are:

$ mpirun -np 1 python try.py 
average owned dofs 1771561.0 average ghost dofs 0.0 average meshing time 0.07663028221577406 average functionspace time 0.1555685205385089
$ mpirun -np 2 python try.py 
average owned dofs 885780.5 average ghost dofs 65884.5 average meshing time 0.07683719042688608 average functionspace time 5.366741458885372
$ mpirun -np 4 python try.py 
average owned dofs 442890.25 average ghost dofs 68716.75 average meshing time 0.07153452932834625 average functionspace time 9.33089439664036
$ mpirun -np 8 python try.py 
average owned dofs 221445.125 average ghost dofs 54979.875 average meshing time 0.061317757703363895 average functionspace time 14.544806653633714

I think this is partly due to the increase in ghost dofs for the high-order function space. However, should the overhead really be this large?
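As a rough illustration of why the ghost layer is thicker at higher order (my own back-of-the-envelope arithmetic, not output from the script above), the number of Lagrange dofs per hexahedral cell and per shared quadrilateral facet grows quickly with the degree:

# Lagrange dofs per hexahedral cell and per quadrilateral facet, for the two
# degrees compared above (simple tensor-product node counts).
for p in (1, 4):
    dofs_per_cell = (p + 1) ** 3   # nodes of a degree-p Lagrange hexahedron
    dofs_per_facet = (p + 1) ** 2  # nodes on one quadrilateral facet
    print(f"degree {p}: {dofs_per_cell} dofs per cell, {dofs_per_facet} per facet")

So each cell or facet near a partition boundary carries far more dofs at order 4 (125 per cell, 25 per facet) than at order 1 (8 per cell, 4 per facet), even though the total dof count is kept comparable.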

I used breakpoints to monitor the time spent in each part of the code, and something odd shows up. The main parallel overhead originates from two places in def functionspace(...) in fem/function.py:

The first is basix.ufl.element(). I tested this function in isolation and found that every function associated with the FiniteElement class in basix/finite_element.py (including basix.create_custom_element(), basix.create_element(), basix.ufl.custom_element(), and so on) shows a significant increase in overhead when run in parallel, and this traces back to the _create_custom_element and _create_element nanobind functions.
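A minimal sketch of how the element construction can be timed on its own, independent of the mesh (a reconstruction for illustration, not my exact test script; it assumes the standard basix.ufl.element signature):

import time
import basix.ufl
from mpi4py import MPI

comm = MPI.COMM_WORLD
comm.Barrier()                                     # start all ranks together
t0 = time.perf_counter()
e = basix.ufl.element("Lagrange", "hexahedron", 4)
t1 = time.perf_counter()

times = comm.gather(t1 - t0, root=0)
if comm.rank == 0:
    print(f"basix.ufl.element: avg {sum(times) / comm.size:.4f} s, max {max(times):.4f} s")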

The second is the code snippet:

ffi = module.ffi
if np.issubdtype(dtype, np.float32):
    cpp_element = _cpp.fem.FiniteElement_float32(
        ffi.cast("uintptr_t", ffi.addressof(ufcx_element))
    )
elif np.issubdtype(dtype, np.float64):
    cpp_element = _cpp.fem.FiniteElement_float64(
        ffi.cast("uintptr_t", ffi.addressof(ufcx_element))
    )
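To check whether the cost really sits in these steps rather than in ranks arriving at different times, the timing can be barrier-synchronized; a sketch of such a cross-check (not what I originally ran):

# Barrier-synchronized variant of the timing in the script above, so that
# load imbalance from mesh creation is not attributed to the function space step.
import time
import numpy as np
from mpi4py import MPI
from dolfinx import mesh, fem

comm = MPI.COMM_WORLD
points = [np.array([0, 0, 0]), np.array([10, 10, 10])]

comm.Barrier()                    # all ranks start together
start = time.perf_counter()
domain = mesh.create_box(comm, points, [30, 30, 30],
                         cell_type=mesh.CellType.hexahedron,
                         ghost_mode=mesh.GhostMode.shared_facet)
comm.Barrier()                    # separate mesh imbalance from the next step
end_0 = time.perf_counter()
V = fem.functionspace(domain, ("Lagrange", 4))
comm.Barrier()
end_1 = time.perf_counter()

if comm.rank == 0:
    print(f"mesh {end_0 - start:.3f} s, functionspace {end_1 - end_0:.3f} s")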

I’m now going to check the C++ code and find out what is happening. Has anyone else encountered this issue?

I can’t replicate this on my installation (built from source, main branch as of yesterday) on a laptop with 8 cores.

degree 4, number of elements equal to 30^3

$ mpirun -n 1 --oversubscribe python3 tmp.py 
average # of owned dofs 1771561.0 average # of ghost dofs 0.0 average mesh time 0.06909881900014625 average functionspace time 0.21692257099994094
$ mpirun -n 2 --oversubscribe python3 tmp.py 
average # of owned dofs 885780.5 average # of ghost dofs 65884.5 average mesh time 0.06259667700010141 average functionspace time 0.27963031650006087
$ mpirun -n 4 --oversubscribe python3 tmp.py 
average # of owned dofs 442890.25 average # of ghost dofs 68827.75 average mesh time 0.05672628475008423 average functionspace time 0.405795280250004
$ mpirun -n 8 --oversubscribe python3 tmp.py 
average # of owned dofs 221445.125 average # of ghost dofs 55010.375 average mesh time 0.06941070299990315 average functionspace time 0.47828329124990887
$ mpirun -n 16 --oversubscribe python3 tmp.py 
average # of owned dofs 110722.5625 average # of ghost dofs 44293.4375 average mesh time 0.4474661230001402 average functionspace time 1.3525908556873674
$ mpirun -n 32 --oversubscribe python3 tmp.py 
average # of owned dofs 55361.28125 average # of ghost dofs 31205.71875 average mesh time 0.7125168814063727 average functionspace time 2.436406357499891

Most likely this was either addressed in the meantime, or heavily depends on how you installed dolfinx.
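If you want to compare setups, a quick sketch for printing which build each rank picks up (using the standard dolfinx.__version__ attribute):

import dolfinx
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Report the dolfinx version seen by every rank, to rule out mixed installations
print(f"rank {comm.rank}: dolfinx {dolfinx.__version__}")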


Thanks a lot. My system is Ubuntu 20.04, and I installed dolfinx 0.8.0 using conda. I’m going to install the latest version instead.

The problem was solved after I installed the latest dolfinx version from conda (10/03/2024). Thank you again.