However, the computational time increases with the number of processors, and it looks as if the OS is simply running the same serial code several times rather than performing a parallel computation. I think dolfinx should be faster than this, so maybe I ran it in an improper way.
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 2 python3 Demo_LinearElasticity.py
Time = 0.383 (s)
Time = 0.383 (s)
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 4 python3 Demo_LinearElasticity.py
Time = 0.397 (s)
Time = 0.402 (s)
Time = 0.433 (s)
Time = 0.453 (s)
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 8 python3 Demo_LinearElasticity.py
Time = 0.536 (s)
Time = 0.534 (s)
Time = 0.530 (s)
Time = 0.528 (s)
Time = 0.531 (s)
Time = 0.554 (s)
Time = 0.544 (s)
Time = 0.553 (s)
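One quick way to check whether mpirun is really distributing the work (rather than launching independent serial copies) is to print the MPI rank, the communicator size, and the number of locally owned cells on each process. The following is only a diagnostic sketch, assuming the same mpi4py/dolfinx stack and the same BoxMesh call as in the demo script:

# Hypothetical diagnostic: confirm the ranks share one communicator and the mesh is partitioned.
from mpi4py import MPI
import numpy as np
import dolfinx
from dolfinx.cpp.mesh import CellType

comm = MPI.COMM_WORLD
mesh = dolfinx.BoxMesh(comm, [np.array([0, 0, 0]), np.array([1, 0.2, 0.2])], [20, 6, 6],
                       cell_type=CellType.hexahedron)
tdim = mesh.topology.dim
num_local_cells = mesh.topology.index_map(tdim).size_local  # cells owned by this rank
print(f"rank {comm.rank} of {comm.size}: {num_local_cells} locally owned cells")

If every rank reports a communicator size of 1 and the full cell count, the processes are not actually communicating (often a mismatch between the MPI that mpirun launches and the one mpi4py/dolfinx were built against); with a working setup, the size matches the -n argument and each rank owns only a fraction of the cells.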
I cannot reproduce your issue with a docker container (dolfinx/dolfinx) with the following script:
L = 1
W = 0.2
mu = 1
rho = 1
delta = W/L
gamma = 0.4*delta**2
beta = 1.25
lambda_ = beta
g = gamma
import dolfinx
import numpy as np
from mpi4py import MPI
from dolfinx.cpp.mesh import CellType
import time
mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([L, W, W])], [20,6,6], cell_type=CellType.hexahedron)
V = dolfinx.VectorFunctionSpace(mesh, ("CG", 1))
def clamped_boundary(x):
    return np.isclose(x[0], 0)
fdim = mesh.topology.dim - 1
boundary_facets = dolfinx.mesh.locate_entities_boundary(mesh, fdim, clamped_boundary)
u_D = dolfinx.Function(V)
with u_D.vector.localForm() as loc:
    loc.set(0)
bc = dolfinx.DirichletBC(u_D, dolfinx.fem.locate_dofs_topological(V, fdim, boundary_facets))
T = dolfinx.Constant(mesh, (0, 0, 0))
import ufl
ds = ufl.Measure("ds", domain=mesh)
def epsilon(u):
    return ufl.sym(ufl.grad(u))  # Equivalent to 0.5*(ufl.nabla_grad(u) + ufl.nabla_grad(u).T)
def sigma(u):
    return lambda_ * ufl.nabla_div(u) * ufl.Identity(u.geometric_dimension()) + 2*mu*epsilon(u)
u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
f = dolfinx.Constant(mesh, (0, 0, -rho*g))
a = ufl.inner(sigma(u), epsilon(v)) * ufl.dx
L = ufl.dot(f, v) * ufl.dx + ufl.dot(T, v) * ds
problem = dolfinx.fem.LinearProblem(a, L, bcs=[bc], petsc_options={"ksp_type": "preonly", "pc_type": "lu"})
start = time.time()
uh = problem.solve()
end = time.time()
print(f'{MPI.COMM_WORLD.rank}: Time = {end-start:.3f} (s)')
and output:
root@feb3f65a1cf3:/home/shared# mpirun -n 1 python3 linearelasticity_code.py
0: Time = 0.315 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 2 python3 linearelasticity_code.py
0: Time = 0.115 (s)
1: Time = 0.115 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 3 python3 linearelasticity_code.py
0: Time = 0.094 (s)
1: Time = 0.094 (s)
2: Time = 0.094 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 4 python3 linearelasticity_code.py
0: Time = 0.081 (s)
1: Time = 0.081 (s)
2: Time = 0.081 (s)
3: Time = 0.081 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 8 python3 linearelasticity_code.py
0: Time = 0.060 (s)
1: Time = 0.060 (s)
2: Time = 0.060 (s)
3: Time = 0.060 (s)
4: Time = 0.060 (s)
5: Time = 0.060 (s)
6: Time = 0.060 (s)
7: Time = 0.060 (s)
Thanks, Dokken. I deleted my old container and started a new one, and now it works. However, my improvement (from 0.36 s to 0.14 s) is not as large as yours (from 0.3 s to 0.06 s); maybe that is due to differences between our computers.
I'm using a desktop computer with 64 GB of RAM and 16 processes. For further speedups, I would suggest changing from a direct to an iterative solver (especially if you increase the number of dofs).
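For example, here is a minimal sketch of that change, reusing the LinearProblem call from the script above and swapping the direct LU solve for a conjugate-gradient solver with an algebraic multigrid preconditioner; the specific PETSc options are just one reasonable choice, not a prescription:

# Hypothetical variant of the LinearProblem above: iterative Krylov solver instead of direct LU.
problem = dolfinx.fem.LinearProblem(
    a, L, bcs=[bc],
    petsc_options={
        "ksp_type": "cg",    # conjugate gradient (the elasticity system is symmetric positive definite)
        "pc_type": "gamg",   # PETSc's algebraic multigrid preconditioner
        "ksp_rtol": 1e-8,    # relative residual tolerance
    },
)
uh = problem.solve()

For elasticity, GAMG typically converges faster if it is also given the rigid-body modes as a near-nullspace, but even the plain options above tend to scale better in parallel than LU once the number of dofs grows.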