How to further speed up the calculation in dolfinx?

I am trying to run the default tutorial "The equations of linear elasticity — FEniCS-X tutorial" in the docker container, and I want to measure the computational time of the solver using the time package.

import time

# Time both the problem setup and the solve
start_time = time.time()
problem = dolfinx.fem.LinearProblem(a, L, bcs=[bc], petsc_options={"ksp_type": "preonly", "pc_type": "lu"})
uh = problem.solve()
print('Time = %.3f (s)' % (time.time() - start_time))

I run this code with the command:

mpirun --allow-run-as-root -n 2 python3 Demo_LinearElasticity.py

However, the computational time increases with the number of processors, and it seems the OS just runs the same code several times instead of computing in parallel. I think dolfinx should be faster, so maybe I ran it in an improper way.

root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 2 python3 Demo_LinearElasticity.py
Time = 0.383 (s)
Time = 0.383 (s)

root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 4 python3 Demo_LinearElasticity.py
Time = 0.397 (s)
Time = 0.402 (s)
Time = 0.433 (s)
Time = 0.453 (s)

root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 8 python3 Demo_LinearElasticity.py
Time = 0.536 (s)
Time = 0.534 (s)
Time = 0.530 (s)
Time = 0.528 (s)
Time = 0.531 (s)
Time = 0.554 (s)
Time = 0.544 (s)
Time = 0.553 (s)

I cannot reproduce your issue using a docker container (dolfinx/dolfinx) with the following script:

L = 1
W = 0.2
mu = 1
rho = 1
delta = W/L
gamma = 0.4*delta**2
beta = 1.25
lambda_ = beta
g = gamma

import dolfinx
import numpy as np
from mpi4py import MPI
from dolfinx.cpp.mesh import CellType
import time

mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([L, W, W])], [20,6,6], cell_type=CellType.hexahedron)
V = dolfinx.VectorFunctionSpace(mesh, ("CG", 1))


def clamped_boundary(x):
    return np.isclose(x[0], 0)

fdim = mesh.topology.dim - 1
boundary_facets = dolfinx.mesh.locate_entities_boundary(mesh, fdim, clamped_boundary)

u_D = dolfinx.Function(V)
with u_D.vector.localForm() as loc:
    loc.set(0)
bc = dolfinx.DirichletBC(u_D, dolfinx.fem.locate_dofs_topological(V, fdim, boundary_facets))

T = dolfinx.Constant(mesh, (0, 0, 0))

import ufl
ds = ufl.Measure("ds", domain=mesh)

def epsilon(u):
    return ufl.sym(ufl.grad(u)) # Equivalent to 0.5*(ufl.nabla_grad(u) + ufl.nabla_grad(u).T)
def sigma(u):
    return lambda_ * ufl.nabla_div(u) * ufl.Identity(u.geometric_dimension()) + 2*mu*epsilon(u)

u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
f = dolfinx.Constant(mesh, (0, 0, -rho*g))
a = ufl.inner(sigma(u), epsilon(v)) * ufl.dx
L = ufl.dot(f, v) * ufl.dx + ufl.dot(T, v) * ds

problem = dolfinx.fem.LinearProblem(a, L, bcs=[bc], petsc_options={"ksp_type": "preonly", "pc_type": "lu"})
start = time.time()
uh = problem.solve()
end = time.time()
print(f'{MPI.COMM_WORLD.rank}: Time = {end-start:.3f} (s)')

and output:

root@feb3f65a1cf3:/home/shared# mpirun -n 1 python3 linearelasticity_code.py 
0: Time = 0.315 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 2 python3 linearelasticity_code.py 
0: Time = 0.115 (s)
1: Time = 0.115 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 3 python3 linearelasticity_code.py 
0: Time = 0.094 (s)
1: Time = 0.094 (s)
2: Time = 0.094 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 4 python3 linearelasticity_code.py 
0: Time = 0.081 (s)
1: Time = 0.081 (s)
2: Time = 0.081 (s)
3: Time = 0.081 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 8 python3 linearelasticity_code.py 
0: Time = 0.060 (s)
1: Time = 0.060 (s)
2: Time = 0.060 (s)
3: Time = 0.060 (s)
4: Time = 0.060 (s)
5: Time = 0.060 (s)
6: Time = 0.060 (s)
7: Time = 0.060 (s)
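
If the timings on your side still do not change with the number of processes, it is worth checking that the mesh is actually partitioned across the ranks rather than each process running a serial copy. Here is a minimal check you could append to the script above (a sketch, assuming the dolfinx version in the dolfinx/dolfinx image exposes mesh.topology.index_map as in recent releases):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Cells owned by this rank; with a partitioned mesh this should be roughly
# (total number of cells) / (number of processes).
num_local_cells = mesh.topology.index_map(mesh.topology.dim).size_local
print(f"rank {comm.rank} of {comm.size}: owns {num_local_cells} cells")

If every rank reports all 720 cells of the 20x6x6 hexahedral mesh, the processes are not sharing a communicator and MPI is effectively running independent serial jobs.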

Thanks, Dokken. I deleted my old container and started a new one, and now it works. But my improvement (from 0.36 s to 0.14 s) is not as large as yours (from 0.3 s to 0.06 s); maybe it is due to the difference between our computers.

I'm using a desktop computer with 64 GB RAM and 16 cores. For further speedups, I would suggest changing from a direct to an iterative solver (especially if you increase the number of DOFs).
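
A minimal sketch of what that change could look like, keeping the rest of the script unchanged (the option names are standard PETSc settings; the tolerance is illustrative, not tuned):

# Replace the direct LU solve with a Krylov solver and an algebraic
# multigrid preconditioner, which scales better for large elasticity problems.
problem = dolfinx.fem.LinearProblem(
    a, L, bcs=[bc],
    petsc_options={
        "ksp_type": "cg",    # conjugate gradient; the clamped elasticity system is symmetric positive definite
        "pc_type": "gamg",   # PETSc's algebraic multigrid preconditioner
        "ksp_rtol": 1e-8,    # illustrative relative tolerance
    },
)
uh = problem.solve()

For larger meshes, supplying the rigid-body modes as a near-nullspace to GAMG helps the preconditioner further, but that goes beyond this snippet.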
