One thing you should add to the options is:
"pc_factor_mat_solver_type": "mumps",
to ensure that you use the same solver in serial and parallel.
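With the LinearProblem interface this amounts to passing the following options dictionary (as also done in the full script below):

petsc_options={
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
}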
Secondly, your problem is very small. Ramping up its size:
from dolfinx import mesh, fem, default_scalar_type
from dolfinx.fem.petsc import LinearProblem
from mpi4py import MPI
import ufl
import numpy as np
import time
L = 1
W = 0.2
mu = 1
rho = 1
delta = W / L
gamma = 0.4 * delta**2
beta = 1.25
lambda_ = beta
g = gamma
M = 40
N = 12
domain = mesh.create_box(
    MPI.COMM_WORLD,
    [np.array([0, 0, 0]), np.array([L, W, W])],
    [M, N, N],
    cell_type=mesh.CellType.hexahedron,
)
V = fem.functionspace(domain, ("Lagrange", 1, (domain.geometry.dim,)))
def clamped_boundary(x):
    return np.isclose(x[0], 0)
fdim = domain.topology.dim - 1
boundary_facets = mesh.locate_entities_boundary(domain, fdim, clamped_boundary)
u_D = np.array([0, 0, 0], dtype=default_scalar_type)
bc = fem.dirichletbc(u_D, fem.locate_dofs_topological(V, fdim, boundary_facets), V)
T = fem.Constant(domain, default_scalar_type((0, 0, 0)))
ds = ufl.Measure("ds", domain=domain)
def epsilon(u):
    return ufl.sym(
        ufl.grad(u)
    )  # Equivalent to 0.5*(ufl.nabla_grad(u) + ufl.nabla_grad(u).T)
def sigma(u):
    return lambda_ * ufl.nabla_div(u) * ufl.Identity(len(u)) + 2 * mu * epsilon(u)
u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
f = fem.Constant(domain, default_scalar_type((0, 0, -rho * g)))
a = ufl.inner(sigma(u), epsilon(v)) * ufl.dx
L = ufl.dot(f, v) * ufl.dx + ufl.dot(T, v) * ds
problem = LinearProblem(
    a,
    L,
    bcs=[bc],
    petsc_options={
        "ksp_type": "preonly",
        "pc_type": "lu",
        "pc_factor_mat_solver_type": "mumps",
    },
)
start_time = time.perf_counter()
uh = problem.solve()
end_time = time.perf_counter()
num_dofs_global = V.dofmap.index_map.size_global * V.dofmap.index_map_bs
print(f"{num_dofs_global=} Time = {end_time - start_time:.5f} (s)")
you get:
root@dokken-XPS-9320:~/shared# python3 mwe.py
num_dofs_global=20787 Time = 0.56405 (s)
root@dokken-XPS-9320:~/shared# mpirun -n 2 python3 mwe.py
num_dofs_global=20787 Time = 0.44247 (s)
num_dofs_global=20787 Time = 0.44247 (s)
root@dokken-XPS-9320:~/shared# mpirun -n 4 python3 mwe.py
num_dofs_global=20787 Time = 0.37886 (s)
num_dofs_global=20787 Time = 0.37886 (s)
num_dofs_global=20787 Time = 0.37886 (s)
num_dofs_global=20787 Time = 0.37886 (s)
Further increasing the size of the problem to
M = 80
N = 24
I get
root@dokken-XPS-9320:~/shared# python3 mwe.py
num_dofs_global=151875 Time = 14.51238 (s)
root@dokken-XPS-9320:~/shared# mpirun -n 2 python3 mwe.py
num_dofs_global=151875 Time = 9.24733 (s)
num_dofs_global=151875 Time = 9.24733 (s)
root@dokken-XPS-9320:~/shared# mpirun -n 4 python3 mwe.py
num_dofs_global=151875 Time = 11.76198 (s)
num_dofs_global=151875 Time = 11.76198 (s)
num_dofs_global=151875 Time = 11.76197 (s)
num_dofs_global=151875 Time = 11.76198 (s)
where you see that even with 152 000 dofs, using 4 processes doesn't speed up the code.
This is quite common with HPC/MPI finite element codes: there is quite a large problem-size threshold before one actually benefits from MPI parallelism.
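As a side note, when benchmarking in parallel it can help to synchronize the ranks before starting the timer and to report the maximum solve time over all processes, since the slowest rank determines the wall time. A minimal sketch, reusing problem and MPI.COMM_WORLD from the script above:

import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Synchronize all ranks so earlier load imbalance does not skew the timing
comm.Barrier()
start_time = time.perf_counter()
uh = problem.solve()
end_time = time.perf_counter()

# The slowest rank determines the observed wall time
max_time = comm.allreduce(end_time - start_time, op=MPI.MAX)
if comm.rank == 0:
    print(f"Solve time (max over {comm.size} ranks): {max_time:.5f} (s)")

One can also inspect V.dofmap.index_map.size_local * V.dofmap.index_map_bs on each rank to see how many degrees of freedom every process actually owns.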