However, the computational time increases with the number of processors, and it looks as if the OS is simply running the same serial code several times rather than performing a parallel computation. I think dolfinx should be faster than this, so maybe I ran it in an improper way.
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 2 python3 Demo_LinearElasticity.py
Time = 0.383 (s)
Time = 0.383 (s)
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 4 python3 Demo_LinearElasticity.py
Time = 0.397 (s)
Time = 0.402 (s)
Time = 0.433 (s)
Time = 0.453 (s)
root@08637eaf5b16:/shared# mpirun --allow-run-as-root -n 8 python3 Demo_LinearElasticity.py
Time = 0.536 (s)
Time = 0.534 (s)
Time = 0.530 (s)
Time = 0.528 (s)
Time = 0.531 (s)
Time = 0.554 (s)
Time = 0.544 (s)
Time = 0.553 (s)
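One quick way to check whether mpirun is really distributing the work (rather than launching independent serial copies) is to print the MPI rank, the communicator size, and the number of locally owned cells on each process. The following is only a diagnostic sketch, assuming the same mpi4py/dolfinx stack and the same BoxMesh call as in the demo script:

# Hypothetical diagnostic: confirm the ranks share one communicator and the mesh is partitioned.
from mpi4py import MPI
import numpy as np
import dolfinx
from dolfinx.cpp.mesh import CellType

comm = MPI.COMM_WORLD
mesh = dolfinx.BoxMesh(comm, [np.array([0, 0, 0]), np.array([1, 0.2, 0.2])], [20, 6, 6],
                       cell_type=CellType.hexahedron)
tdim = mesh.topology.dim
num_local_cells = mesh.topology.index_map(tdim).size_local  # cells owned by this rank
print(f"rank {comm.rank} of {comm.size}: {num_local_cells} locally owned cells")

If every rank reports a communicator size of 1 and the full cell count, the processes are not actually communicating (often a mismatch between the MPI that mpirun launches and the one mpi4py/dolfinx were built against); with a working setup, the size matches the -n argument and each rank owns only a fraction of the cells.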
I cannot reproduce your issue with a docker container (dolfinx/dolfinx) with the following script:
L = 1
W = 0.2
mu = 1
rho = 1
delta = W/L
gamma = 0.4*delta**2
beta = 1.25
lambda_ = beta
g = gamma
import dolfinx
import numpy as np
from mpi4py import MPI
from dolfinx.cpp.mesh import CellType
import time
mesh = dolfinx.BoxMesh(MPI.COMM_WORLD, [np.array([0,0,0]), np.array([L, W, W])], [20,6,6], cell_type=CellType.hexahedron)
V = dolfinx.VectorFunctionSpace(mesh, ("CG", 1))
def clamped_boundary(x):
    return np.isclose(x[0], 0)
fdim = mesh.topology.dim - 1
boundary_facets = dolfinx.mesh.locate_entities_boundary(mesh, fdim, clamped_boundary)
u_D = dolfinx.Function(V)
with u_D.vector.localForm() as loc:
    loc.set(0)
bc = dolfinx.DirichletBC(u_D, dolfinx.fem.locate_dofs_topological(V, fdim, boundary_facets))
T = dolfinx.Constant(mesh, (0, 0, 0))
import ufl
ds = ufl.Measure("ds", domain=mesh)
def epsilon(u):
    return ufl.sym(ufl.grad(u))  # Equivalent to 0.5*(ufl.nabla_grad(u) + ufl.nabla_grad(u).T)
def sigma(u):
    return lambda_ * ufl.nabla_div(u) * ufl.Identity(u.geometric_dimension()) + 2*mu*epsilon(u)
u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
f = dolfinx.Constant(mesh, (0, 0, -rho*g))
a = ufl.inner(sigma(u), epsilon(v)) * ufl.dx
L = ufl.dot(f, v) * ufl.dx + ufl.dot(T, v) * ds
problem = dolfinx.fem.LinearProblem(a, L, bcs=[bc], petsc_options={"ksp_type": "preonly", "pc_type": "lu"})
start = time.time()
uh = problem.solve()
end = time.time()
print(f'{MPI.COMM_WORLD.rank}: Time = {end-start:.3f} (s)')
and output:
root@feb3f65a1cf3:/home/shared# mpirun -n 1 python3 linearelasticity_code.py
0: Time = 0.315 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 2 python3 linearelasticity_code.py
0: Time = 0.115 (s)
1: Time = 0.115 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 3 python3 linearelasticity_code.py
0: Time = 0.094 (s)
1: Time = 0.094 (s)
2: Time = 0.094 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 4 python3 linearelasticity_code.py
0: Time = 0.081 (s)
1: Time = 0.081 (s)
2: Time = 0.081 (s)
3: Time = 0.081 (s)
root@feb3f65a1cf3:/home/shared# mpirun -n 8 python3 linearelasticity_code.py
0: Time = 0.060 (s)
1: Time = 0.060 (s)
2: Time = 0.060 (s)
3: Time = 0.060 (s)
4: Time = 0.060 (s)
5: Time = 0.060 (s)
6: Time = 0.060 (s)
7: Time = 0.060 (s)
Thanks, Dokken. I deleted my old container and started a new one, and now it works. However, my improvement (from 0.36 s to 0.14 s) is not as large as yours (from 0.3 s to 0.06 s); maybe that is due to differences between our computers.
I'm using a desktop computer with 64 GB of RAM and 16 processes. For further speedups, I would suggest changing from a direct to an iterative solver (especially if you increase the number of dofs).
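For example, here is a minimal sketch of that change, reusing the LinearProblem call from the script above and swapping the direct LU solve for a conjugate-gradient solver with an algebraic multigrid preconditioner; the specific PETSc options are just one reasonable choice, not a prescription:

# Hypothetical variant of the LinearProblem above: iterative Krylov solver instead of direct LU.
problem = dolfinx.fem.LinearProblem(
    a, L, bcs=[bc],
    petsc_options={
        "ksp_type": "cg",    # conjugate gradient (the elasticity system is symmetric positive definite)
        "pc_type": "gamg",   # PETSc's algebraic multigrid preconditioner
        "ksp_rtol": 1e-8,    # relative residual tolerance
    },
)
uh = problem.solve()

For elasticity, GAMG typically converges faster if it is also given the rigid-body modes as a near-nullspace, but even the plain options above tend to scale better in parallel than LU once the number of dofs grows.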