Hi,
While implementing a fairly straightforward explicit-dynamics example for a geometrically nonlinear vibrating beam, I noticed that assembling the residual vector takes a surprisingly long time. I'm hoping someone can give me some tips on speeding this up. Here is an MWE:
from dolfin import *
# Physical parameters
L, b, h = 0.1, 0.006, 0.02
rho, E, nu = Constant(2.7E3), Constant(73E9), Constant(0.33)
mu = E/2/(1 + nu)                 # Lame parameters, kept as UFL expressions
lmbda = E*nu/(1 + nu)/(1 - 2*nu)  # (Constant() expects numeric values, not UFL expressions)
# FE parameters
Nx, Ny, Nz = 30, 6, 8
P = 1
# Time-stepping parameters
dt = 1E-8
gamma = 0.5 # Newmark
beta = 0
# Mesh and function space
mesh = BoxMesh.create([Point(0, -0.5*b, -0.5*h), Point(L, 0.5*b, 0.5*h)], [Nx, Ny, Nz], CellType.Type.hexahedron)
V = VectorFunctionSpace(mesh, 'Lagrange', degree=P)
u = TrialFunction(V)
v = TestFunction(V)
# Clamped left boundary
def left(x, on_boundary):
    return near(x[0], 0.)
bc = DirichletBC(V, Constant((0, 0, 0)), left)
# Stress and strain operators
def eps(v):
    # Linear strain
    return sym(grad(v))

def eps_GL(v, v_):
    # Green-Lagrange strain
    return eps(v) + 0.5*grad(v).T*grad(v_)

def sigma(v, v_):
    # Linear elastic stress
    return lmbda*tr(eps_GL(v, v_))*Identity(v.geometric_dimension()) + 2*mu*eps_GL(v, v_)
# RHS residual load
def Res(u, v=v):
    F = assemble(-inner(sigma(u, u), eps(v))*dx)
    bc.apply(F)
    return F
# Mass matrix solver
M = PETScMatrix()
assemble(rho*inner(u, v)*dx, tensor=M)
bc.apply(M)
M_solver = LUSolver(M, 'mumps')
# Initial conditions
u_sol = interpolate(Expression(('0', '0.01*(1-cos(x[0]))', '0'), degree=1), V)
dudt_sol = Function(V)
# Initial acceleration
dduddt_sol = Function(V)
M_solver.solve(dduddt_sol.vector(), Res(u_sol))
# Predictions
u_predict = Function(V)
dudt_predict = Function(V)
def perform_time_stepping():
    # Wrapping this in a function purely for profiling purposes
    for t_step in range(1000):
        # Prediction of displacement and velocity
        u_predict.vector()[:] = u_sol.vector()[:] + dt*dudt_sol.vector()[:] + dt**2/2*(1 - 2*beta)*dduddt_sol.vector()[:]
        dudt_predict.vector()[:] = dudt_sol.vector()[:] + (1 - gamma)*dt*dduddt_sol.vector()[:]
        # Acceleration based on prediction
        residual = Res(u_predict)  # <-- 80% of the time is spent here
        M_solver.solve(dduddt_sol.vector(), residual)
        # New solution based on prediction and acceleration
        u_sol.vector()[:] = u_predict.vector()[:] + beta*dt**2*dduddt_sol.vector()[:]
        dudt_sol.vector()[:] = dudt_predict.vector()[:] + gamma*dt*dduddt_sol.vector()[:]
perform_time_stepping()
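For reference, with beta = 0 and gamma = 1/2 the updates above reduce to (u = displacement, v = velocity, a = acceleration):

u_{n+1} = u_n + dt*v_n + (dt^2/2)*a_n
a_{n+1} = M^{-1} * Res(u_{n+1})
v_{n+1} = v_n + (dt/2)*(a_n + a_{n+1})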
In the above code, I solve 1000 time steps with an explicit central difference method (of the Newmark-method type). This takes roughly 20 seconds. Unsurprisingly, there are two major sources of computational cost:
- Solving the mass-matrix system in every step (I use a consistent mass matrix and reuse the LU factorization; a diagonal, i.e. lumped, mass matrix would of course all but remove this cost, see the sketch after this list)
- Assembling the residual vector
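For completeness, this is the kind of row-sum lumping I have in mind (my own sketch, not part of the MWE; ones and M_lumped are names I am introducing here):

# Row-sum lumped mass: assembling the action of the mass form on a
# constant-1 function yields the row sums of M as a vector.
ones = Function(V)
ones.vector()[:] = 1.0
M_lumped = assemble(action(rho*inner(u, v)*dx, ones))
# In the loop, the LU solve would become an entrywise division:
#     dduddt_sol.vector().set_local(residual.get_local() / M_lumped.get_local())
# (the clamped dofs stay zero because Res already zeroes them via bc.apply)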
I had expected the mass-matrix solve to be far more significant, and the residual assembly to be almost negligible given how simple the residual form is. However, after running:
python3 -m cProfile -o ./out.profile MWE.py
snakeviz ./out.profile
I find that 83% of the time is spent inside the “Res” function, while only 15% goes to the mass-matrix solve.
How can I speed this up? Am I doing some unnecessary, costly initialization each time step? Should I preallocate something? Can I precompute a trilinear form? Can I force row-based assembly to make parallelization effective?
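To make the preallocation question concrete, this is the kind of thing I am imagining (just a sketch; res_form and res_vec are placeholder names I am introducing here):

# Build the residual form once, directly on the Function that is updated
# each step, and reuse a single output vector so assemble() does not
# reallocate. (res_form and res_vec are placeholder names, not part of the MWE.)
res_form = -inner(sigma(u_predict, u_predict), eps(v))*dx
res_vec = PETScVector()
# In the loop, instead of residual = Res(u_predict):
#     assemble(res_form, tensor=res_vec)
#     bc.apply(res_vec)
#     M_solver.solve(dduddt_sol.vector(), res_vec)

Is something along these lines what I should be doing?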