Assembling residual load-vector bottleneck

Hi,

While implementing a fairly straightforward explicit dynamics example for a geometrically nonlinear vibrating beam, I noticed that the assembly of the residual vector takes a surprisingly long time. I’m hoping someone can give me some tips on speeding this up. Here is an MWE:

from dolfin import *

# Physical parameters
L, b, h = 0.1, 0.006, 0.02
rho, E, nu = Constant(2.7E3), Constant(73E9), Constant(0.33)
mu = E/2/(1+nu)                      # Lame parameters as UFL expressions of the Constants above
lmbda = E*nu/(1+nu)/(1-2*nu)

# FE parameters
Nx, Ny, Nz = 30, 6, 8
P = 1

# Time-stepping parameters
dt = 1E-8
gamma = 0.5 # Newmark
beta = 0

# Mesh and function space
mesh = BoxMesh.create( [Point(0, -0.5*b, -0.5*h), Point(L, 0.5*b, 0.5*h)], [Nx, Ny, Nz],CellType.Type.hexahedron)
V = VectorFunctionSpace(mesh, 'Lagrange', degree = P)
u = TrialFunction(V)
v = TestFunction(V)

# Clamped left boundary
def left(x,on_boundary):
    return near(x[0], 0.)
bc = DirichletBC(V, Constant((0, 0, 0)), left)

# Stress and strain operators
def eps(v):
    # Linear strain
    return sym(grad(v))
def eps_GL(v,v_):
    # Green-Lagrange strain
    return eps(v) + 0.5*grad(v).T*grad(v_)
def sigma(v,v_):
    # St. Venant-Kirchhoff stress (linear in the Green-Lagrange strain)
    return lmbda*tr(eps_GL(v,v_))*Identity(v.geometric_dimension())+2*mu*eps_GL(v,v_)

# RHS residual load
def Res(u,v=v):
    F = assemble( -inner( sigma(u,u) , eps(v) )*dx )
    bc.apply(F)
    return F

# Mass matrix solver
M = PETScMatrix()
assemble( rho*inner(u,v)*dx, tensor=M)
bc.apply(M)
M_solver = LUSolver(M, 'mumps')

# Initial conditions
u_sol = interpolate( Expression(('0','0.01*(1-cos(x[0]))','0'),degree=1),V)
dudt_sol = Function(V)

# Initial acceleration
dduddt_sol = Function(V)
M_solver.solve( dduddt_sol.vector(), Res( u_sol ) )

# Predictions
u_predict = Function(V)
dudt_predict = Function(V)

def perform_time_stepping():
    # Wrapping this in a function purely for profiling purposes
    for t_step in range(1000):
        # Prediction of displacement and velocity
        u_predict.vector()[:] = u_sol.vector()[:] + dt*dudt_sol.vector()[:] +dt**2/2*(1-2*beta)*dduddt_sol.vector()[:]
        dudt_predict.vector()[:] = dudt_sol.vector()[:] + (1-gamma)*dt*dduddt_sol.vector()[:]

        # Acceleration based on prediction
        residual = Res( u_predict ) # <-- 80% of the time is spent here
        M_solver.solve( dduddt_sol.vector(), residual )

        # New solution based on prediction and acceleration
        u_sol.vector()[:] = u_predict.vector()[:] + beta*dt**2*dduddt_sol.vector()[:]
        dudt_sol.vector()[:] = dudt_predict.vector()[:] + gamma*dt*dduddt_sol.vector()[:]

perform_time_stepping()
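
For reference, with beta = 0 and gamma = 0.5 the prediction and correction steps above reduce to the standard explicit central-difference update:

$$u_{n+1} = u_n + \Delta t\,\dot{u}_n + \tfrac{\Delta t^2}{2}\,\ddot{u}_n,\qquad M\,\ddot{u}_{n+1} = R(u_{n+1}),\qquad \dot{u}_{n+1} = \dot{u}_n + \tfrac{\Delta t}{2}\left(\ddot{u}_n + \ddot{u}_{n+1}\right)$$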

In the above code, I solve 1000 time steps with an explicit central difference method (of the Newmark-method type). This takes roughly 20 seconds. Unsurprisingly, there are two major sources of computational cost:

  • Inverting the mass matrix in every step (I use a consistent mass matrix and reuse the LU factorization; a diagonal, lumped mass matrix would of course all but remove this cost, see the sketch after this list)
  • Assembling the residual vector
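
For the first point, a minimal sketch of row-sum mass lumping (not part of the MWE above; it assumes the names V, rho, u, v and bc defined there) could look like this:

ones = Function(V)
ones.vector()[:] = 1.0
# Assembling the action of the mass form on a constant unit field gives the row
# sums of M, i.e. a lumped (diagonal) mass matrix stored as a vector
M_lumped = assemble( action( rho*inner(u,v)*dx, ones ) )

# The mass solve then reduces to an elementwise division, e.g.
#   dduddt_sol.vector()[:] = residual[:] / M_lumped[:]
# For P1 elements the row sums are strictly positive, and the Dirichlet rows of
# the residual are already zeroed by bc.apply, so no extra BC handling is needed.

That is a separate optimization though, and not what this question is about.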

I had expected the former to be way more significant, and the latter to be almost negligible due to the simplicity of the residual vector. However, after running:

python3 -m cProfile -o ./out.profile MWE.py
snakeviz ./out.profile

I find that 83% of the time is spent inside the “Res” function, while only 15% goes towards the mass-matrix solve-step.

How can I speed this up? Am I doing some unnecessary, costly initialization each time step? Should I preallocate something? Can I precompute a trilinear form? Can I exploit row-based assembly so that parallelization becomes very effective?

Your assembly procedure is creating a new data structure each time. It’s probably straightforward for you to reuse a PETScVector. Something like:

...
residual = PETScVector()
...
def Res(u,v=v):
    assemble( -inner( sigma(u,u) , eps(v) )*dx, tensor=residual)
    bc.apply(residual)
...

Also, if you can avoid recreating -inner( sigma(u,u) , eps(v) )*dx every time step, you’ll have some savings. Constructing the UFL form is cheap when you do it once, but not when you repeat it every time step.

You could also consider setting up a SystemAssembler or Assembler object which will cache some of the operations required by assembly. It might not give a huge speedup though.

Thanks for the suggestions. I’m afraid they don’t change much, though. I replaced the Res definition in the code with:

residual = PETScVector()
u_predict = Function(V)
a = -inner( sigma(u_predict,u_predict) , eps(v) )*dx
def Res(v=v,a=a,residual=residual):
    assemble( a , tensor=residual)
    bc.apply(residual)
    return residual

to preallocate the vector and to avoid re-initializing the form. This saves roughly 1.8 seconds, but assembling the load vector still accounts for ~80% of the cost.

I would like to try your Assembler suggestion, but can’t seem to find any documentation. It looks like I need to replace the Res function by:

res_assembler = Assembler()

and then call:

res_assembler.assemble( residual,a )

during the time-stepping. But that’s not working (it asks for a dolfin::GenericTensor as the first argument). Could you point me in the right direction?