Assembling residual load-vector bottleneck

Hi,

While implementing a fairly straightforward explicit dynamics example for a geometrically nonlinear vibrating beam, I noticed that the assembly of the residual vector takes a surprisingly long time. I’m hoping someone can give me some tips on speeding this up. Here is an MWE:

from dolfin import *

# Physical parameters
L, b, h = 0.1, 0.006, 0.02
rho, E, nu = Constant(2.7E3), Constant(73E9), Constant(0.33)
mu = E/2/(1+nu)                      # Lame parameters as UFL expressions of the Constants above
lmbda = E*nu/(1+nu)/(1-2*nu)

# FE parameters
Nx, Ny, Nz = 30, 6, 8
P = 1

# Time-stepping parameters
dt = 1E-8
gamma = 0.5 # Newmark
beta = 0

# Mesh and function space
mesh = BoxMesh.create( [Point(0, -0.5*b, -0.5*h), Point(L, 0.5*b, 0.5*h)], [Nx, Ny, Nz],CellType.Type.hexahedron)
V = VectorFunctionSpace(mesh, 'Lagrange', degree = P)
u = TrialFunction(V)
v = TestFunction(V)

# Clamped left boundary
def left(x,on_boundary):
    return near(x[0], 0.)
bc = DirichletBC(V, Constant((0, 0, 0)), left)

# Stress and strain operators
def eps(v):
    # Linear strain
    return sym(grad(v))
def eps_GL(v,v_):
    # Green-Lagrange strain
    return eps(v) + 0.5*grad(v).T*grad(v_)
def sigma(v,v_):
    # St. Venant-Kirchhoff stress (linear in the Green-Lagrange strain)
    return lmbda*tr(eps_GL(v,v_))*Identity(v.geometric_dimension())+2*mu*eps_GL(v,v_)

# RHS residual load
def Res(u,v=v):
    F = assemble( -inner( sigma(u,u) , eps(v) )*dx )
    bc.apply(F)
    return F

# Mass matrix solver
M = PETScMatrix()
assemble( rho*inner(u,v)*dx, tensor=M)
bc.apply(M)
M_solver = LUSolver(M, 'mumps')

# Initial conditions
u_sol = interpolate( Expression(('0','0.01*(1-cos(x[0]))','0'),degree=1),V)
dudt_sol = Function(V)

# Initial acceleration
dduddt_sol = Function(V)
M_solver.solve( dduddt_sol.vector(), Res( u_sol ) )

# Predictions
u_predict = Function(V)
dudt_predict = Function(V)

def perform_time_stepping():
    # Wrapping this in a function purely for profiling purposes
    for t_step in range(1000):
        # Prediction of displacement and velocity
        u_predict.vector()[:] = u_sol.vector()[:] + dt*dudt_sol.vector()[:] +dt**2/2*(1-2*beta)*dduddt_sol.vector()[:]
        dudt_predict.vector()[:] = dudt_sol.vector()[:] + (1-gamma)*dt*dduddt_sol.vector()[:]

        # Acceleration based on prediction
        residual = Res( u_predict ) # <-- 80% of the time is spent here
        M_solver.solve( dduddt_sol.vector(), residual )

        # New solution based on prediction and acceleration
        u_sol.vector()[:] = u_predict.vector()[:] + beta*dt**2*dduddt_sol.vector()[:]
        dudt_sol.vector()[:] = dudt_predict.vector()[:] + gamma*dt*dduddt_sol.vector()[:]

perform_time_stepping()
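
For reference, with beta = 0 and gamma = 0.5 the prediction and correction steps above reduce to the standard explicit central-difference update:

$$u_{n+1} = u_n + \Delta t\,\dot{u}_n + \tfrac{\Delta t^2}{2}\,\ddot{u}_n,\qquad M\,\ddot{u}_{n+1} = R(u_{n+1}),\qquad \dot{u}_{n+1} = \dot{u}_n + \tfrac{\Delta t}{2}\left(\ddot{u}_n + \ddot{u}_{n+1}\right)$$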

In the above code, I solve 1000 time steps with an explicit central difference method (of the Newmark-method type). This takes roughly 20 seconds. Unsurprisingly, there are two major sources of computational cost:

  • Inverting the mass matrix in every step (I use a consistent mass matrix and reuse the LU factorization; a diagonal, lumped mass matrix would of course all but remove this cost, see the sketch after this list)
  • Assembling the residual vector
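
For the first point, a minimal sketch of row-sum mass lumping (not part of the MWE above; it assumes the names V, rho, u, v and bc defined there) could look like this:

ones = Function(V)
ones.vector()[:] = 1.0
# Assembling the action of the mass form on a constant unit field gives the row
# sums of M, i.e. a lumped (diagonal) mass matrix stored as a vector
M_lumped = assemble( action( rho*inner(u,v)*dx, ones ) )

# The mass solve then reduces to an elementwise division, e.g.
#   dduddt_sol.vector()[:] = residual[:] / M_lumped[:]
# For P1 elements the row sums are strictly positive, and the Dirichlet rows of
# the residual are already zeroed by bc.apply, so no extra BC handling is needed.

That is a separate optimization though, and not what this question is about.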

I had expected the former to be way more significant, and the latter to be almost negligible due to the simplicity of the residual vector. However, after running:

python3 -m cProfile -o ./out.profile MWE.py
snakeviz ./out.profile

I find that 83% of the time is spent inside the “Res” function, while only 15% goes towards the mass-matrix solve-step.

How can I speed this up? Am I doing some unnecessary, costly initialization each time step? Should I preallocate something? Can I precompute a trilinear form? Can I exploit row-based assembly so that parallelization becomes very effective?

Your assembly procedure is creating a new data structure each time. It’s probably straightforward for you to reuse a PETScVector. Something like:

...
residual = PETScVector()
...
def Res(u,v=v):
    assemble( -inner( sigma(u,u) , eps(v) )*dx, tensor=residual)
    bc.apply(residual)
...

Also, if you can avoid recreating -inner( sigma(u,u) , eps(v) )*dx every time step, you’ll have some savings. Constructing the UFL form is cheap when you do it once, but not when you repeat it every time step.

You could also consider setting up a SystemAssembler or Assembler object which will cache some of the operations required by assembly. It might not give a huge speedup though.

Thanks for the suggestions. I’m afraid they don’t change much, though. I replaced the Res definition in the code with:

residual = PETScVector()
u_predict = Function(V)
a = -inner( sigma(u_predict,u_predict) , eps(v) )*dx
def Res(v=v,a=a,residual=residual):
    assemble( a , tensor=residual)
    bc.apply(residual)
    return residual

to preallocate the vector and to avoid re-initializing the form. This saves roughly 1.8 seconds, but assembling the load vector still accounts for ~80% of the cost.

I would like to try your Assembler suggestion, but can’t seem to find any documentation. It looks like I need to replace the Res function by:

res_assembler = Assembler()

and then call:

res_assembler.assemble( residual,a )

during the time-stepping. But that’s not working (it asks for a dolfin::GenericTensor as the first argument). Could you point me in the right direction?