Nonlinear heat equation performance: dolfin vs. dolfinx

Hello,

I am transitioning from legacy dolfin to dolfinx, but I cannot reach the same performance when solving a nonlinear time-dependent problem.

Consider this example in legacy dolfin:

from fenics import *
import ufl_legacy as ufl
from dolfin import __version__ as fenics_version
import os


import time
mpi_rank = MPI.rank(MPI.comm_world)
set_log_active(False)
if mpi_rank == 0:
    print(f"FeniCS version: {fenics_version}")

domain = Mesh()
with XDMFFile("meshes/simple_system/mesh.xdmf") as infile:
    infile.read(domain)

mvc = MeshValueCollection("size_t", domain, 3)
with XDMFFile("meshes/simple_system/mesh_subdomains.xdmf") as infile:
    infile.read(mvc)
subdomains = cpp.mesh.MeshFunctionSizet(domain, mvc)

dx = Measure("dx", domain=domain, subdomain_data=subdomains)

t = 0
t_max = 2*3600    
dt = 2
theta = 0.5
T0 = 20.0
Qc = 300 # W
alpha = 5 #W/m2K
T_amb = 20
initial_condition = Constant(20.0)

V = FunctionSpace(domain, "CG", 1)
T = Function(V)
T.interpolate(initial_condition)
T_n = Function(V)
T_n.interpolate(initial_condition)
T_v = TestFunction(V)

Vc = assemble(Constant(1)*dx)
q = Qc/Vc

class Steel04:
    def __init__(self):
        pass
    
    def k(self, T):
        return 48.11-0.006*T-25.5e-6*T**2
    
    def rho(self, T):
        return 7850
    
    def cp(self, T):
        return 450
    
steel = Steel04()

F = steel.rho(T)*steel.cp(T)*T*T_v*dx - steel.rho(T_n)*steel.cp(T_n)*T_n*T_v*dx
F += theta*dt*inner(steel.k(T)*grad(T), grad(T_v))*dx + (1-theta)*dt*inner(steel.k(T_n)*grad(T_n), grad(T_v))*dx
F += theta*dt*((alpha*T*T_v) - (alpha*T_amb*T_v))*ds + (1-theta)*dt*((alpha*T_n*T_v) - (alpha*T_amb*T_v))*ds
F += -theta*dt*q*T_v*dx(3) - (1-theta)*dt*q*T_v*dx(3)

J = derivative(F, T)
problem = NonlinearVariationalProblem(F, T, [], J)
solver = NonlinearVariationalSolver(problem)
prm = solver.parameters
prm['newton_solver']['absolute_tolerance'] = 1e-6
prm['newton_solver']['relative_tolerance'] = 1e-6
prm['newton_solver']['linear_solver'] = 'gmres'
prm['newton_solver']['preconditioner'] = 'hypre_amg'
prm['newton_solver']['krylov_solver']['monitor_convergence'] = False

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    solver.solve()
    T_n.assign(T)
    
    if mpi_rank == 0:
        print(f"t={t}s, iter time: {time.time() - t_start}s")

This returns the following output:

FeniCS version: 2019.2.0.dev0
t=2s, iter time: 3.6996588706970215s
t=4s, iter time: 1.8177123069763184s
t=6s, iter time: 1.8490257263183594s
t=8s, iter time: 1.8499982357025146s
t=10s, iter time: 1.7931668758392334s
t=12s, iter time: 1.7902030944824219s
t=14s, iter time: 1.809546947479248s
t=16s, iter time: 1.8374125957489014s

I reimplemented the same problem using dolfinx:

from mpi4py import MPI
from petsc4py import PETSc
from dolfinx import mesh, io, plot, fem, nls

import dolfinx
import numpy as np
import ufl

import time

if MPI.COMM_WORLD.rank == 0:
    print(f"Dolfinx version: {dolfinx.__version__}")

domain, cell_tags, facet_tags = io.gmshio.read_from_msh("meshes/simple_system/mesh.msh", MPI.COMM_WORLD, 0, gdim=3)
dx = ufl.Measure("dx", domain=domain, subdomain_data=cell_tags)
ds = ufl.Measure("ds", domain=domain, subdomain_data=facet_tags)

t = 0
t_max = 2*3600    
dt = 2
theta = 0.5
T0 = 20.0
Qc = 300 #W
alpha = 5 #W/m2K
T_amb = 20
initial_condition = lambda x: np.full((x.shape[1],), 20.0)

V = fem.FunctionSpace(domain, ("CG", 1))

T_n = fem.Function(V)
T_n.name = "T_n"
T_n.interpolate(initial_condition)

T = fem.Function(V)
T.name = "T"
T.interpolate(initial_condition)

Vc = fem.assemble_scalar(fem.form(1*dx(3)))
Vc = domain.comm.allreduce(Vc, op=MPI.SUM)
q = Qc/Vc
T_v = ufl.TestFunction(V)

class Steel04:
    def __init__(self):
        pass
    
    def k(self, T):
        return 48.11-0.006*T-25.5e-6*T**2
    
    def rho(self, T):
        return 7850
    
    def cp(self, T):
        return 450
    
steel = Steel04()

F = steel.rho(T)*steel.cp(T)*T*T_v*dx - steel.rho(T_n)*steel.cp(T_n)*T_n*T_v*dx
F += theta*dt*ufl.dot(steel.k(T)*ufl.grad(T), ufl.grad(T_v))*dx + (1-theta)*dt*ufl.dot(steel.k(T_n)*ufl.grad(T_n), ufl.grad(T_v))*dx
F += theta*dt*((alpha*T*T_v) - (alpha*T_amb*T_v))*ds + (1-theta)*dt*((alpha*T_n*T_v) - (alpha*T_amb*T_v))*ds
F += -theta*dt*q*T_v*dx(3) - (1-theta)*dt*q*T_v*dx(3)

problem = fem.petsc.NonlinearProblem(F, T)
solver = nls.petsc.NewtonSolver(MPI.COMM_WORLD, problem)
solver.convergence_criterion = "incremental"
solver.atol = 1e-6
solver.rtol = 1e-6

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_type"] = "gmres"
opts[f"{option_prefix}pc_type"] = "gamg"
ksp.setFromOptions()

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    r = solver.solve(T)
    T_n.x.array[:] = T.x.array
    
    if MPI.COMM_WORLD.rank == 0:
        print(f"t={t}s, iter time: {time.time() - t_start}s")

which gives the following output:

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 109049 nodes
Info    : 632456 elements                                               
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=2s, iter time: 5.594205379486084s
t=4s, iter time: 5.5990447998046875s
t=6s, iter time: 5.569369316101074s
t=8s, iter time: 5.54309606552124s
t=10s, iter time: 5.508774280548096s
t=12s, iter time: 5.593702793121338s
t=14s, iter time: 5.546661376953125s
t=16s, iter time: 5.587730407714844s
t=18s, iter time: 5.540001392364502s

So dolfinx is considerably slower than the legacy version.

I suspect that dolfinx keeps recompiling the assembly/forms or something, but I have not found a way to prevent that.
Do I have to write a custom version of the whole time-stepping loop to avoid the JIT, or is it caused by some other issue?

Mesh data are available here: issues - Google Drive

Thank you for any reply.

First of all, you need to set which backend hypre should use for AMG.

opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"

You are also using a different convergence criterion for the Newton solver (Bitbucket).

solver.convergence_criterion = "residual"

There might be more subtle differences in your code. I have not got more time right now to investigate it.

The volume computations are also different: the legacy code integrates over the whole domain with assemble(Constant(1)*dx), while the dolfinx code integrates only over subdomain 3 with fem.assemble_scalar(fem.form(1*dx(3))). You therefore get different scaling and potentially different results and timings.

Thank you for the reply @dokken,

You are correct, I missed the solver.convergence_criterion = "residual" option, which reduces the iteration time from ~5.5s to ~3.8s.

The first option does not seem to change the iteration time.

opts[f"{option_prefix}ksp_type"] = "gmres"
opts[f"{option_prefix}pc_type"] = "gamg"
opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"

I have also noticed that the mesh I posted has some issues when running via mpirun. A properly refined mesh seems to make the iteration times similar.
mesh: issues - Google Drive

I have generated a new mesh with finer elements and it seems to run with the same iteration time when using mpirun.

The current comparison is therefore as follows:

mpirun -n 10: legacy dolfin ~2.66 s, dolfinx ~3.32 s
mpirun -n 20: legacy dolfin ~1.55 s, dolfinx ~1.95 s
mpirun -n 30: legacy dolfin ~1.15 s, dolfinx ~1.53 s

So dolfinx is much better now but still a little behind.

I will try to change the Vc computation now.

Yes, that was the last bit:

mpirun -n 10: legacy dolfin ~5.1 s, dolfinx ~3.32 s
mpirun -n 20: legacy dolfin ~2.9 s, dolfinx ~1.95 s
mpirun -n 30: legacy dolfin ~2.23 s, dolfinx ~1.53 s

Thank you very much @dokken !!

boomeramg is a component of hypre, not PETSc’s geometric algebraic multigrid (gamg).
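In other words, for boomeramg to take effect the PETSc PC type itself must be hypre, not gamg. With the options-based setup from the first post, that would look something like this (a minimal sketch reusing the ksp and option_prefix names defined there):

opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}pc_type"] = "hypre"            # the PC package itself must be hypre
opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"  # select BoomerAMG inside hypre
ksp.setFromOptions()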

You could likely get even better performance if you don’t rebuild the preconditioner each step too. See, e.g. KSPSetReusePreconditioner — PETSc 3.19.4 documentation and SNESSetLagPreconditioner — PETSc v3.19.4-959-g92f1e92e88f documentation.


Thank you @nate, you are correct!
I have ended up with this setting so far:

ksp = solver.krylov_solver
pc = ksp.getPC()
ksp.setType("gmres")
# pc.setType("hypre"); pc.setHYPREType("boomeramg")  # slower with hypre instead of gamg
pc.setType("gamg")
# ksp.setReusePreconditioner(True)  # KSP object has no attribute setReusePreconditioner
pc.setReusePreconditioner(True)  # works

which gives:

mpirun -n 30: dolfinx ~0.7 s, dolfinx without reused PC ~1.55 s

Regarding the hypre backend…

pc.setType("hypre")
pc.setHYPREType("boomeramg")

is slower than

pc.setType("gamg")

I do not know the details of how hypre works; it might need some finer tuning…
I have tried a few options, but it is always slower than GAMG.
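For reference, the kind of finer tuning I mean is the BoomerAMG options exposed through PETSc. Something along these lines; the specific values are just commonly suggested starting points for 3D problems, not settings I have validated here:

ksp = solver.krylov_solver
pc = ksp.getPC()
pc.setType("hypre")
pc.setHYPREType("boomeramg")

opts = PETSc.Options()
prefix = ksp.getOptionsPrefix() or ""
opts[f"{prefix}pc_hypre_boomeramg_strong_threshold"] = 0.7   # often suggested for 3D problems
opts[f"{prefix}pc_hypre_boomeramg_coarsen_type"] = "HMIS"
opts[f"{prefix}pc_hypre_boomeramg_interp_type"] = "ext+i"
pc.setFromOptions()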

Just bear in mind that the initially built preconditioner may not always be a good approximation of $\mathrm{A}$. I typically monitor the linear solve iteration count each time step and rebuild when that iteration count grows beyond a threshold.

Yes, thank you again. I am already on it:

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    r = solver.solve(T)
    T_n.x.array[:] = T.x.array

    if MPI.COMM_WORLD.rank == 0:
        print(f"t={t:.2f}s, it: {time.time() - t_start:.3f}s, NLSit={r[0]}, KSPit={ksp.its}, KSPnorm={ksp.norm:.2e}")

    if t % 60 == 0:
        xdmf.write_function(T, t)  # xdmf is an io.XDMFFile opened earlier

    if ksp.its > 15:
        pc.reset()  # rebuild the preconditioner once the linear solve needs too many iterations

I sometimes use the NLS iteration count for time-step adaptation.
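A rough sketch of what I mean is below; the adapt_dt helper, its thresholds, factors and bounds are placeholders rather than tuned values, and dt would have to enter the form as a fem.Constant (not a plain Python float) so that changing it does not require rebuilding the compiled form:

def adapt_dt(dt_value, newton_its, dt_min=0.5, dt_max=60.0):
    # Grow the step when Newton converges quickly, shrink it when it struggles.
    if newton_its <= 3:
        return min(1.2 * dt_value, dt_max)
    if newton_its >= 6:
        return max(0.5 * dt_value, dt_min)
    return dt_value

# inside the time loop, after r = solver.solve(T):
# dt.value = adapt_dt(dt.value, r[0])   # with dt a fem.Constant used in the form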

Just out of curiosity: is there some randomness involved when the KSP and PC get initialized?
Sometimes I get runs like this:

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 1160209 nodes
Info    : 7210891 elements                                              
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=20.00s, it: 1.997s, NLSit=2, KSPit=13, KSPnorm=1.85e-08
t=40.00s, it: 1.938s, NLSit=2, KSPit=13, KSPnorm=1.52e-08
t=60.00s, it: 1.952s, NLSit=2, KSPit=13, KSPnorm=5.43e-09
t=80.00s, it: 2.033s, NLSit=2, KSPit=13, KSPnorm=4.63e-09
t=100.00s, it: 1.989s, NLSit=2, KSPit=14, KSPnorm=1.20e-09
t=120.00s, it: 1.975s, NLSit=2, KSPit=14, KSPnorm=8.85e-10
t=140.00s, it: 1.977s, NLSit=2, KSPit=14, KSPnorm=9.37e-10
t=160.00s, it: 1.984s, NLSit=2, KSPit=14, KSPnorm=6.93e-10
t=180.00s, it: 1.985s, NLSit=2, KSPit=14, KSPnorm=7.32e-10
t=200.00s, it: 1.977s, NLSit=2, KSPit=14, KSPnorm=5.74e-10
t=220.00s, it: 1.968s, NLSit=2, KSPit=14, KSPnorm=6.02e-10
t=240.00s, it: 2.064s, NLSit=2, KSPit=14, KSPnorm=4.98e-10
t=260.00s, it: 2.002s, NLSit=2, KSPit=14, KSPnorm=5.17e-10
t=280.00s, it: 2.012s, NLSit=2, KSPit=14, KSPnorm=4.45e-10
t=300.00s, it: 1.999s, NLSit=2, KSPit=14, KSPnorm=4.58e-10
t=320.00s, it: 1.979s, NLSit=2, KSPit=14, KSPnorm=4.07e-10

And sometimes like this (no change in the code, but I get half the LS iterations):

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 1160209 nodes
Info    : 7210891 elements                                              
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=20.00s, it: 1.874s, NLSit=2, KSPit=7, KSPnorm=2.17e-08
t=40.00s, it: 1.110s, NLSit=2, KSPit=7, KSPnorm=1.35e-08
t=60.00s, it: 1.110s, NLSit=2, KSPit=7, KSPnorm=9.85e-09
t=80.00s, it: 1.107s, NLSit=2, KSPit=7, KSPnorm=7.27e-09
t=100.00s, it: 1.153s, NLSit=2, KSPit=7, KSPnorm=2.53e-09
t=120.00s, it: 1.135s, NLSit=2, KSPit=7, KSPnorm=1.80e-09
t=140.00s, it: 1.149s, NLSit=2, KSPit=7, KSPnorm=1.69e-09
t=160.00s, it: 1.127s, NLSit=2, KSPit=7, KSPnorm=1.33e-09
t=180.00s, it: 1.125s, NLSit=2, KSPit=7, KSPnorm=1.28e-09
t=200.00s, it: 1.123s, NLSit=2, KSPit=7, KSPnorm=1.09e-09
t=220.00s, it: 1.125s, NLSit=2, KSPit=7, KSPnorm=1.05e-09
t=240.00s, it: 1.128s, NLSit=2, KSPit=7, KSPnorm=9.45e-10
t=260.00s, it: 1.140s, NLSit=2, KSPit=7, KSPnorm=9.11e-10

Probably nothing I can control directly, other than tuning the PCType and so on…?

GAMG uses Chebyshev smoothing, and I think by default PETSc uses a random RHS for estimating the eigenvalues used in the Chebyshev smoother, see KSPChebyshevEstEigSetUseNoisy — PETSc 3.19.4 documentation.
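If you want reproducible timings, it should be possible to switch that estimate to a deterministic right-hand side through the options database. Roughly like this, reusing the ksp and option_prefix names from above; the mg_levels_ prefix for the GAMG level smoothers is an assumption here and worth verifying with -ksp_view:

opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
# assumed prefix for the GAMG level smoother KSPs; check with -ksp_view
opts[f"{option_prefix}mg_levels_ksp_chebyshev_esteig_noisy"] = "false"
ksp.setFromOptions()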


What is NLSit? I understand that KSPit is the number of linear solve iterations, but what does it actually mean in the log when it reads:

“INFO| Newton solver finished in 2 iterations and 24 linear solver iterations”

See this one-dimensional example:
1D newton solver animation

When we solve a nonlinear problem with a Newton solver, we first build a tangent linear version of the problem and solve it to find where it is zero (in the animation above, where the tangent line crosses the x-axis). This is essentially one Newton solver iteration, or more generally one nonlinear solve iteration (NLSit).

That zero is then the point at which the matrices of the next tangent linear problem are assembled.
We repeat this process until the solution converges.

So KSPit is the number of iterations of the linear solver that the Newton solver uses to solve each tangent linear problem, and NLSit is the number of tangent linear problems it took to find the solution of the nonlinear problem.

Kind of like the animation above shows, except in a multi-variable setting (the number of DOFs).
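A scalar sketch of the same idea (here the "linear solve" is just a division, so there is no KSPit counterpart, but the outer loop is exactly what NLSit counts):

def newton_1d(f, dfdx, x0, tol=1e-10, max_it=25):
    # Each pass through this loop is one nonlinear (Newton) iteration, i.e. NLSit.
    x = x0
    for nls_it in range(1, max_it + 1):
        dx = -f(x) / dfdx(x)  # in the PDE case this becomes an iterative linear solve (KSPit)
        x += dx
        if abs(dx) < tol:
            return x, nls_it
    raise RuntimeError("Newton did not converge")

# example: root of f(x) = x**3 - 2
x, its = newton_1d(lambda x: x**3 - 2, lambda x: 3 * x**2, x0=1.5)
print(x, its)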


Very clear - thanks man!