Nonlinear heat equation performance: dolfin vs. dolfinx

Hello,

I am transitioning from legacy dolfin to dolfinx, but I cannot reach the same performance when solving a nonlinear time-dependent problem.

Consider this example in legacy dolfin:

from fenics import *
import ufl_legacy as ufl
from dolfin import __version__ as fenics_version
import os


import time
mpi_rank = MPI.rank(MPI.comm_world)
set_log_active(False)
if mpi_rank == 0:
    print(f"FeniCS version: {fenics_version}")

domain = Mesh()
with XDMFFile("meshes/simple_system/mesh.xdmf") as infile:
    infile.read(domain)

mvc = MeshValueCollection("size_t", domain, 3)
with XDMFFile("meshes/simple_system/mesh_subdomains.xdmf") as infile:
    infile.read(mvc)
subdomains = cpp.mesh.MeshFunctionSizet(domain, mvc)

dx = Measure("dx", domain=domain, subdomain_data=subdomains)

t = 0
t_max = 2*3600    
dt = 2
theta = 0.5
T0 = 20.0
Qc = 300 # W
alpha = 5 #W/m2K
T_amb = 20
initial_condition = Constant(20.0)

V = FunctionSpace(domain, "CG", 1)
T = Function(V)
T.interpolate(initial_condition)
T_n = Function(V)
T_n.interpolate(initial_condition)
T_v = TestFunction(V)

Vc = assemble(Constant(1)*dx)
q = Qc/Vc

class Steel04:
    def __init__(self):
        pass
    
    def k(self, T):
        return 48.11-0.006*T-25.5e-6*T**2
    
    def rho(self, T):
        return 7850
    
    def cp(self, T):
        return 450
    
steel = Steel04()

F = steel.rho(T)*steel.cp(T)*T*T_v*dx - steel.rho(T_n)*steel.cp(T_n)*T_n*T_v*dx
F += theta*dt*inner(steel.k(T)*grad(T), grad(T_v))*dx + (1-theta)*dt*inner(steel.k(T_n)*grad(T_n), grad(T_v))*dx
F += theta*dt*((alpha*T*T_v) - (alpha*T_amb*T_v))*ds + (1-theta)*dt*((alpha*T_n*T_v) - (alpha*T_amb*T_v))*ds
F += -theta*dt*q*T_v*dx(3) - (1-theta)*dt*q*T_v*dx(3)

J = derivative(F, T)
problem = NonlinearVariationalProblem(F, T, [], J)
solver = NonlinearVariationalSolver(problem)
prm = solver.parameters
prm['newton_solver']['absolute_tolerance'] = 1e-6
prm['newton_solver']['relative_tolerance'] = 1e-6
prm['newton_solver']['linear_solver'] = 'gmres'
prm['newton_solver']['preconditioner'] = 'hypre_amg'
prm['newton_solver']['krylov_solver']['monitor_convergence'] = False

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    solver.solve()
    T_n.assign(T)
    
    if mpi_rank == 0:
        print(f"t={t}s, iter time: {time.time() - t_start}s")

This returns the following output:

FeniCS version: 2019.2.0.dev0
t=2s, iter time: 3.6996588706970215s
t=4s, iter time: 1.8177123069763184s
t=6s, iter time: 1.8490257263183594s
t=8s, iter time: 1.8499982357025146s
t=10s, iter time: 1.7931668758392334s
t=12s, iter time: 1.7902030944824219s
t=14s, iter time: 1.809546947479248s
t=16s, iter time: 1.8374125957489014s

I reimplemented the same problem using dolfinx:

from mpi4py import MPI
from petsc4py import PETSc
from dolfinx import mesh, io, plot, fem, nls

import dolfinx
import numpy as np
import ufl

import time

if MPI.COMM_WORLD.rank == 0:
    print(f"Dolfinx version: {dolfinx.__version__}")

domain, cell_tags, facet_tags = io.gmshio.read_from_msh("meshes/simple_system/mesh.msh", MPI.COMM_WORLD, 0, gdim=3)
dx = ufl.Measure("dx", domain=domain, subdomain_data=cell_tags)
ds = ufl.Measure("ds", domain=domain, subdomain_data=facet_tags)

t = 0
t_max = 2*3600    
dt = 2
theta = 0.5
T0 = 20.0
Qc = 300 #W
alpha = 5 #W/m2K
T_amb = 20
initial_condition = lambda x: np.full((x.shape[1],), 20.0)

V = fem.FunctionSpace(domain, ("CG", 1))

T_n = fem.Function(V)
T_n.name = "T_n"
T_n.interpolate(initial_condition)

T = fem.Function(V)
T.name = "T"
T.interpolate(initial_condition)

Vc = fem.assemble_scalar(fem.form(1*dx(3)))
Vc = domain.comm.allreduce(Vc, op=MPI.SUM)
q = Qc/Vc
T_v = ufl.TestFunction(V)

class Steel04:
    def __init__(self):
        pass
    
    def k(self, T):
        return 48.11-0.006*T-25.5e-6*T**2
    
    def rho(self, T):
        return 7850
    
    def cp(self, T):
        return 450
    
steel = Steel04()

F = steel.rho(T)*steel.cp(T)*T*T_v*dx - steel.rho(T_n)*steel.cp(T_n)*T_n*T_v*dx
F += theta*dt*ufl.dot(steel.k(T)*ufl.grad(T), ufl.grad(T_v))*dx + (1-theta)*dt*ufl.dot(steel.k(T_n)*ufl.grad(T_n), ufl.grad(T_v))*dx
F += theta*dt*((alpha*T*T_v) - (alpha*T_amb*T_v))*ds + (1-theta)*dt*((alpha*T_n*T_v) - (alpha*T_amb*T_v))*ds
F += -theta*dt*q*T_v*dx(3) - (1-theta)*dt*q*T_v*dx(3)

problem = fem.petsc.NonlinearProblem(F, T)
solver = nls.petsc.NewtonSolver(MPI.COMM_WORLD, problem)
solver.convergence_criterion = "incremental"
solver.atol = 1e-6
solver.rtol = 1e-6

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_type"] = "gmres"
opts[f"{option_prefix}pc_type"] = "gamg"
ksp.setFromOptions()

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    r = solver.solve(T)
    T_n.x.array[:] = T.x.array
    
    if MPI.COMM_WORLD.rank == 0:
        print(f"t={t}s, iter time: {time.time() - t_start}s")

which gives the following output:

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 109049 nodes
Info    : 632456 elements                                               
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=2s, iter time: 5.594205379486084s
t=4s, iter time: 5.5990447998046875s
t=6s, iter time: 5.569369316101074s
t=8s, iter time: 5.54309606552124s
t=10s, iter time: 5.508774280548096s
t=12s, iter time: 5.593702793121338s
t=14s, iter time: 5.546661376953125s
t=16s, iter time: 5.587730407714844s
t=18s, iter time: 5.540001392364502s

So dolfinx is considerably slower than the legacy version.

I suspect that dolfinx keeps recompiling the assembly/forms or something, but I have not found a way to prevent that.
Do I have to write a custom version of the whole time-stepping loop to avoid the JIT, or is it caused by some other issue?

Mesh data are available here: issues - Google Drive

Thank you for any reply.

First of all, you need to set which backend hypre should use for AMG.

opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"

You are also using a different convergence criterion for the Newton solver (Bitbucket).

solver.convergence_criterion = "residual"

There might be more subtle differences in your code. I have not got more time right now to investigate it.

The volume computations are also different: the legacy code integrates over the whole domain with assemble(Constant(1)*dx), while the dolfinx code integrates only over subdomain 3 with fem.assemble_scalar(fem.form(1*dx(3))). You therefore get different scaling and potentially different results and timings.

Thank you for the reply @dokken,

You are correct, I missed the solver.convergence_criterion = "residual" option, which reduces the iteration time from ~5.5s to ~3.8s.

The first option does not seem to change the iteration time.

opts[f"{option_prefix}ksp_type"] = "gmres"
opts[f"{option_prefix}pc_type"] = "gamg"
opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"

I have also noticed that the mesh I posted has some issues when running via mpirun. A properly refined mesh seems to make the iteration times similar.
mesh: issues - Google Drive

I have generated a new mesh with finer elements and it seems to run with the same iteration time when using mpirun.

The current comparison is therefore as follows:

mpirun -n 10: legacy dolfin ~2.66 s, dolfinx ~3.32 s
mpirun -n 20: legacy dolfin ~1.55 s, dolfinx ~1.95 s
mpirun -n 30: legacy dolfin ~1.15 s, dolfinx ~1.53 s

So dolfinx is much better now but still a little behind.

I will try to change the Vc computation now.

Yes, that was the last bit:

mpirun -n 10: legacy dolfin ~5.1 s, dolfinx ~3.32 s
mpirun -n 20: legacy dolfin ~2.9 s, dolfinx ~1.95 s
mpirun -n 30: legacy dolfin ~2.23 s, dolfinx ~1.53 s

Thank you very much @dokken !!

boomeramg is a component of hypre, not PETSc’s geometric algebraic multigrid (gamg).
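In other words, for boomeramg to take effect the PETSc PC type itself must be hypre, not gamg. With the options-based setup from the first post, that would look something like this (a minimal sketch reusing the ksp and option_prefix names defined there):

opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}pc_type"] = "hypre"            # the PC package itself must be hypre
opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"  # select BoomerAMG inside hypre
ksp.setFromOptions()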

You could likely get even better performance if you don’t rebuild the preconditioner each step too. See, e.g. KSPSetReusePreconditioner — PETSc 3.19.4 documentation and SNESSetLagPreconditioner — PETSc v3.19.4-959-g92f1e92e88f documentation.


Thank you @nate, you are correct!
I have ended up with this setting so far:

ksp = solver.krylov_solver
pc = ksp.getPC()
ksp.setType("gmres")
# pc.setType("hypre"); pc.setHYPREType("boomeramg")  # slower with hypre instead of gamg
pc.setType("gamg")
# ksp.setReusePreconditioner(True)  # KSP object has no attribute setReusePreconditioner
pc.setReusePreconditioner(True)  # works

which gives:

mpirun -n 30: dolfinx ~0.7 s, dolfinx without reused PC ~1.55 s

Regarding the hypre backend…

pc.setType("hypre")
pc.setHYPREType("boomeramg")

is slower than

pc.setType("gamg")

I do not know the details of how hypre works; it might need some finer tuning…
I have tried a few options, but it is always slower than GAMG.
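For reference, the kind of finer tuning I mean is the BoomerAMG options exposed through PETSc. Something along these lines; the specific values are just commonly suggested starting points for 3D problems, not settings I have validated here:

ksp = solver.krylov_solver
pc = ksp.getPC()
pc.setType("hypre")
pc.setHYPREType("boomeramg")

opts = PETSc.Options()
prefix = ksp.getOptionsPrefix() or ""
opts[f"{prefix}pc_hypre_boomeramg_strong_threshold"] = 0.7   # often suggested for 3D problems
opts[f"{prefix}pc_hypre_boomeramg_coarsen_type"] = "HMIS"
opts[f"{prefix}pc_hypre_boomeramg_interp_type"] = "ext+i"
pc.setFromOptions()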

Just bear in mind that the initially built preconditioner may not always be a good approximation of $\mathrm{A}$. I typically monitor the linear solve iteration count each time step and rebuild when that iteration count grows beyond a threshold.

Yes, thank you again. I am already on it:

num_steps = int(t_max/dt)
for i in range(num_steps):
    t_start = time.time()
    t += dt

    r = solver.solve(T)
    T_n.x.array[:] = T.x.array

    if MPI.COMM_WORLD.rank == 0:
        print(f"t={t:.2f}s, it: {time.time() - t_start:.3f}s, NLSit={r[0]}, KSPit={ksp.its}, KSPnorm={ksp.norm:.2e}")

    if t % 60 == 0:
        xdmf.write_function(T, t)  # xdmf is an io.XDMFFile opened earlier

    if ksp.its > 15:
        pc.reset()  # rebuild the preconditioner once the linear solve needs too many iterations

I sometimes use the NLS iteration count for time-step adaptation.
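A rough sketch of what I mean is below; the adapt_dt helper, its thresholds, factors and bounds are placeholders rather than tuned values, and dt would have to enter the form as a fem.Constant (not a plain Python float) so that changing it does not require rebuilding the compiled form:

def adapt_dt(dt_value, newton_its, dt_min=0.5, dt_max=60.0):
    # Grow the step when Newton converges quickly, shrink it when it struggles.
    if newton_its <= 3:
        return min(1.2 * dt_value, dt_max)
    if newton_its >= 6:
        return max(0.5 * dt_value, dt_min)
    return dt_value

# inside the time loop, after r = solver.solve(T):
# dt.value = adapt_dt(dt.value, r[0])   # with dt a fem.Constant used in the form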

Just out of curiosity: is there some randomness involved when the KSP and PC get initialized?
Sometimes I get runs like this:

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 1160209 nodes
Info    : 7210891 elements                                              
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=20.00s, it: 1.997s, NLSit=2, KSPit=13, KSPnorm=1.85e-08
t=40.00s, it: 1.938s, NLSit=2, KSPit=13, KSPnorm=1.52e-08
t=60.00s, it: 1.952s, NLSit=2, KSPit=13, KSPnorm=5.43e-09
t=80.00s, it: 2.033s, NLSit=2, KSPit=13, KSPnorm=4.63e-09
t=100.00s, it: 1.989s, NLSit=2, KSPit=14, KSPnorm=1.20e-09
t=120.00s, it: 1.975s, NLSit=2, KSPit=14, KSPnorm=8.85e-10
t=140.00s, it: 1.977s, NLSit=2, KSPit=14, KSPnorm=9.37e-10
t=160.00s, it: 1.984s, NLSit=2, KSPit=14, KSPnorm=6.93e-10
t=180.00s, it: 1.985s, NLSit=2, KSPit=14, KSPnorm=7.32e-10
t=200.00s, it: 1.977s, NLSit=2, KSPit=14, KSPnorm=5.74e-10
t=220.00s, it: 1.968s, NLSit=2, KSPit=14, KSPnorm=6.02e-10
t=240.00s, it: 2.064s, NLSit=2, KSPit=14, KSPnorm=4.98e-10
t=260.00s, it: 2.002s, NLSit=2, KSPit=14, KSPnorm=5.17e-10
t=280.00s, it: 2.012s, NLSit=2, KSPit=14, KSPnorm=4.45e-10
t=300.00s, it: 1.999s, NLSit=2, KSPit=14, KSPnorm=4.58e-10
t=320.00s, it: 1.979s, NLSit=2, KSPit=14, KSPnorm=4.07e-10

And sometimes like this (no change in the code, but I get half the LS iterations):

Dolfinx version: 0.6.0
Info    : Reading 'meshes/simple_system/mesh.msh'...
Info    : 486 entities
Info    : 1160209 nodes
Info    : 7210891 elements                                              
Info    : Done reading 'meshes/simple_system/mesh.msh'                     
t=20.00s, it: 1.874s, NLSit=2, KSPit=7, KSPnorm=2.17e-08
t=40.00s, it: 1.110s, NLSit=2, KSPit=7, KSPnorm=1.35e-08
t=60.00s, it: 1.110s, NLSit=2, KSPit=7, KSPnorm=9.85e-09
t=80.00s, it: 1.107s, NLSit=2, KSPit=7, KSPnorm=7.27e-09
t=100.00s, it: 1.153s, NLSit=2, KSPit=7, KSPnorm=2.53e-09
t=120.00s, it: 1.135s, NLSit=2, KSPit=7, KSPnorm=1.80e-09
t=140.00s, it: 1.149s, NLSit=2, KSPit=7, KSPnorm=1.69e-09
t=160.00s, it: 1.127s, NLSit=2, KSPit=7, KSPnorm=1.33e-09
t=180.00s, it: 1.125s, NLSit=2, KSPit=7, KSPnorm=1.28e-09
t=200.00s, it: 1.123s, NLSit=2, KSPit=7, KSPnorm=1.09e-09
t=220.00s, it: 1.125s, NLSit=2, KSPit=7, KSPnorm=1.05e-09
t=240.00s, it: 1.128s, NLSit=2, KSPit=7, KSPnorm=9.45e-10
t=260.00s, it: 1.140s, NLSit=2, KSPit=7, KSPnorm=9.11e-10

Probably nothing I can control directly, other than tuning the PCType and so on…?

GAMG uses Chebyshev smoothing, and I think by default PETSc uses a random RHS for estimating the eigenvalues used in the Chebyshev smoother, see KSPChebyshevEstEigSetUseNoisy — PETSc 3.19.4 documentation.
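If you want reproducible timings, it should be possible to switch that estimate to a deterministic right-hand side through the options database. Roughly like this, reusing the ksp and option_prefix names from above; the mg_levels_ prefix for the GAMG level smoothers is an assumption here and worth verifying with -ksp_view:

opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
# assumed prefix for the GAMG level smoother KSPs; check with -ksp_view
opts[f"{option_prefix}mg_levels_ksp_chebyshev_esteig_noisy"] = "false"
ksp.setFromOptions()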


What is NLSit? I understand that KSPit is the number of linear solve iterations, but what does it actually mean in the log when it reads:

“INFO| Newton solver finished in 2 iterations and 24 linear solver iterations”

See this one-dimensional example:
1D newton solver animation

When we solve a nonlinear problem with a Newton solver, we first build a tangent linear version of the problem and solve it to find where it is zero (in the animation above, where the tangent line crosses the x-axis). This is essentially one Newton solver iteration, or more generally one nonlinear solve iteration (NLSit).

That zero is then the point at which the matrices of the next tangent linear problem are assembled.
We repeat this process until the solution converges.

So KSPit is the number of iterations of the linear solver that the Newton solver uses to solve each tangent linear problem, and NLSit is the number of tangent linear problems it took to find the solution of the nonlinear problem.

Kind of like the animation above shows, except in a multi-variable setting (the number of DOFs).
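A scalar sketch of the same idea (here the "linear solve" is just a division, so there is no KSPit counterpart, but the outer loop is exactly what NLSit counts):

def newton_1d(f, dfdx, x0, tol=1e-10, max_it=25):
    # Each pass through this loop is one nonlinear (Newton) iteration, i.e. NLSit.
    x = x0
    for nls_it in range(1, max_it + 1):
        dx = -f(x) / dfdx(x)  # in the PDE case this becomes an iterative linear solve (KSPit)
        x += dx
        if abs(dx) < tol:
            return x, nls_it
    raise RuntimeError("Newton did not converge")

# example: root of f(x) = x**3 - 2
x, its = newton_1d(lambda x: x**3 - 2, lambda x: 3 * x**2, x0=1.5)
print(x, its)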


Very clear - thanks man!