Slower performance with increasing MPI ranks

I am running a phase-field simulation in FEniCSx with mesh refinement,
using SNES (newtonls) + GMRES + hypre boomeramg.

I observe that the code becomes slower when increasing MPI ranks.

Here is my code:

from pathlib import Path

import numpy as np
from mpi4py import MPI
from petsc4py import PETSc

import dolfinx
import ufl
import ngsPETSc.utils.fenicsx as ngfx
from basix.ufl import element, mixed_element
from dolfinx import default_real_type, fem, geometry, log
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.io import XDMFFile
from netgen.geom2d import SplineGeometry
from packaging.version import Version
from ufl import as_vector, grad, inner, sqrt

# Save all logging to file
log.set_output_file("log.txt")
rank = MPI.COMM_WORLD.rank
###### Constant model parameters
r0 = 5
e4 = 0.05
Delta = 0.65
W0 = 1
D = 3
tau0 = 1
l = 1.5957 * D
lamda = 1.5957 * D
dt = 0.01  # time step
t = 0
T = 285  # final time

######################### MESH NETGEN
geo=SplineGeometry()
pnts= [(-800,-800),(800,-800),(800,800),(-800,800)]
p1,p2,p3,p4 = [geo.AppendPoint(*pnt) for pnt in pnts]
curves = [
    [["line",p1,p2],"line"],
    [["line",p2,p3],"line"],
    [["line",p3,p4],"line"],
    [["line",p4,p1],"line"],
]

for c, bc in curves:
    geo.Append(c, bc=bc)

#####################################
geoModel = ngfx.GeometricModel(geo, MPI.COMM_WORLD)   
domain, (ct, ft), region_map = geoModel.model_to_mesh(gdim=2, hmax=100)
########################################################
def mark_cells_radius_DG(domain, R_ref, center=(0.0, 0.0)):
    W = fem.functionspace(domain, ("DG", 0))

    x = ufl.SpatialCoordinate(domain)
    cx, cy = center
    r = sqrt((x[0] - cx)**2 + (x[1] - cy)**2)

    indicator = ufl.conditional(ufl.le(r, R_ref), 1.0, 0.0)

    markers = fem.Function(W)

    ip = W.element.interpolation_points
    if Version(dolfinx.__version__) < Version("0.10.0"):
        ip = ip()

    markers.interpolate(fem.Expression(indicator, ip))

    # --- IMPORTANT: keep only owned cells (exclude ghosts) ---
    tdim = domain.topology.dim
    n_owned = domain.topology.index_map(tdim).size_local
    arr_owned = markers.x.array[:n_owned]

    cells = np.flatnonzero(arr_owned > 0.5).astype(np.int32)
    return cells

mesh_id = 0
out_folder = Path("code_4_8cores")
out_folder.mkdir(parents=True, exist_ok=True)

# write the initial mesh here (optional)


# geometric initial refinement
n_init_ref = 7
R_ref = 185.0

tdim = domain.topology.dim

# compute_incident_entities needs this connectivity to exist
domain.topology.create_connectivity(tdim, tdim-1)

for k in range(n_init_ref):
    cells_local = mark_cells_radius_DG(domain, R_ref, center=(0.0, 0.0))

    # Safety: keep only owned (non-ghost) cells AND use the correct dtype
    n_owned = domain.topology.index_map(tdim).size_local
    cells_local = cells_local[cells_local < n_owned].astype(np.int32, copy=False)

    # Debug check (optional)
    # assert cells_local.size == 0 or (cells_local.min() >= 0 and cells_local.max() < n_owned)

    domain, (ct, ft) = geoModel.refineMarkedElements(tdim, cells_local)

###################### (optional) plot the mesh here to inspect it
######################################



class TipTracker:
    def __init__(self, mesh, x_min, x_max=0.0, y0=0.0, npts=5000,
                 out_path="tip_velocity.csv", comm=None):
        self.mesh = mesh
        self.comm = comm if comm is not None else MPI.COMM_WORLD
        self.rank = self.comm.rank

        self.x_min = float(x_min)
        self.x_max = float(x_max)
        self.y0 = float(y0)
        self.npts = int(npts)

        # sampling points (identical on all ranks)
        self.xs = np.linspace(self.x_min, self.x_max, self.npts)
        self.pts = np.zeros((self.npts, 3), dtype=np.float64)
        self.pts[:, 0] = self.xs
        self.pts[:, 1] = self.y0
        self.pts[:, 2] = 0.0

        # local bb_tree
        self.bb = geometry.bb_tree(self.mesh, self.mesh.topology.dim)

        # state
        self.x_tip_old = None

        # output file (rank 0 only)
        self.f = None
        if self.rank == 0:
            self.f = open(out_path, "w", buffering=1)
            self.f.write("step,t,x_tip,v_tip\n")

    def _local_eval_on_points(self, phi):
        """
        Each rank evaluates phi only at points that fall inside its local cells.
        Returns (idx_local, vals_local), where idx_local contains indices in [0..npts-1].
        """
        candidates = geometry.compute_collisions_points(self.bb, self.pts)
        colliding = geometry.compute_colliding_cells(self.mesh, candidates, self.pts)

        cells = np.full(self.npts, -1, dtype=np.int32)
        for i in range(self.npts):
            links = colliding.links(i)
            if len(links) > 0:
                cells[i] = links[0]

        valid = cells >= 0
        idx = np.where(valid)[0]
        if idx.size == 0:
            return idx.astype(np.int32), np.array([], dtype=np.float64)

        vals = phi.eval(self.pts[valid], cells[valid]).reshape(-1).astype(np.float64)
        return idx.astype(np.int32), vals

    @staticmethod
    def _compute_tip_from_line(xs, vals):
        """
        xs: (N,) x grid
        vals: (N,) phi(x, 0), with NaN at points that were not evaluated
        Returns x_tip from the FIRST (leftmost) zero-crossing; if none exists, the point of min |phi|.
        """
        ok = np.isfinite(vals)
        if np.count_nonzero(ok) < 2:
            return None

        xv = xs[ok]
        v = vals[ok]

        s = np.sign(v)
        idx = np.where(s[:-1] * s[1:] < 0)[0]
        if idx.size > 0:
            j = idx[0]
            x0, x1 = xv[j], xv[j + 1]
            f0, f1 = v[j], v[j + 1]
            return float(x0 - f0 * (x1 - x0) / (f1 - f0))

        j = int(np.argmin(np.abs(v)))
        return float(xv[j])

    def update(self, phi, step, t, dt):
        # 1) local evaluations
        idx_local, vals_local = self._local_eval_on_points(phi)

        # 2) gather indices + values to rank 0
        all_idx  = self.comm.gather(idx_local, root=0)
        all_vals = self.comm.gather(vals_local, root=0)

        x_tip = None
        v_tip = 0.0

        if self.rank == 0:
            # 3) stitch full line values
            vals = np.full(self.npts, np.nan, dtype=np.float64)
            for idx, vv in zip(all_idx, all_vals):
                vals[idx] = vv  # each point should belong to exactly one rank

            # 4) compute tip from the full line
            x_tip = self._compute_tip_from_line(self.xs, vals)

            # 5) velocity
            if x_tip is None or self.x_tip_old is None:
                v_tip = 0.0
            else:
                v_tip = (x_tip - self.x_tip_old) / dt

            self.x_tip_old = x_tip

            # 6) write
            if self.f is not None:
                self.f.write(f"{step},{t},{x_tip},{v_tip}\n")

        # 7) broadcast x_tip so all ranks share the same state (optional but good practice)
        x_tip = self.comm.bcast(x_tip, root=0)
        v_tip = self.comm.bcast(v_tip, root=0)
        self.x_tip_old = x_tip

        return x_tip, v_tip

    def close(self):
        if self.rank == 0 and self.f is not None:
            self.f.close()


##################################
def tau(phi):
    # Anisotropic relaxation time: tau0 * a(n)^2
    gradphi = grad(phi)
    absgradphi = sqrt(inner(gradphi, gradphi) + 1e-12)
    abs4gradphi = absgradphi**4
    sum_grad_phi4 = gradphi[0]**4 + gradphi[1]**4
    whole = sum_grad_phi4 / abs4gradphi
    a = (1 - 3 * e4) * (1 + ((4 * e4) / (1 - 3 * e4)) * whole)
    return tau0 * a**2

def W(phi):
    # Anisotropic interface width: W0 * a(n)
    gradphi = grad(phi)
    absgradphi = sqrt(inner(gradphi, gradphi) + 1e-12)
    abs4gradphi = absgradphi**4
    sum_grad_phi4 = gradphi[0]**4 + gradphi[1]**4
    whole = sum_grad_phi4 / abs4gradphi
    a = (1 - 3 * e4) * (1 + ((4 * e4) / (1 - 3 * e4)) * whole)
    return W0 * a

def omega(phi):
    # Anisotropy correction vector entering the divergence term
    gradphi = grad(phi)
    absgradphi = sqrt(inner(gradphi, gradphi) + 1e-12)
    abs4gradphi = absgradphi**4
    V1 = gradphi[0] * (gradphi[1]**2 * (gradphi[0]**2 - gradphi[1]**2))
    V2 = gradphi[1] * (gradphi[0]**2 * (gradphi[1]**2 - gradphi[0]**2))
    V = as_vector([V1, V2])
    return ((16 * e4 * W0) / abs4gradphi) * V

def f(phi, u):
    # Driving term of the phase-field equation
    return (phi - l * u * (1 - phi**2)) * (1 - phi**2)


def build_system(domain, dt, D, petsc_options,
                 old_sol=None, old_sol0=None,
                 init_phi=None, init_u=None,
                 petsc_prefix="Crystal_growth"):
    """
    Builds ME, sol, sol0, problem on the current domain (mesh).

    - If old_sol/old_sol0 are given: transfers the solution via nonmatching
      interpolation (for a mesh change).
    - Otherwise, if init_phi/init_u are given: sets the initial conditions.
    """

    # 1) Spaces + functions
    P1 = element("Lagrange", domain.basix_cell(), 1, dtype=default_real_type)
    ME = fem.functionspace(domain, mixed_element([P1, P1]))

    sol  = fem.Function(ME)
    sol0 = fem.Function(ME)

    # 2) either transfer the old solution (mesh changed) or set initial conditions
    if old_sol is not None and old_sol0 is not None:
        # Old components
        old_phi  = old_sol.sub(0)
        old_u    = old_sol.sub(1)
        old_phi0 = old_sol0.sub(0)
        old_u0   = old_sol0.sub(1)

        # Collapsed spaces + maps
        Vphi_new, map_phi = ME.sub(0).collapse()
        Vu_new,   map_u   = ME.sub(1).collapse()

        phi_new  = fem.Function(Vphi_new)
        u_new    = fem.Function(Vu_new)
        phi0_new = fem.Function(Vphi_new)
        u0_new   = fem.Function(Vu_new)

        # Nonmatching interpolation data
        tdim = domain.topology.dim
        num_local_cells = domain.topology.index_map(tdim).size_local
        new_cells = np.arange(num_local_cells, dtype=np.int32)

        idata_phi = fem.create_interpolation_data(Vphi_new, old_phi.function_space, new_cells)
        idata_u   = fem.create_interpolation_data(Vu_new,   old_u.function_space,   new_cells)

        # Interpolate old -> new
        phi_new.interpolate_nonmatching(old_phi,   new_cells, idata_phi)
        phi0_new.interpolate_nonmatching(old_phi0, new_cells, idata_phi)

        u_new.interpolate_nonmatching(old_u,   new_cells, idata_u)
        u0_new.interpolate_nonmatching(old_u0, new_cells, idata_u)

        # Write back into mixed sol/sol0
        sol.sub(0).x.array[map_phi]  = phi_new.x.array
        sol.sub(1).x.array[map_u]    = u_new.x.array
        sol0.sub(0).x.array[map_phi] = phi0_new.x.array
        sol0.sub(1).x.array[map_u]   = u0_new.x.array

        sol.x.scatter_forward()
        sol0.x.scatter_forward()

    else:
        # initial conditions (at the start)
        if init_phi is not None:
            sol.sub(0).interpolate(init_phi)
        if init_u is not None:
            sol.sub(1).interpolate(init_u)

        sol.x.scatter_forward()
        sol0.x.array[:] = sol.x.array
        sol0.x.scatter_forward()

    # 3) Weak form (always rebuilt on the new ME/sol/sol0)
    q, v = ufl.TestFunctions(ME)
    phi, u   = ufl.split(sol)
    phi0, u0 = ufl.split(sol0)

    F1 = (
        ufl.inner(u, v) * ufl.dx
        - ufl.inner(u0, v) * ufl.dx
        + dt * D * ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx
        - 0.5 * ufl.inner(phi, v) * ufl.dx
        + 0.5 * ufl.inner(phi0, v) * ufl.dx
    )

    F2 = (
        ufl.inner(tau(phi0) * phi, q) * ufl.dx
        - ufl.inner(tau(phi0) * phi0, q) * ufl.dx
        + dt * ufl.inner(W(phi0)**2 * ufl.grad(phi), ufl.grad(q)) * ufl.dx
        - dt * ufl.inner(f(phi0, u0), q) * ufl.dx
        + dt * ufl.inner(W(phi0) * omega(phi0), ufl.grad(q)) * ufl.dx
    )

    F = F1 + F2

    # 4) Problem
    problem = NonlinearProblem(F, sol,
                               petsc_options_prefix=petsc_prefix,
                               petsc_options=petsc_options)

    return ME, sol, sol0, problem





def mark_cells_phi_grad(phi: fem.Function, theta: float = 0.5):
    mesh = phi.function_space.mesh
    W = fem.functionspace(mesh, ("DG", 0))

    eta = fem.Function(W)
    ip = W.element.interpolation_points
    if Version(dolfinx.__version__) < Version("0.10.0"):
        ip = ip()
    eta.interpolate(fem.Expression(ufl.inner(ufl.grad(phi), ufl.grad(phi)), ip))

    eta_max = eta.x.petsc_vec.max()[1]

    should_refine = ufl.conditional(ufl.gt(eta, theta * eta_max), 1, 0)

    markers = fem.Function(W)
    markers.interpolate(fem.Expression(should_refine, ip))

    return np.flatnonzero(markers.x.array.astype(np.int32) == 1)


####### Solver options #######
use_superlu = PETSc.IntType == np.int64
sys = PETSc.Sys()
if sys.hasExternalPackage("mumps") and not use_superlu:
    linear_solver = "mumps"
elif sys.hasExternalPackage("superlu_dist"):
    linear_solver = "superlu_dist"
else:
    linear_solver = "petsc"

petsc_options = {
    "snes_type": "newtonls",
    "snes_linesearch_type": "basic",
    "snes_atol": 1e-6,
    "snes_rtol": 1e-6,
    "snes_monitor": None,
    "ksp_error_if_not_converged": True,
    "ksp_type": "gmres",
    "ksp_rtol": 1e-8,
    "ksp_monitor": None,
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
    "pc_hypre_boomeramg_max_iter": 1,
    "pc_hypre_boomeramg_cycle_type": "v",
}
################################################### Initial condition for the phase field phi ######
def Initial_condition_PHI(x):
    r1 = np.sqrt(x[0]**2 + x[1]**2)
    return -np.tanh((r1 - r0) / np.sqrt(2))
############## Initial dimensionless concentration u ##############
def Initial_condition_U(x):
    r2 = np.sqrt(x[0]**2 + x[1]**2)
    u_init = np.where(r2 <= r0,
                      0,
                      -Delta * (1 - np.exp(-(r2 - r0) / 2.5)))
    return u_init
###################################################### Initial problem setup
ME, sol, sol0, problem = build_system(
    domain, dt, D, petsc_options,
    init_phi=Initial_condition_PHI,
    init_u=Initial_condition_U
)



out_folder = Path("code_3_8cores")
out_folder.mkdir(parents=True, exist_ok=True)
file = XDMFFile(MPI.COMM_WORLD, out_folder / f"output_mesh{mesh_id}.xdmf", "w")
file.write_mesh(domain)
log_path = out_folder / "residuals.txt"
if rank == 0:
    f_log = open(log_path, "w", buffering=1)  # line-buffered
    f_log.write("step,t,snes_reason,snes_its,snes_norm,ksp_its,ksp_rnorm\n")
else:
    f_log = None

###############################
step = 0

gdim = domain.geometry.dim

# Auxiliary spaces for post-processing (grad(phi) and |grad(phi)|^2)
Vscal = fem.functionspace(domain, ("CG", 1))
Vvec = fem.functionspace(domain, ("CG", 1, (gdim,)))  # vector space

tip_tracker = TipTracker(
    mesh=domain,
    x_min=-200.0,
    x_max=0.0,
    y0=0.0,
    npts=2000,
    out_path=str(out_folder / "tip_velocity.csv"),
    comm=MPI.COMM_WORLD
)


while t < T:
    t += dt
    step += 1
    
    problem.solve()
    sol.x.scatter_forward()
    # diagnostics
    converged_reason = problem.solver.getConvergedReason()
    num_iterations = problem.solver.getIterationNumber()
    snes = problem.solver
    ksp = snes.getKSP()

    snes_reason = snes.getConvergedReason()
    snes_its    = snes.getIterationNumber()

    # SNES residual norm
    snes_norm = snes.getFunctionNorm()

    # KSP stats (last linear solve)
    ksp_its   = ksp.getIterationNumber()
    ksp_rnorm = ksp.getResidualNorm()

    # write the log line inside the loop
    if rank == 0:
        f_log.write(f"{step},{t},{snes_reason},{snes_its},{snes_norm},{ksp_its},{ksp_rnorm}\n")
    # 1) gradient vector
    phi = sol.sub(0)
    tip_tracker.update(phi, step=step, t=t, dt=dt)
    grad_phi = fem.Function(Vvec, name="grad_phi")
    ip_vec = Vvec.element.interpolation_points
    if Version(dolfinx.__version__) < Version("0.10.0"):
        ip_vec = ip_vec()
    grad_phi.interpolate(fem.Expression(ufl.grad(phi), ip_vec))

    # |∇phi|^2 (scalar)
    grad2 = fem.Function(Vscal, name="grad_phi_sq")
    ip_s = Vscal.element.interpolation_points
    if Version(dolfinx.__version__) < Version("0.10.0"):
        ip_s = ip_s()
    grad2.interpolate(fem.Expression(ufl.inner(ufl.grad(phi), ufl.grad(phi)), ip_s))
    sol0.x.array[:] = sol.x.array
    sol0.x.scatter_forward()
    # Write output
    if step % 150 == 0:
        phi_out = sol.sub(0)
        u_out   = sol.sub(1)
        file.write_function(phi_out, t)
        file.write_function(u_out, t)
        file.write_function(grad_phi,t)
        file.flush()
tip_tracker.close()
file.close()
if rank == 0 and f_log is not None:
    f_log.close()
if rank == 0:
    print(f"Simulation completed: {step} steps, final time t={t}")

You are not providing a lot of context as to how many processes you are trying to use, what runtime you get in serial, and what you get in parallel (and with how many processors).
For instance, running your code, in serial I get (running for T=1):

Runtime: 81.03974407200076

while for 2 processors:

Runtime: 54.94591035600024

and 4 processors:

Runtime: 38.92178050700022

You see that the speedup is not optimal, and that is likely because the problem only has 432112 DOFs, which is fairly small; the parallel overhead grows as each process has less and less work.
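For reference, a minimal sketch (assuming the `ME` from your script) to see how much work each rank actually has:

from mpi4py import MPI

# Minimal sketch: owned DOFs per rank of the mixed space ME
comm = MPI.COMM_WORLD
imap = ME.dofmap.index_map
bs = ME.dofmap.index_map_bs
print(f"rank {comm.rank}: {imap.size_local * bs} owned DOFs "
      f"of {imap.size_global * bs} global")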

Hi Jørgen,

Thank you very much for your explanation, and sorry for the lack of context in my initial post — you are absolutely right.

To clarify: I am running the code on a university HPC cluster with MPI. With 4 MPI processes and T = 270 (dt = 0.01 → about 27,000 time steps), the simulation takes roughly ~24 hours, while with 8 MPI processes it actually becomes slower (~27–28 hours). The mesh has around 1.8 million cells (P1–P1 mixed formulation).

After reading your explanation, I understand that the problem size may not be large enough to benefit from more MPI ranks, so the parallel overhead likely dominates. That makes sense.

I also wonder whether I might be doing something suboptimal regarding output: I write XDMF and CSV files periodically, and I also compute some additional quantities (gradients, tracking data) inside the time loop. Could this I/O or post-processing inside the loop significantly affect scalability? If so, I would appreciate any advice on best practices for output handling in parallel.

Thanks again for your help — your comment clarified the situation for me.

Best regards
Aris Giapoutzis

Could you check the load balance across processes, i.e. how many elements per process you see after refinement (mesh.topology.index_map(mesh.topology.dim).size_local)? refineMarkedElements appears not to redistribute the mesh, and you refine locally. That could imply an imbalance in cell distribution that grows with every refinement level.
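For reference, a minimal sketch of that check (assuming `domain` is the refined mesh from your script):

from mpi4py import MPI
import numpy as np

# Minimal sketch: owned-cell counts per rank after refinement
comm = MPI.COMM_WORLD
tdim = domain.topology.dim
n_local = domain.topology.index_map(tdim).size_local  # owned cells only
counts = comm.gather(n_local, root=0)
if comm.rank == 0:
    counts = np.asarray(counts)
    print("Local cells per rank:", counts.tolist())
    print("max/mean:", counts.max() / counts.mean())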

Thanks a lot for the feedback — sorry again for the lack of context earlier.

I checked the load balance after refinement as suggested. For 4 MPI ranks I get:

  • Local cells per rank: [107999, 108007, 108003, 108037]
  • max/mean ≈ 1.00024 (so distribution seems very well balanced)

I also ran a short timing test on my university HPC cluster for T=1 (dt=0.01 → 100 steps). With 4 MPI ranks I get:

  • Elapsed time ≈ 211 s for 100 steps @dokken

This is significantly larger than the timing you reported (~39 s for 4 processes), so I suspect my slowdown is due to environment differences, but I am not sure.

Thanks again

That’s perfectly distributed. For completeness: refineMarkedElements does in fact repartition (the None partitioner there does not mean that None is passed on to create_mesh, see ngsPETSc/ngsPETSc/utils/fenicsx.py at 27d59136ad905dd1512ae6de0e0656e86a21b77c · NGSolve/ngsPETSc · GitHub); I missed that before.

Hi Dokken
I did some additional tests on my university SLURM cluster and here are more complete details.

Environment

I am running via an sbatch script on 1 node and launching with:

mpirun -np N python Nonadaptive_Ph.py

Both python and mpirun come from the same conda environment (fenicsx-env).
mpirun reports MPICH (HYDRA build).

Load balance

After refinement I checked the element distribution:

Local cells per rank:
[107999, 108007, 108003, 108037]

max/mean ≈ 1.00024

So the mesh is extremely well balanced — imbalance does not seem to be the issue.

Timing results (T = 1, dt = 0.01 → 100 steps)

On the cluster:

  • 2 MPI ranks → ~129 s
  • 4 MPI ranks → ~211 s

So in my setup, 2 processes are actually faster than 4.

This suggests that beyond 2 ranks, communication and solver synchronization overhead dominate for this problem size.

For comparison, you reported ~39 s with 4 processes for T=1, so my runtime is significantly higher overall.

Thank you again

I simply ran your code on my Dell XPS 13 laptop (using DOLFINx through Docker). I’ll rerun it on my Dell XPS 15 tomorrow.

Thank you very much for your continued help — your guidance has been extremely useful.

I have now enabled timing both with MPI.Wtime() and with dolfinx.common.list_timings(). I ran the same case using 2 and 4 MPI processes and attached screenshots of the results.

The 4-core run is slower than the 2-core run, as I said (around ~254 s with 4 processes vs ~131 s with 2 processes for 100 steps).
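For reference, this is roughly how I collect the timings (a minimal sketch; the exact signature of list_timings varies slightly across DOLFINx versions):

from mpi4py import MPI
import dolfinx.common

# Minimal timing sketch: wall time of the slowest rank plus DOLFINx timers
comm = MPI.COMM_WORLD
t0 = MPI.Wtime()
# ... time-stepping loop ...
elapsed = comm.allreduce(MPI.Wtime() - t0, op=MPI.MAX)  # slowest rank
if comm.rank == 0:
    print(f"Elapsed: {elapsed:.2f} s")
dolfinx.common.list_timings(comm)  # DOLFINx internal timers (collective)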

For completeness, I am also including the full list of packages from my conda environment in case there could be any dependency-related issue.

_openmp_mutex                4.5              7_kmp_llvm             conda-forge
_python_abi3_support         1.0              hd8ed1ab_2             conda-forge
_x86_64-microarch-level      3                3_zen2                 conda-forge
aiohappyeyeballs             2.6.1            pyhd8ed1ab_0           conda-forge
aiohttp                      3.13.3           py311h55b9665_0        conda-forge
aiosignal                    1.4.0            pyhd8ed1ab_0           conda-forge
alsa-lib                     1.2.15.3         hb03c661_0             conda-forge
attr                         2.5.2            h39aace5_0             conda-forge
attrs                        25.4.0           pyhcf101f3_1           conda-forge
backports.zstd               1.3.0            py311h6b1f9c4_0        conda-forge
binutils_impl_linux-64       2.45             default_hfdba357_105   conda-forge
binutils_linux-64            2.45             default_h4852527_105   conda-forge
blosc                        1.21.6           he440d0b_1             conda-forge
brotli                       1.2.0            hed03a55_1             conda-forge
brotli-bin                   1.2.0            hb03c661_1             conda-forge
brotli-python                1.2.0            py311h66f275b_1        conda-forge
bzip2                        1.0.8            hda65f42_8             conda-forge
c-ares                       1.34.6           hb03c661_0             conda-forge
c-blosc2                     2.22.0           hc31b594_1             conda-forge
ca-certificates              2026.1.4         hbd8a1cb_0             conda-forge
cairo                        1.18.4           h3394656_0             conda-forge
certifi                      2026.1.4         pyhd8ed1ab_0           conda-forge
cffi                         2.0.0            py311h03d9500_1        conda-forge
charset-normalizer           3.4.4            pyhd8ed1ab_0           conda-forge
cli11                        2.6.0            h54a6638_0             conda-forge
contourpy                    1.3.3            py311hdf67eae_3        conda-forge
cpython                      3.11.14          py311hd8ed1ab_2        conda-forge
cycler                       0.12.1           pyhcf101f3_2           conda-forge
cyrus-sasl                   2.1.28           hd9c7081_0             conda-forge
dbus                         1.16.2           h24cb091_1             conda-forge
double-conversion            3.3.1            h5888daf_0             conda-forge
fenics-basix                 0.10.0           py311habd6e24_0        conda-forge
fenics-basix-nanobind-abi    2.16             py310h06e8995_0        conda-forge
fenics-dolfinx               0.10.0           py311h738bb83_101      conda-forge
fenics-ffcx                  0.10.1           pyh1e3ed43_0           conda-forge
fenics-libbasix              0.10.0           py311h8a41e2e_0        conda-forge
fenics-libdolfinx            0.10.0           py311h7b9e3d9_101      conda-forge
fenics-ufcx                  0.10.1           haec879b_0             conda-forge
fenics-ufl                   2025.2.1         pyhcf101f3_0           conda-forge
fftw                         3.3.10           mpi_mpich_h084ba78_11  conda-forge
fmt                          11.2.0           h07f6e7f_0             conda-forge
font-ttf-dejavu-sans-mono    2.37             hab24e00_0             conda-forge
font-ttf-inconsolata         3.000            h77eed37_0             conda-forge
font-ttf-source-code-pro     2.038            h77eed37_0             conda-forge
font-ttf-ubuntu              0.83             h77eed37_3             conda-forge
fontconfig                   2.15.0           h7e30c49_1             conda-forge
fonts-conda-ecosystem        1                0                      conda-forge
fonts-conda-forge            1                hc364b38_1             conda-forge
fonttools                    4.61.1           py311h3778330_0        conda-forge
freetype                     2.14.1           ha770c72_0             conda-forge
frozenlist                   1.7.0            py311h52bc045_0        conda-forge
gcc_impl_linux-64            14.3.0           he8b2097_16            conda-forge
gcc_linux-64                 14.3.0           h298d278_17            conda-forge
gl2ps                        1.4.2            hae5d5c5_1             conda-forge
graphite2                    1.3.14           hecca717_2             conda-forge
h2                           4.3.0            pyhcf101f3_0           conda-forge
harfbuzz                     12.2.0           h15599e2_0             conda-forge
hdf4                         4.2.15           h2a13503_7             conda-forge
hdf5                         1.14.6           mpi_mpich_h02efad0_5   conda-forge
hpack                        4.1.0            pyhd8ed1ab_0           conda-forge
hyperframe                   6.1.0            pyhd8ed1ab_0           conda-forge
hypre                        2.32.0           mpi_mpich_h2e71eac_1   conda-forge
icu                          75.1             he02047a_0             conda-forge
idna                         3.11             pyhd8ed1ab_0           conda-forge
intel-cmplr-lib-ur           2025.3.1         pypi_0                 pypi
jsoncpp                      1.9.6            hf42df4d_1             conda-forge
kahip                        3.19             h0ca329d_1             conda-forge
kernel-headers_linux-64      5.14.0           he073ed8_3             conda-forge
keyutils                     1.6.3            hb9d3cd8_0             conda-forge
kiwisolver                   1.4.9            py311h724c32c_2        conda-forge
krb5                         1.21.3           h659f571_0             conda-forge
lcms2                        2.18             h0c24ade_0             conda-forge
ld_impl_linux-64             2.45             default_hbd61a6d_105   conda-forge
lerc                         4.0.0            h0aef613_1             conda-forge
libadios2                    2.10.2           mpi_mpich_h17ec5a4_14  conda-forge
libaec                       1.1.4            h3f801dc_0             conda-forge
libamd                       3.3.3            haaf9dc3_7100102       conda-forge
libblas                      3.11.0           5_h4a7cf45_openblas    conda-forge
libboost                     1.88.0           hed09d94_6             conda-forge
libboost-devel               1.88.0           hfcd1e18_6             conda-forge
libboost-headers             1.88.0           ha770c72_6             conda-forge
libbrotlicommon              1.2.0            hb03c661_1             conda-forge
libbrotlidec                 1.2.0            hb03c661_1             conda-forge
libbrotlienc                 1.2.0            hb03c661_1             conda-forge
libbtf                       2.3.2            h32481e8_7100102       conda-forge
libcamd                      3.3.3            h32481e8_7100102       conda-forge
libcap                       2.77             h3ff7636_0             conda-forge
libcblas                     3.11.0           5_h0358290_openblas    conda-forge
libccolamd                   3.3.4            h32481e8_7100102       conda-forge
libcholmod                   5.3.1            h59ddab4_7100102       conda-forge
libclang-cpp21.1             21.1.8           default_h99862b1_1     conda-forge
libclang13                   21.1.8           default_h746c552_1     conda-forge
libcolamd                    3.3.4            h32481e8_7100102       conda-forge
libcups                      2.3.3            hb8b1518_5             conda-forge
libcurl                      8.18.0           h4e3cde8_0             conda-forge
libdeflate                   1.25             h17f619e_0             conda-forge
libdrm                       2.4.125          hb03c661_1             conda-forge
libedit                      3.1.20250104     pl5321h7949ede_0       conda-forge
libegl                       1.7.0            ha4b6fd6_2             conda-forge
libev                        4.33             hd590300_2             conda-forge
libexpat                     2.7.3            hecca717_0             conda-forge
libfabric                    2.4.0            ha770c72_0             conda-forge
libfabric1                   2.4.0            h6c8fc0a_0             conda-forge
libffi                       3.5.2            h9ec8514_0             conda-forge
libfreetype                  2.14.1           ha770c72_0             conda-forge
libfreetype6                 2.14.1           h73754d4_0             conda-forge
libgcc                       15.2.0           he0feb66_16            conda-forge
libgcc-devel_linux-64        14.3.0           hf649bbc_116           conda-forge
libgcc-ng                    15.2.0           h69a702a_16            conda-forge
libgfortran                  15.2.0           h69a702a_16            conda-forge
libgfortran5                 15.2.0           h68bc16d_16            conda-forge
libgl                        1.7.0            ha4b6fd6_2             conda-forge
libglib                      2.86.2           h32235b2_0             conda-forge
libglu                       9.0.3            h5888daf_1             conda-forge
libglvnd                     1.7.0            ha4b6fd6_2             conda-forge
libglx                       1.7.0            ha4b6fd6_2             conda-forge
libgomp                      15.2.0           he0feb66_16            conda-forge
libhwloc                     2.12.2           default_hafda6a7_1000  conda-forge
libiconv                     1.18             h3b78370_2             conda-forge
libjpeg-turbo                3.1.2            hb03c661_0             conda-forge
libklu                       2.3.5            hf24d653_7100102       conda-forge
liblapack                    3.11.0           5_h47877c9_openblas    conda-forge
libllvm21                    21.1.8           hf7376ad_0             conda-forge
liblzma                      5.8.2            hb03c661_0             conda-forge
libnetcdf                    4.9.3            nompi_h11f7409_103     conda-forge
libnghttp2                   1.67.0           had1ee68_0             conda-forge
libnl                        3.11.0           hb9d3cd8_0             conda-forge
libnsl                       2.0.1            hb9d3cd8_1             conda-forge
libntlm                      1.8              hb9d3cd8_0             conda-forge
libogg                       1.3.5            hd0c01bc_1             conda-forge
libopenblas                  0.3.30           openmp_hd680484_4      conda-forge
libopengl                    1.7.0            ha4b6fd6_2             conda-forge
libpciaccess                 0.18             hb9d3cd8_0             conda-forge
libpng                       1.6.54           h421ea60_0             conda-forge
libpq                        18.1             h5c52fec_2             conda-forge
libptscotch                  7.0.10           int32_ha171f39_2       conda-forge
libsanitizer                 14.3.0           h8f1669f_16            conda-forge
libscotch                    7.0.10           int32_h8512f2c_2       conda-forge
libsodium                    1.0.20           h4ab18f5_0             conda-forge
libspqr                      4.3.4            h852d39f_7100102       conda-forge
libsqlite                    3.51.2           h0c1763c_0             conda-forge
libssh2                      1.11.1           hcf80075_0             conda-forge
libstdcxx                    15.2.0           h934c35e_16            conda-forge
libstdcxx-devel_linux-64     14.3.0           h9f08a49_116           conda-forge
libstdcxx-ng                 15.2.0           hdf11a46_16            conda-forge
libsuitesparseconfig         7.10.1           h92d6892_7100102       conda-forge
libsystemd0                  258.3            h6569c3e_0             conda-forge
libtheora                    1.1.1            h4ab18f5_1006          conda-forge
libtiff                      4.7.1            h9d88235_1             conda-forge
libudev1                     258.3            h6569c3e_0             conda-forge
libumfpack                   6.3.5            heb53515_7100102       conda-forge
libuuid                      2.41.3           h5347b49_0             conda-forge
libvorbis                    1.3.7            h54a6638_2             conda-forge
libvulkan-loader             1.4.328.1        h5279c79_0             conda-forge
libwebp-base                 1.6.0            hd42ef1d_0             conda-forge
libxcb                       1.17.0           h8a09558_0             conda-forge
libxcrypt                    4.4.36           hd590300_1             conda-forge
libxkbcommon                 1.13.1           hca5e8e5_0             conda-forge
libxml2                      2.15.1           h26afc86_0             conda-forge
libxml2-16                   2.15.1           ha9997c6_0             conda-forge
libzip                       1.11.2           h6991a6a_0             conda-forge
libzlib                      1.3.1            hb9d3cd8_2             conda-forge
llvm-openmp                  21.1.8           h4922eb0_0             conda-forge
loguru                       0.7.3            pyh707e725_0           conda-forge
lz4-c                        1.10.0           h5888daf_1             conda-forge
matplotlib-base              3.10.8           py311h0f3be63_0        conda-forge
metis                        5.1.0            hd0bcaf9_1007          conda-forge
mpi                          1.0.1            mpich                  conda-forge
mpi4py                       4.1.1            py311h2a725c7_102      conda-forge
mpich                        4.3.2            h8269dce_102           conda-forge
msgpack-python               1.1.2            py311hdf67eae_1        conda-forge
multidict                    6.7.0            py311h3778330_0        conda-forge
mumps-include                5.8.1            h580308f_4             conda-forge
mumps-mpi                    5.8.1            h2a0ffd2_4             conda-forge
munkres                      1.1.4            pyhd8ed1ab_1           conda-forge
nanobind-abi                 16               hd8ed1ab_1             conda-forge
ncurses                      6.5              h2d0b736_3             conda-forge
netgen-mesher                6.2.2601         pypi_0                 pypi
netgen-occt                  7.8.1            pypi_0                 pypi
ngsolve                      6.2.2601         pypi_0                 pypi
ngspetsc                     0.1.1            pypi_0                 pypi
nlohmann_json                3.12.0           h54a6638_1             conda-forge
numpy                        2.4.1            py311h2e04523_0        conda-forge
openjpeg                     2.5.4            h55fea9a_0             conda-forge
openldap                     2.6.10           he970967_0             conda-forge
openssl                      3.6.0            h26f9b46_0             conda-forge
packaging                    25.0             pyh29332c3_1           conda-forge
parmetis                     4.0.3            hc7bef4e_1007          conda-forge
pcre2                        10.46            h1321c63_0             conda-forge
petsc                        3.24.3           real_h383672b_1        conda-forge
petsc4py                     3.24.3           np2py310h4c8d486_0     conda-forge
pillow                       12.1.0           py311hf88fc01_0        conda-forge
pip                          25.3             pyh8b19718_0           conda-forge
pixman                       0.46.4           h54a6638_1             conda-forge
pkg-config                   0.29.2           h4bc722e_1009          conda-forge
platformdirs                 4.5.1            pyhcf101f3_0           conda-forge
pooch                        1.8.2            pyhd8ed1ab_3           conda-forge
proj                         9.7.1            h99ae125_0             conda-forge
propcache                    0.3.1            py311h2dc5d0c_0        conda-forge
pthread-stubs                0.4              hb9d3cd8_1002          conda-forge
pugixml                      1.15             h3f63f65_0             conda-forge
pycparser                    2.22             pyh29332c3_1           conda-forge
pyparsing                    3.3.1            pyhcf101f3_0           conda-forge
pysocks                      1.7.1            pyha55dd90_7           conda-forge
python                       3.11.14          hd63d673_2_cpython     conda-forge
python-dateutil              2.9.0.post0      pyhe01879c_2           conda-forge
python-gil                   3.11.14          hd8ed1ab_2             conda-forge
python_abi                   3.11             8_cp311                conda-forge
pyvista                      0.46.5           pyhd8ed1ab_0           conda-forge
qhull                        2020.2           h434a139_5             conda-forge
qt6-main                     6.9.3            h5c1c036_1             conda-forge
rdma-core                    60.0             hecca717_0             conda-forge
readline                     8.3              h853b02a_0             conda-forge
requests                     2.32.5           pyhcf101f3_1           conda-forge
scalapack                    2.2.0            h6b1b6a3_5             conda-forge
scipy                        1.17.0           pypi_0                 pypi
scooby                       0.11.0           pyhd8ed1ab_0           conda-forge
setuptools                   80.9.0           pyhff2d567_0           conda-forge
six                          1.17.0           pyhe01879c_1           conda-forge
slepc                        3.24.1           real_h25784b2_0        conda-forge
slepc4py                     3.24.1           np2py310h004e9ba_0     conda-forge
snappy                       1.2.2            h03e3b7b_1             conda-forge
spdlog                       1.15.3           h6dc744f_1             conda-forge
sqlite                       3.51.2           hbc0de68_0             conda-forge
superlu                      7.0.1            h8f6e6c4_0             conda-forge
superlu_dist                 9.1.0            h0804ebd_0             conda-forge
sysroot_linux-64             2.34             h087de78_3             conda-forge
tbb                          2022.3.0         pypi_0                 pypi
tcmlib                       1.4.1            pypi_0                 pypi
tk                           8.6.13           noxft_ha0e22de_103     conda-forge
typing-extensions            4.15.0           h396c80c_0             conda-forge
typing_extensions            4.15.0           pyhcf101f3_0           conda-forge
tzdata                       2025c            hc9c84f9_1             conda-forge
ucx                          1.19.1           h567e125_0             conda-forge
umf                          1.0.2            pypi_0                 pypi
unicodedata2                 17.0.0           py311h49ec1c0_1        conda-forge
urllib3                      2.6.3            pyhd8ed1ab_0           conda-forge
utfcpp                       4.09             ha770c72_0             conda-forge
vtk-base                     9.5.2            py311h5228f16_1        conda-forge
wayland                      1.24.0           hd6090a7_1             conda-forge
wheel                        0.45.1           pyhd8ed1ab_1           conda-forge
wslink                       2.5.0            pyhd8ed1ab_0           conda-forge
xcb-util                     0.4.1            h4f16b4b_2             conda-forge
xcb-util-cursor              0.1.6            hb03c661_0             conda-forge
xcb-util-image               0.4.0            hb711507_2             conda-forge
xcb-util-keysyms             0.4.1            hb711507_0             conda-forge
xcb-util-renderutil          0.3.10           hb711507_0             conda-forge
xcb-util-wm                  0.4.2            hb711507_0             conda-forge
xkeyboard-config             2.46             hb03c661_0             conda-forge
xorg-libice                  1.1.2            hb9d3cd8_0             conda-forge
xorg-libsm                   1.2.6            he73a12e_0             conda-forge
xorg-libx11                  1.8.12           h4f16b4b_0             conda-forge
xorg-libxau                  1.0.12           hb03c661_1             conda-forge
xorg-libxcomposite           0.4.6            hb9d3cd8_2             conda-forge
xorg-libxcursor              1.2.3            hb9d3cd8_0             conda-forge
xorg-libxdamage              1.1.6            hb9d3cd8_0             conda-forge
xorg-libxdmcp                1.1.5            hb03c661_1             conda-forge
xorg-libxext                 1.3.6            hb9d3cd8_0             conda-forge
xorg-libxfixes               6.0.2            hb03c661_0             conda-forge
xorg-libxi                   1.8.2            hb9d3cd8_0             conda-forge
xorg-libxrandr               1.5.4            hb9d3cd8_0             conda-forge
xorg-libxrender              0.9.12           hb9d3cd8_0             conda-forge
xorg-libxtst                 1.2.5            hb9d3cd8_3             conda-forge
xorg-libxxf86vm              1.1.6            hb9d3cd8_0             conda-forge
yaml                         0.2.5            h280c20c_3             conda-forge
yarl                         1.22.0           py311h3778330_0        conda-forge
zeromq                       4.3.5            h387f397_9             conda-forge
zfp                          1.0.1            h909a3a2_5             conda-forge
zlib                         1.3.1            hb9d3cd8_2             conda-forge
zlib-ng                      2.3.2            hceb46e0_1             conda-forge
zstd                         1.5.7            hb78ec9c_6             conda-forge

Thank you again for your time and support.

A few points here:

  1. Using conda on HPC systems is not recommended, as you are not using builds optimized for your machine.
  2. The timings you show don't include any of the expensive commands in your code.

If you modify your code so that you do not allocate new data (or search through caches) within your loop, you have:

ip_vec = Vvec.element.interpolation_points
if Version(dolfinx.__version__) < Version("0.10.0"):
    ip_vec = ip_vec()
phi = sol.sub(0)
gphi_expr = fem.Expression(ufl.grad(phi), ip_vec)
grad_phi = fem.Function(Vvec, name="grad_phi")
ip_s = Vscal.element.interpolation_points
if Version(dolfinx.__version__) < Version("0.10.0"):
    ip_s = ip_s()
grad2 = fem.Function(Vscal, name="grad_phi_sq")
grad2_expr = fem.Expression(ufl.inner(ufl.grad(phi), ufl.grad(phi)), ip_s)
import time

solve_start = time.perf_counter()
snes_time = 0
while t < T:
    t += dt
    step += 1
    start = time.perf_counter()
    problem.solve()
    end = time.perf_counter()
    snes_time += end - start
    print(t, end - start)
    sol.x.scatter_forward()
    # diagnostics
    converged_reason = problem.solver.getConvergedReason()
    num_iterations = problem.solver.getIterationNumber()
    snes = problem.solver
    ksp = snes.getKSP()

    snes_reason = snes.getConvergedReason()
    snes_its = snes.getIterationNumber()

    # SNES residual norm
    snes_norm = snes.getFunctionNorm()

    # KSP stats (last linear solve)
    ksp_its = ksp.getIterationNumber()
    ksp_rnorm = ksp.getResidualNorm()

    # write the log line inside the loop
    if rank == 0:
        f_log.write(
            f"{step},{t},{snes_reason},{snes_its},{snes_norm},{ksp_its},{ksp_rnorm}\n"
        )
    # 1) gradient vector
    tip_tracker.update(phi, step=step, t=t, dt=dt)
    grad_phi.interpolate(gphi_expr)

    # |∇phi|^2 (scalar)
    grad2.interpolate(grad2_expr)
    sol0.x.array[:] = sol.x.array
    sol0.x.scatter_forward()
    # Write output
    if step % 150 == 0:
        phi_out = sol.sub(0)
        u_out = sol.sub(1)
        file.write_function(phi_out, t)
        file.write_function(u_out, t)
        file.write_function(grad_phi, t)
        file.flush()
solve_end = time.perf_counter()
print(f"All time steps {step} in {solve_end - solve_start} (SNES {snes_time})")

and with the PETSc options

petsc_options = {
    "snes_type": "newtonls",
    "snes_error_if_not_converged": True,
    "snes_linesearch_type": "basic",
    "snes_atol": 1e-6,
    "snes_rtol": 1e-6,
    # "snes_monitor": None,
    # "ksp_monitor": None,
    "ksp_error_if_not_converged": True,
    "ksp_type": "gmres",
    "ksp_rtol": 1e-8,
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
    "pc_hypre_boomeramg_max_iter": 1,
    "pc_hypre_boomeramg_cycle_type": "v",
}

you’ll see that each PETSc SNES solve takes roughly ~1 second (per time step),
while the whole temporal loop (for 100 steps, 1 process) is:

All time steps 100 in 108.42281524699865 (SNES 97.35913112396156)

meaning that ~90% of the runtime is spent within solve (including assembly).

while with two processes

All time steps 100 in 68.69373398499738 (SNES 61.60968295697603)

Further analysis of how to speed up your code (without tuning the solver) can be done by using optimized jit_options for the assembly (see: The FEniCSx tutorial — FEniCSx tutorial),
which amounts to adding:

jit_options = {
    "cffi_extra_compile_args": ["-march=native", "-O3"],
    "cffi_libraries": ["m"],
}
# 4) Problem
problem = NonlinearProblem(
    F,
    sol,
    petsc_options_prefix=petsc_prefix,
    petsc_options=petsc_options,
    jit_options=jit_options,
)

(you can also add these options to dolfinx.fem.Expression).
On my laptop this had minimal effect (serial):

All time steps 100 in 101.9859383059993 (SNES 91.30520566801351)

Furthermore, you can consider performing a “variational crime” of pinning the quadrature degree:

i.e.

    dx = ufl.Measure("dx", domain=domain, metadata={})
    F1 = (
        ufl.inner(u, v) * dx
        - ufl.inner(u0, v) * dx
        + dt * D * ufl.inner(ufl.grad(u), ufl.grad(v)) * dx
        - 0.5 * ufl.inner(phi, v) * dx
        + 0.5 * ufl.inner(phi0, v) * dx
    )

    F2 = (
        ufl.inner(tau(phi0) * phi, q) * dx
        - ufl.inner(tau(phi0) * phi0, q) * dx
        + dt * ufl.inner(W(phi0) ** 2 * ufl.grad(phi), ufl.grad(q)) * dx
        - dt * ufl.inner(f(phi0, u0), q) * dx
        + dt * ufl.inner(W(phi0) * omega(phi0), ufl.grad(q)) * dx
    )

    F = F1 + F2

    pulled_back_F = ufl.algorithms.compute_form_data(
        F,
        do_apply_function_pullbacks=True,
        do_apply_integral_scaling=True,
        do_apply_geometry_lowering=True,
        preserve_geometry_types=(ufl.classes.Jacobian,),
    )
    print("F", list(itg.metadata() for itg in pulled_back_F.integral_data[0].integrals))
    J = ufl.derivative(F, sol)
    pulled_back_J = ufl.algorithms.compute_form_data(
        J,
        do_apply_function_pullbacks=True,
        do_apply_integral_scaling=True,
        do_apply_geometry_lowering=True,
        preserve_geometry_types=(ufl.classes.Jacobian,),
    )
    print("J", list(itg.metadata() for itg in pulled_back_J.integral_data[0].integrals))

yielding

F [{'estimated_polynomial_degree': 6}]
J [{'estimated_polynomial_degree': 2}]

which indicates that there is not that much to save from modifying the quadrature (at least for the Jacobian).
You could do it for the residual with

    F = ufl.form.Form(
        [itg.reconstruct(metadata={"quadrature_degree": 3}) for itg in F.integrals()]
    )

(with this I saw very little speedup/effect).
Thus I would say that the way to get the most speedup is by tuning the non-linear solver parameters.
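As a starting point, these are the kinds of knobs one might experiment with (a hedged sketch, not a tested recipe; the lagging and BoomerAMG values are assumptions to be tuned per problem):

# Hedged sketch: solver knobs to experiment with, not a tested recipe
petsc_options = {
    "snes_type": "newtonls",
    "snes_linesearch_type": "basic",
    "snes_atol": 1e-6,
    "snes_rtol": 1e-6,
    # Rebuild the Jacobian/preconditioner only every other Newton step
    # (cheaper per step, possibly at the cost of extra Newton iterations):
    "snes_lag_jacobian": 2,
    "snes_lag_preconditioner": 2,
    "ksp_type": "gmres",
    "ksp_rtol": 1e-8,
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
    # BoomerAMG settings that often matter for 2D problems:
    "pc_hypre_boomeramg_strong_threshold": 0.25,
    "pc_hypre_boomeramg_coarsen_type": "HMIS",
    "pc_hypre_boomeramg_interp_type": "ext+i",
}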


Furthermore, you can replace the mixed_element with blocked_element, i.e.

from basix.ufl import element, blocked_element
ME = fem.functionspace(domain, blocked_element(P1, shape=(2,)))

which means that you can use the VTXWriter, i.e.

vtx = dolfinx.io.VTXWriter(
    domain.comm, "u.bp", [sol], mesh_policy=dolfinx.io.VTXMeshPolicy.reuse
)

and then comparing runtimes for storing data:

    start_xdmf = time.perf_counter()
    phi_out = sol.sub(0)
    u_out = sol.sub(1)
    file.write_function(phi_out, t)
    file.write_function(u_out, t)
    # file.write_function(grad_phi, t)
    file.flush()
    end_xdmf = time.perf_counter()
    start_vtx = time.perf_counter()
    vtx.write(t)
    end_vtx = time.perf_counter()
    print((end_xdmf - start_xdmf) / (end_vtx - start_vtx))

I at least get a ratio around 2, i.e. it takes half the time to store it with VTX (roughly 0.01 seconds per time step, so it is not really a lot of time saved).


Thank you so much for taking the time to look into this and for the detailed suggestions — they’ve helped a lot, and I now have a clearer direction on how to proceed.

I have one last question about the timing discrepancy: I ran the same code with 4 cores both on our HPC (1 node) and on my own laptop, and I consistently get ~60 s for T=1, whereas you reported ~30 s on your Dell XPS (running dolfinx through Docker). Do you think this could plausibly indicate an installation/build issue on my side, or is it more likely just hardware/build differences (CPU single-core performance, compilation flags, filesystem, etc.)?

For context, in my setup some components (e.g. netgen/ngsolve) were installed via pip in the past — could something like that realistically affect the solve time, or is it unrelated since most of the runtime is in PETScSNES/KSP/assembly?

Thanks again for all your help.

Best regards,
Aris

I think how you’ve set up your code (i.e. DOLFINx) on the HPC system could be problematic and causing a slowdown.
I’m not running on a bleeding-edge computer (my Dell XPS 15 is 8 years old), so you should see better performance on an HPC system than on my laptop.

Using conda on hpc is in general not recommended, as conda pre-builds binaries that are downloaded to the system, rather than building them from scratch using the HPC system’s compilers and MPI.

See for instance: GitHub - FEniCS/spack-fenics: Spack package repository overlay for FEniCS
on how to install DOLFINx with Spack (using an overlay repo).
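As a quick sanity check of which stack a given environment actually loads (a minimal sketch):

from mpi4py import MPI
from petsc4py import PETSc
import dolfinx

# Minimal sketch: report the versions the current environment loads
if MPI.COMM_WORLD.rank == 0:
    print("DOLFINx:", dolfinx.__version__)
    print("PETSc:", PETSc.Sys.getVersion())
    print("MPI:", MPI.Get_library_version().splitlines()[0])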

I find ngspetsc/netgen-mesher hard to install in any other way than with pip.
What I would recommend is to split your code in two:

  1. Mesh generation with netgen (store the mesh and tags to an .xdmf file through dolfinx.io.XDMFFile).
  2. Read this file into your main script; you will then get the best load balancing, and you won't have to re-generate the mesh every time you solve the problem. A sketch of both steps is given below.
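A minimal sketch of that split (the file and tag names here are assumptions):

from mpi4py import MPI
from dolfinx.io import XDMFFile

# Script 1: after netgen generation + refinement, store the mesh once
ct.name = "cell_tags"
with XDMFFile(MPI.COMM_WORLD, "refined_mesh.xdmf", "w") as xdmf:
    xdmf.write_mesh(domain)
    xdmf.write_meshtags(ct, domain.geometry)

# Script 2: read it back; the mesh is freshly partitioned on read
with XDMFFile(MPI.COMM_WORLD, "refined_mesh.xdmf", "r") as xdmf:
    domain = xdmf.read_mesh(name="mesh")
    ct = xdmf.read_meshtags(domain, name="cell_tags")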

Thank you again for all your input — I really appreciate it.

On my side I also get ~60 seconds for T=1 on my own laptop (an ASUS TUF Gaming machine 2021), so it’s not only the HPC system that is slower. That’s why I’m trying to understand what might differ between our setups, since you’re seeing ~30 seconds on your XPS through Docker.

I’ll go through my setup carefully once more and double-check solver options, DOFs, and output frequency to make sure everything is truly identical. In parallel, I’ll also try installing DOLFINx via Spack on the HPC system to see whether a fully optimized build makes a noticeable difference.

Thanks again for all your guidance — it has been very helpful in clarifying how to proceed.

Best regards,
Aris