Inefficient MixedElement Hyperelastic solver

Hey Team,

Seeking advice on code efficiency.

I am attempting to solve a hyperelastic mixed-element problem for a series of applied displacements (load steps). However, even with this very simple MWE the code is quite slow.

MWE:

# ∆ Dolfin
import numpy as np
import ufl
from mpi4py import MPI
from basix.ufl import element, mixed_element
from dolfinx import log, io, default_scalar_type, mesh
from dolfinx.fem import Function, functionspace, dirichletbc, locate_dofs_topological
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver
from ufl import TestFunctions, split

# ∆ Parameters
DIM = 3
ORDER = 2
TOL = 1e-5
QUAD_DEGREE = 4

# ∆ Mesh
domain = mesh.create_unit_cube(MPI.COMM_WORLD, 10, 10, 10)
domain.name = "mesh"

# ∆ Function space
P2 = element("Lagrange", domain.basix_cell(), ORDER, shape=(domain.geometry.dim,))
P1 = element("Lagrange", domain.basix_cell(), ORDER-1)
Mxs = functionspace(mesh=domain, element=mixed_element([P2, P1]))
Tes = functionspace(mesh=domain, element=("Lagrange", ORDER, (DIM, DIM)))

# ∆ Define subdomains
V, _ = Mxs.sub(0).collapse()

# ∆ Trial & Test
u_p = Function(Mxs)
v, q = TestFunctions(Mxs)
u, p = split(u_p)

# ∆ Kinematics
I = ufl.variable(ufl.Identity(DIM))
F = ufl.variable(I + ufl.grad(u))
C = ufl.variable(F.T * F)
Ic = ufl.variable(ufl.tr(C))
IIc = ufl.variable((Ic**2 - ufl.inner(C, C)) / 2)
J = ufl.variable(ufl.det(F))
psi = 1 * (Ic - 3) + 0 * (IIc - 3)                                  # strain-energy density (coefficients 1 and 0 here)
gamma1 = ufl.diff(psi, Ic) + Ic * ufl.diff(psi, IIc)
gamma2 = -ufl.diff(psi, IIc)
firstPK = 2 * F * (gamma1 * I + gamma2 * C) + p * J * ufl.inv(F).T  # first Piola-Kirchhoff stress (with pressure term)
cau = (1 / J * firstPK * F).T                                       # not used in the residual below

# ∆ Residual
dx = ufl.Measure(integral_type="dx", domain=domain, metadata={"quadrature_degree": QUAD_DEGREE})
R = ufl.inner(ufl.grad(v), firstPK) * dx + q * (J - 1) * dx

# ∆ Solver
problem = NonlinearProblem(R, u_p, [])
solver = NewtonSolver(MPI.COMM_WORLD, problem)
solver.atol = TOL
solver.rtol = TOL
solver.max_it = 50
solver.convergence_criterion = "incremental"

# ∆ Output
displacement = Function(V)
displacement.name = "Displacement"
xdmf = io.VTXWriter(domain.comm, "disp.bp", displacement, engine="BP4")

X, Y, Z = 0, 1, 2

def left(x):
    return np.isclose(x[0], 0)

def right(x):
    return np.isclose(x[0], 1.0)  # right face of the unit cube; max(x[0]) is only the local maximum when run in parallel

fdim = domain.topology.dim - 1
left_facets = mesh.locate_entities_boundary(domain, fdim, left)
right_facets = mesh.locate_entities_boundary(domain, fdim, right)
marked_facets = np.hstack([left_facets, right_facets])
marked_values = np.hstack([np.full_like(left_facets, 1), np.full_like(right_facets, 2)])
sorted_facets = np.argsort(marked_facets)
facet_tag = mesh.meshtags(domain, fdim, marked_facets[sorted_facets], marked_values[sorted_facets])

# ∆ Setup boundary terms
xx0 = locate_dofs_topological(Mxs.sub(0).sub(X), domain.topology.dim - 1, facet_tag.find(1))
xx1 = locate_dofs_topological(Mxs.sub(0).sub(X), domain.topology.dim - 1, facet_tag.find(2))
yx0 = locate_dofs_topological(Mxs.sub(0).sub(Y), domain.topology.dim - 1, facet_tag.find(1))
yx1 = locate_dofs_topological(Mxs.sub(0).sub(Y), domain.topology.dim - 1, facet_tag.find(2))
zx0 = locate_dofs_topological(Mxs.sub(0).sub(Z), domain.topology.dim - 1, facet_tag.find(1))
zx1 = locate_dofs_topological(Mxs.sub(0).sub(Z), domain.topology.dim - 1, facet_tag.find(2))

# ∆ Set boundaries
d_yy0 = dirichletbc(default_scalar_type(0.0), yx0, Mxs.sub(0).sub(Y))
d_yy1 = dirichletbc(default_scalar_type(0.0), yx1, Mxs.sub(0).sub(Y))
d_zy0 = dirichletbc(default_scalar_type(0.0), zx0, Mxs.sub(0).sub(Z))
d_zy1 = dirichletbc(default_scalar_type(0.0), zx1, Mxs.sub(0).sub(Z))

log.set_log_level(log.LogLevel.INFO)

# ∆ Iterate
for ii, kk in enumerate([0, 5, 10, 15, 20]):

    # ∆ Apply displacements as boundary conditions
    du = kk / 100
    d_xy0 = dirichletbc(default_scalar_type(du), xx0, Mxs.sub(0).sub(X))
    d_xy1 = dirichletbc(default_scalar_type(0.0), xx1, Mxs.sub(0).sub(X))
    bc = [d_xy0, d_yy0, d_zy0, d_xy1, d_yy1, d_zy1]
    problem.bcs = bc

    # ∆ Solve
    try:
        n_iter, conv = solver.solve(u_p)
        print(f"Solved {kk}%: iterations={n_iter}, residual={conv}")
    except RuntimeError:
        print(f"Failed to converge at {kk}%")
        break

    # ∆ Store output
    displacement.interpolate(u_p.sub(0).collapse())
    xdmf.write(ii)

xdmf.close()

I have access to quite a powerful HPC cluster for the real problem, but solve times have exceeded 24 hours. I assume there is some level of computational ignorance on my end.

Appreciate any advice :slight_smile:


Versions:

fenics-basix              0.9.0           py313hd7981f8_2    conda-forge
fenics-basix-nanobind-abi 0.2.1.18             h9b63b7c_2    conda-forge
fenics-dolfinx            0.9.0           py313h12399d7_114    conda-forge
fenics-ffcx               0.9.0              pyh2e48890_0    conda-forge
fenics-libbasix           0.9.0                hf7ae0cd_2    conda-forge
fenics-libdolfinx         0.9.0           py313hd0b34b2_114    conda-forge
fenics-ufcx               0.9.0                hb7f7608_0    conda-forge
fenics-ufl                2024.2.0           pyhd8ed1ab_1    conda-forge

A snippet of the more complicated variational form (non-MWE) is provided below, if you're interested:

# ∆ Load mesh data and set up function spaces
    domain, _, ft = io.gmshio.read_from_msh(filename=file, comm=MPI.COMM_WORLD, rank=0, gdim=DIM)
    P2 = element("Lagrange", domain.basix_cell(), ORDER, shape=(domain.geometry.dim,))
    P1 = element("Lagrange", domain.basix_cell(), ORDER-1)
    Mxs = functionspace(mesh=domain, element=mixed_element([P2, P1]))
    Tes = functionspace(mesh=domain, element=("Lagrange", ORDER, (DIM, DIM)))

    # ∆ Define subdomains
    V, _ = Mxs.sub(0).collapse()
    P, _ = Mxs.sub(1).collapse()

    # ∆ Determine coordinates of space and create mapping tree
    x_n = Function(V)
    coords = np.array(x_n.function_space.tabulate_dof_coordinates()[:])
    tree = KDTree(coords)

    if t != "test":

        # ∆ Setup functions for assignment
        ori, z_data = Function(V), Function(V)

        # ∆ Assign angle and z-disc data
        azi, ele, zs = angle_assign(t, coords)

        # ∆ Store angles
        CA, CE = np.cos(azi), np.cos(ele)
        SA, SE = np.sin(azi), np.sin(ele)

        # ∆ Create interpolation functions
        # µ Basis vector 1
        def nu_1(phi_xyz):
            _, idx = tree.query(phi_xyz.T, k=1)
            return np.array([CA[idx]*CE[idx], SA[idx]*CE[idx], -SE[idx]])
        # µ Basis vector 2
        def nu_2(phi_xyz):
            _, idx = tree.query(phi_xyz.T, k=1)
            return np.array([-SA[idx], CA[idx], np.zeros_like(CA[idx])])
        # µ Basis vector 3
        def nu_3(phi_xyz):
            _, idx = tree.query(phi_xyz.T, k=1)
            return np.array([CA[idx]*SE[idx], SA[idx]*SE[idx], CE[idx]])

        # ∆ Create z_disc id data
        z_arr = z_data.x.array.reshape(-1, 3)
        z_arr[:, 0], z_arr[:, 1], z_arr[:, 2] = zs, azi, ele
        z_data.x.array[:] = z_arr.reshape(-1)

        # ∆ Create angular orientation vector
        ori.interpolate(nu_1)

        # ∆ Save foundations data
        z_data.name = "Node Mapping"
        ori.name = "Orientation Vectors"
        with io.VTXWriter(MPI.COMM_WORLD, f"_bp/_{t}/_ZSN.bp", z_data, engine="BP4") as fz:
            fz.write(0)
        with io.VTXWriter(MPI.COMM_WORLD, f"_bp/_{t}/_ORI.bp", ori, engine="BP4") as fo:
            fo.write(0)

        # ∆ Create push tensor function
        Push = Function(Tes)

        # ∆ Define push interpolation
        def forward(phi_xyz):
            _, idx = tree.query(phi_xyz.T, k=1)
            f00, f01, f02 = CA[idx]*CE[idx], -SA[idx], CA[idx]*SE[idx]
            f10, f11, f12 = SA[idx]*CE[idx], CA[idx], SA[idx]*SE[idx]
            f20, f21, f22 = -SE[idx], np.zeros_like(CE[idx]), CE[idx]
            return np.array([f00, f01, f02, f10, f11, f12, f20, f21, f22])

        # ∆ Interpolate Push as Forward transform
        Push.interpolate(forward)

    else:
        # ∆ Push as Forward transform
        Push = ufl.Identity(DIM)

    # ∆ Load key function variables
    mx = Function(Mxs)
    v, q = ufl.TestFunctions(Mxs)
    u, p = ufl.split(mx)
    u_nu = Push * u

    # ∆ Kinematics Setup
    i, j, k, l, a, b = ufl.indices(6)  
    I = ufl.Identity(DIM)  
    F = ufl.variable(I + ufl.grad(u_nu))

    if t != "test":

        # ∆ Metric tensors
        # µ [UNDEFORMED] Covariant basis vectors
        A1, A2, A3 = Function(V), Function(V), Function(V)
        # ¬ create base 1
        A1.interpolate(nu_1)
        # ¬ create base 2
        A2.interpolate(nu_2)
        # ¬ create base 3
        A3.interpolate(nu_3)
        
        # µ [UNDEFORMED] Metric tensors
        G_v = ufl.as_tensor([
            [ufl.dot(A1, A1), ufl.dot(A1, A2), ufl.dot(A1, A3)],
            [ufl.dot(A1, A2), ufl.dot(A2, A2), ufl.dot(A2, A3)],
            [ufl.dot(A1, A3), ufl.dot(A2, A3), ufl.dot(A3, A3)]
        ]) 
        G_v_inv = ufl.inv(G_v)  
        # µ [DEFORMED] Metric covariant tensors
        g_v = ufl.as_tensor([
            [ufl.dot(F * A1, F * A1), ufl.dot(F * A1, F * A2), ufl.dot(F * A1, F * A3)],
            [ufl.dot(F * A2, F * A1), ufl.dot(F * A2, F * A2), ufl.dot(F * A2, F * A3)],
            [ufl.dot(F * A3, F * A1), ufl.dot(F * A3, F * A2), ufl.dot(F * A3, F * A3)]
        ])

        # ∆ Christoffel symbols
        Gamma = ufl.as_tensor(
            0.5 * G_v_inv[k, l] * (ufl.grad(G_v[j, l])[i] + ufl.grad(G_v[i, l])[j] - ufl.grad(G_v[i, j])[l]),
            (i, j, k)
        )

        # ∆ Covariant derivative
        covDev = ufl.as_tensor(ufl.grad(v)[i, j] + Gamma[i, k, j] * v[k], (i, j))

    else:
        # ∆ Covariant derivative
        covDev = ufl.as_tensor(ufl.grad(v)[i, j], (i, j))

    # ∆ Kinematics Tensors
    C = ufl.variable(F.T * F)  
    B = ufl.variable(F * F.T)  
    if t != "test":
        E = ufl.as_tensor(0.5 * (g_v - G_v))   
    else:
        E = ufl.as_tensor(0.5 * (C - I))
    J = ufl.det(F)   

    # ∆ Extract constitutive terms
    b0, bf, bt = gcc

    # ∆ Exponent term
    Q = (
        bf * E[0,0]**2 + bt * 
        (
            E[1,1]**2 + E[2,2]**2 + E[1,2]**2 + E[2,1]**2 + 
            E[0,1]**2 + E[1,0]**2 + E[0,2]**2 + E[2,0]**2
        )
    )

    # ∆ Second Piola-Kirchhoff stress
    if t != "test":
        SPK = b0/4 * ufl.exp(Q) * ufl.as_matrix([
            [4*bf*E[0,0], 2*bt*(E[1,0] + E[0,1]), 2*bt*(E[2,0] + E[0,2])],
            [2*bt*(E[0,1] + E[1,0]), 4*bt*E[1,1], 2*bt*(E[2,1] + E[1,2])],
            [2*bt*(E[0,2] + E[2,0]), 2*bt*(E[1,2] + E[2,1]), 4*bt*E[2,2]],
        ]) - p * G_v_inv
    else:
        SPK = b0/4 * ufl.exp(Q) * ufl.as_matrix([
            [4*bf*E[0,0], 2*bt*(E[1,0] + E[0,1]), 2*bt*(E[2,0] + E[0,2])],
            [2*bt*(E[0,1] + E[1,0]), 4*bt*E[1,1], 2*bt*(E[2,1] + E[1,2])],
            [2*bt*(E[0,2] + E[2,0]), 2*bt*(E[1,2] + E[2,1]), 4*bt*E[2,2]],
        ]) - p * I

    # ∆ Residual
    dx = ufl.Measure(integral_type="dx", domain=domain, metadata={"quadrature_degree": QUADRATURE})
    R = ufl.as_tensor(SPK[a, b] * F[j, b] * covDev[j, a]) * dx + q * (J - 1) * dx

Could you profile your code a bit to help steer the discussion? I.e., add timings between code segments to understand where the bottleneck lies.
How many nonlinear iterations are required?
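Something as simple as wrapping the main segments with time.perf_counter() would already tell us a lot; a rough sketch (the segment shown is just an example):

import time

t0 = time.perf_counter()
# ... e.g. the BC setup for this load step ...
t1 = time.perf_counter()
n_iter, conv = solver.solve(u_p)
t2 = time.perf_counter()
print(f"BC setup: {t1 - t0:.2f} s, solve: {t2 - t1:.2f} s ({n_iter} Newton iterations)")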

On the HPC cluster, how many cores do you use, and does it scale well? Your MWE is just a mesh with 1000 nodes (a mixed P/P-1 space with P = 2, around 25,000 DOFs), so I presume you mean it is slow on just a few cores?

One quick thing to try is pulling the lines

    d_xy0 = dirichletbc(default_scalar_type(du), xx0, Mxs.sub(0).sub(X))
    d_xy1 = dirichletbc(default_scalar_type(0.0), xx1, Mxs.sub(0).sub(X))
    bc = [d_xy0, d_yy0, d_zy0, d_xy1, d_yy1, d_zy1]
    problem.bcs = bc

out of the iteration and simply resetting the values of these BCs in the loop. That helps FEniCS with some caching.
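For example, something along these lines (a rough, untested sketch; it assumes dirichletbc accepts a Constant for this sub-space, so that only its value needs updating between load steps):

from dolfinx.fem import Constant

# create the BCs once, with the applied displacement held in a Constant
u_applied = Constant(domain, default_scalar_type(0.0))
d_xy0 = dirichletbc(u_applied, xx0, Mxs.sub(0).sub(X))
d_xy1 = dirichletbc(default_scalar_type(0.0), xx1, Mxs.sub(0).sub(X))
problem.bcs = [d_xy0, d_yy0, d_zy0, d_xy1, d_yy1, d_zy1]

for ii, kk in enumerate([0, 5, 10, 15, 20]):
    u_applied.value = kk / 100  # only the BC value changes per load step
    n_iter, conv = solver.solve(u_p)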


Hey @Stein - really appreciate the response.

I've added timings; the bottleneck is definitely the solve within each load step.
The earlier load steps solve in under 20 s; here are examples from the later ones:

Solved 15%: iterations=5, residual=True
[2025-07-09 18:21:54.490] [info] Checking required entities per dimension
[2025-07-09 18:21:54.490] [info] Cell type: 0 dofmap: 6000x10
[2025-07-09 18:21:54.491] [info] Global index computation
[2025-07-09 18:21:54.491] [info] Got 2 index_maps
[2025-07-09 18:21:54.513] [info] GPS pseudo-diameter:(21) 9242-6
[2025-07-09 18:21:54.514] [info] Get global indices
FOURTH REGION [IT 3] [SOLVER SETUP]: -79.88781094551086

(79 s for 5 iterations; I wrote the timer backwards, so ignore the negative sign)
and

Solved 20%: iterations=8, residual=True
[2025-07-09 18:22:51.705] [info] Checking required entities per dimension
[2025-07-09 18:22:51.705] [info] Cell type: 0 dofmap: 6000x10
[2025-07-09 18:22:51.706] [info] Global index computation
[2025-07-09 18:22:51.706] [info] Got 2 index_maps
[2025-07-09 18:22:51.727] [info] GPS pseudo-diameter:(21) 9242-6
[2025-07-09 18:22:51.728] [info] Get global indices
FOURTH REGION [IT 4] [SOLVER SETUP]: -115.67447781562805

(115 s for 8 iterations)

However, for 1,000 nodes / 25,000 DOFs that seems like quite a long solve time.
My actual geometry is simple but has 100,000-500,000 nodes.

I'm reasonably new to HPC use, but I have been running across 4-10 cores with varying amounts of RAM and it does not seem to scale well.
Am I meant to modify my code to make better use of mpirun?

Implementing your BC change did shave off a few seconds, though. Thank you.

** Edit: I did not address your second question: for this MWE at most 8 nonlinear iterations are required; for my other test case it is between 15 and 20.

Hmm, so that's roughly 16 seconds for the assembly + solve of a 3D P = 2 case with 25k DOFs on 1 core (?). I share your sentiment that one can probably squeeze out some more performance, especially since a lot should be cacheable.

The first thing to check is whether the bottleneck lies with the assembly or with the solve. Setting the log level to debug gives, for example, these lines:

[2025-07-09 09:02:15.369] [info] Newton iteration 2: r (abs) = 0.5187813429061455 (tol = 1e-05), r (rel) = 0.06121085961686916 (tol = 1e-05)
[2025-07-09 09:02:15.778] [info] PETSc Krylov solver starting to solve system.
[2025-07-09 09:02:30.804] [debug] Elapsed wall, usr, sys time: 15.030000, 15.020000, 0.000000 (PETSc Krylov solver)

So the linear solve within each Newton iteration dominates the solve time (not the assembly).

Switching to the MUMPS direct solver already gives quite a speed bump on my system. Add the following below the definition of your nonlinear solver:

from petsc4py import PETSc

ksp = solver.krylov_solver
opts = PETSc.Options()
ksp.setOptionsPrefix("")
opts["ksp_type"] = "preonly"
opts["pc_type"] = "lu"
opts["pc_factor_mat_solver_type"] = "mumps"
opts["ksp_monitor"] = None
ksp.setFromOptions()

Thank you again for your response. The 8-iteration solve now takes 24 s instead of 115 s, so that is a considerable improvement.

[2025-07-09 19:13:52.704] [info] Newton solver finished in 8 iterations and 8 linear solver iterations.
Solved 20%: iterations=8, residual=True
[2025-07-09 19:13:52.704] [info] Checking required entities per dimension
[2025-07-09 19:13:52.704] [info] Cell type: 0 dofmap: 6000x10
[2025-07-09 19:13:52.705] [info] Global index computation
[2025-07-09 19:13:52.705] [info] Got 2 index_maps
[2025-07-09 19:13:52.726] [info] GPS pseudo-diameter:(21) 9242-6
[2025-07-09 19:13:52.728] [info] Get global indices
FOURTH REGION [IT 4] [SOLVER SETUP]: -24.54500389099121

I am curious what MUMPS is facilitating here.
And to improve my understanding: what does "cached" mean?

I'll consider this solved. I really appreciate your time @Stein :slight_smile:


I wrote the starting bit about caching before switching the linear solver, so that may not apply entirely. (I suppose there is still some caching-related stuff you could try if you wanted to take performance a few steps further, e.g. switching to PETSc's SNES solver and using inexact Newton iterations.)
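If you ever go that route, the options-only part could look something like this (a sketch; you would still need to set up the SNES-based nonlinear problem yourself, e.g. along the lines of the dolfinx SNES demos, and inexact Newton only pays off once the linear solver is iterative rather than a direct LU):

from petsc4py import PETSc

opts = PETSc.Options()
opts["snes_type"] = "newtonls"  # line-search Newton
opts["snes_ksp_ew"] = None      # Eisenstat-Walker forcing: loose linear tolerances in early Newton steps
opts["snes_monitor"] = None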

To be honest, I'm not entirely sure why MUMPS is so much faster than the default here. It is definitely the more powerful solver, so I always switch to MUMPS as my default and almost always get better performance that way.

For your large system you may need to explore iterative linear solvers, but for mixed problems in particular the preconditioning gets challenging.
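If you do go there, a common starting point for the (u, p) saddle point is PETSc's fieldsplit preconditioner. A rough, untested sketch of the options (you would additionally have to tell the PC which dofs belong to the displacement and pressure blocks, e.g. via PC.setFieldSplitIS, and the inner solver choices are very much problem-dependent):

from petsc4py import PETSc

opts = PETSc.Options()
opts["ksp_type"] = "gmres"
opts["pc_type"] = "fieldsplit"
opts["pc_fieldsplit_type"] = "schur"          # Schur-complement factorization of the 2x2 block system
opts["pc_fieldsplit_schur_fact_type"] = "full"
opts["fieldsplit_0_ksp_type"] = "preonly"     # displacement block
opts["fieldsplit_0_pc_type"] = "gamg"
opts["fieldsplit_1_ksp_type"] = "preonly"     # pressure (Schur complement) block
opts["fieldsplit_1_pc_type"] = "jacobi"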


Brilliant :slight_smile:
Thank you for your time