Solving 3D Stokes problem is slower on dolfinx than dolfin

I am trying to solve a 3D stokes problem using P1-P1 mixed element with a pressure stabilisation. Mumps direct solver is used. Here is the demo of the code:

import os
os.environ['OMP_NUM_THREADS'] = '1'
from mpi4py import MPI
import numpy as np
import time
import ufl,basix
from dolfinx import fem, io, mesh
from ufl import grad, div, nabla_grad, dx, inner, dot
from dolfinx.fem.petsc import LinearProblem

# build mesh
msh = mesh.create_box(MPI.COMM_WORLD,
                    [np.array([0, 0, 0]), np.array([2, 1, 0.3])],
                    [100, 50, 15], cell_type=mesh.CellType.tetrahedron)

# define element
Pv         = basix.ufl.element("CG", msh.basix_cell(), 1, shape=(msh.geometry.dim,))
P1         = basix.ufl.element("CG", msh.basix_cell(), 1)
TH         = basix.ufl.mixed_element([Pv, P1])

# create functionspace
W          = fem.functionspace(msh, TH)
V          = fem.functionspace(msh, Pv)

# define functions
dU         = ufl.TrialFunction(W)
U          = fem.Function(W)
velocity   = fem.Function(V)
f          = fem.Function(V)

# set boundary conditions
tdim = msh.topology.dim
fdim = tdim-1
msh.topology.create_connectivity(fdim, tdim)

def LEFT(xx):
    return np.isclose(xx[0],0)
def RIGHT(xx):
    return np.isclose(xx[0],2)

boundary_surface  = mesh.exterior_facet_indices(msh.topology)
left_surface      = mesh.locate_entities_boundary(msh, fdim, LEFT)
right_surface     = mesh.locate_entities_boundary(msh, fdim, RIGHT)
side              = np.concatenate((left_surface,right_surface))
other_surface     = np.setdiff1d(boundary_surface,side)

W_v_sub,_     = W.sub(0).collapse()
V_in          = fem.Function(W_v_sub)
V_noslip      = fem.Function(W_v_sub)
V_in.sub(0).interpolate(lambda x: 1*4*x[1]*(1-x[1])/(1)**2 * 4*x[2]*(0.3-x[2])/(0.3)**2 )

bcu      = fem.dirichletbc(V_in, fem.locate_dofs_topological([W.sub(0),W.sub(0).collapse()[0]], fdim, left_surface), W.sub(0))
bcu1     = fem.dirichletbc(V_in, fem.locate_dofs_topological([W.sub(0),W.sub(0).collapse()[0]], fdim, right_surface), W.sub(0))
bcu2     = fem.dirichletbc(V_noslip, fem.locate_dofs_topological([W.sub(0),W.sub(0).collapse()[0]], fdim, other_surface), W.sub(0))
bcs      = [bcu,bcu1,bcu2]

# define the linear problem
h           = 2*ufl.Circumradius(msh)
v, q        = ufl.TestFunctions(W)
u, p        = ufl.split(U)
du, dp      = ufl.split(dU)

F_a = (
+inner(  2*ufl.sym(nabla_grad(du)), ufl.sym(nabla_grad(v)))*dx
F_L = (

problem_U = LinearProblem(F_a, F_L, bcs=bcs, petsc_options={"ksp_type": "preonly", "pc_type": "lu", "pc_factor_mat_solver_type" : "mumps"})
# solve
Uh = problem_U.solve() 

if MPI.COMM_WORLD.rank == 0:
    print("Time: ",tb-ta)

This code takes about 16s with mpirun -n 24. However, when I calculate the same problem with dolfin it takes only about 8s. Here is the dolfin code:

import os
os.environ['OMP_NUM_THREADS'] = '1'
from dolfin import *
import time
from mpi4py import MPI as mpi
LIP = LagrangeInterpolator       
parameters["std_out_all_processes"] = False

# mesh      
mesh = BoxMesh(Point(0.0, 0.0, 0.0), Point(2.0, 1.0, 0.3), 100, 50, 15)

# define element
Pv     = VectorElement("CG", mesh.ufl_cell(), 1)  
P1     = FiniteElement("CG", mesh.ufl_cell(), 1)
TH     = MixedElement([Pv, P1])

# create function space and function
W      = FunctionSpace(mesh,TH)
U      = Function(W)
U0     = Function(W)
dU     = TrialFunction(W)

# set boudnary condition
def LEFT(x, on_boundary):
    tol = 1E-14
    return on_boundary and (x[0]==0)
def RIGHT(x, on_boundary):
    tol = 1E-14
    return on_boundary and (x[0]==2)
def SIDE(x, on_boundary):
    tol = 1E-14
    return on_boundary and (x[0]>0  + 1e-10) and (x[1]<2 - 1e-10)

exp_in = Expression(('1*4*x[1]*(1-x[1])/1 * 4*x[2]*(0.3-x[2])/0.09', '0.0', '0.0'), degree=2)
f      = Expression(('0.0', '0.0', '0.0'), degree=2)
V_in   = Function(FunctionSpace(mesh,Pv))
LIP.interpolate(V_in, exp_in)

bcu    = DirichletBC(W.sub(0), (0,0,0), SIDE)
bcu_1  = DirichletBC(W.sub(0), V_in, LEFT)
bcu_2  = DirichletBC(W.sub(0), V_in, RIGHT)
bcs    = [bcu,bcu_1,bcu_2]

# define linear problem
v, q = TestFunctions(W)
u, p = split(U)
u0, p0 = split(U0)
du, dp = split(dU)

+inner((grad(du)+grad(du).T), (0.5*grad(v)+0.5*grad(v).T))*dx

solve(a == L, U0, bcs)
# solver.solve(problem,U.vector())
info('Timing: Main newton solve {:4.2e} s'.format(time.time() - tic))

I also run these codes on a refined mesh. The time is 80s(dolfinx) vs 40s(dolfin). Still twice as slower.

Further, I tried 2D mesh with millions of points. The result doesn’t differs that much, but dolfinx is still around 10% ~ 15% slower.

I am wondering why could this happen? Is this related to the matrix assemble or the different setup of mumps solver? And what can I do to speed up dolfinx.

Question #1 on my mind: are you 100% sure that the comparison fair? For instance, is the same version of PETSc being used, and all its dependencies (e.g., mumps)?

I am not sure with setups since the code are copied from the demos. I didn’t make any change myself. is there any demos on checking and changing the parameters?

As for the versin of PETSC, is it possible to load the same version on dolfin and dolfinx?

My dolfin version is 2019.1 and the dolfinx version is 0.8.0

How did you install dolfin and dolfinx, and on which OS?

I am using Linux.
DOLFINx installed with conda install -c conda-forge fenics-dolfinx=0.8.0 mpich
DOLFIN installed with conda install -c conda-forge fenics

I cannot reproduce this behavior using 5 processes on my laptop.
With the legacy script takes

Process 0: Timing: Main newton solve 4.66e+01 s

while on it takes

Time:  33.32949423789978

Please also note that you have not explicitly specified mumps on the legacy code, and rely on it being enforced implicitly.
You can ensure that you are using the same solver in legacy with:

solve(a == L, U0, bcs, solver_parameters={"linear_solver": "mumps"})

Also, note that you are only prescribing velocity conditions to all boundaries, and pressure is determined up to a constant. You should enforce the appropriate null-space in both legacy fenics and DOLFINx to ensure that the direct solver can handle it.

Thanks,I would try your advise. Maybe I make some mistake on the setups.