Solving the Poisson equation repeatedly: Performance optimization

I want to solve the Poisson equation repeatedly for a varying charge distribution. I know that this topic has already been discussed (see e.g. Efficiently solving Poisson eqn repeatedly - dolfinx - FEniCS Project), but the suggestions given there either do not work out for me or are already implemented. I nevertheless still have the feeling that my code is not optimal and can be optimized further.

Please consider the following MWE, which is the fastest code I was able to produce. Its mesh size (750’000 cells) is similar to the mesh sizes in my application (600’000 to 6’000’000 cells).

import ufl
import numpy as np

from petsc4py import PETSc
from mpi4py import MPI

from dolfinx import fem, mesh, io, plot, default_scalar_type
from dolfinx.fem.petsc import assemble_vector, assemble_matrix, create_vector, apply_lifting, set_bc
from dolfinx.mesh import create_unit_square, locate_entities, create_unit_cube, exterior_facet_indices, CellType

import petsc4py
import time



# Create mesh
dfx_mesh = create_unit_cube(MPI.COMM_WORLD, 50, 50, 50, CellType.tetrahedron)
c_to_v = dfx_mesh.topology.connectivity(3,0) 
number_cells = c_to_v.num_nodes   
print("Mesh with "+str(number_cells)+" cells initialized.")

start = time.time()
# Initialize function spaces
function_space = fem.functionspace(dfx_mesh, ("Lagrange", 1))
charge_density_space = fem.functionspace(dfx_mesh, ("DG", 0))
 
# Create boundary condition

boundary_value = 0
uD = default_scalar_type(boundary_value)
tdim = dfx_mesh.topology.dim
fdim = tdim - 1 
dfx_mesh.topology.create_entities(fdim)
dfx_mesh.topology.create_connectivity(fdim, tdim)
boundary_facets = mesh.exterior_facet_indices(dfx_mesh.topology)
boundary_dofs = fem.locate_dofs_topological(function_space, fdim, boundary_facets)

our_bcs = [fem.dirichletbc(uD, boundary_dofs, function_space)]


#Create test and trial function
phi = ufl.TrialFunction(function_space)
v = ufl.TestFunction(function_space)

# Define solution variable
PHI = fem.Function(function_space)
PHI.name = "PHI"
        
# Create charge density
rho = fem.Function(charge_density_space)
rho.name = "rho"

# Set initial charge distribution
rho.x.array[:] = np.zeros(number_cells)[:]

# Create forms
a = ufl.dot(ufl.grad(phi), ufl.grad(v))*ufl.dx
L = rho*v*ufl.dx

# Compile forms
bilinear_form = fem.form(a)
linear_form = fem.form(L)

# Assemble matrices and vectors
A = assemble_matrix(bilinear_form, bcs=our_bcs)
A.assemble()

b = create_vector(linear_form)
assemble_vector(b, linear_form)

# Define solver
        
solver = PETSc.KSP().create(dfx_mesh.comm)
solver.setOperators(A)
solver.setType(PETSc.KSP.Type.PREONLY)
solver.getPC().setType(PETSc.PC.Type.CHOLESKY)
solver.getPC().setFactorSolverType("cholmod")
 
# Apply Dirichlet boundary condition to the vector
apply_lifting(b, [bilinear_form], [our_bcs])
set_bc(b, our_bcs)

# Solve linear problem for the first time, to save time at later calls. 
solver.solve(b, PHI.x.petsc_vec)
solver.getPC().setReusePreconditioner(True)

end = time.time()
print("Poisson solver initialization time: ", str(end-start)+"s")

def solve():
    '''
    Solve-function to be called repeatedly.
    '''

    start = time.time()
    b = create_vector(linear_form)
    assemble_vector(b, linear_form)

    # Apply Dirichlet boundary condition to the vector
    apply_lifting(b, [bilinear_form], [our_bcs])
    set_bc(b, our_bcs)

    # Solve linear problem
    solver.solve(b, PHI.x.petsc_vec)

    end = time.time()
    print("Poisson solver solution time: ", str(end-start)+"s")


rho = np.zeros(number_cells)
cell_index = np.random.randint(0, number_cells)
rho[cell_index] = 100
solve()

rho = np.zeros(number_cells)
cell_index = np.random.randint(0, number_cells)
rho[cell_index] = 15
solve()

rho = np.zeros(number_cells)
cell_index = np.random.randint(0, number_cells)
rho[cell_index] = 36
solve()

This code gives the following output:

Mesh with 750000 cells initialized.
Poisson solver initialization time:  7.316346883773804s
Poisson solver solution time:  0.14932680130004883s
Poisson solver solution time:  0.12883353233337402s
Poisson solver solution time:  0.1407771110534668s

Edit/NB: I have also experimented with cholmod as the solver, which is a little faster in the context of my MWE (with the initial preconditioning being significantly faster), although it is somewhat slower in the context of my original application:

Mesh with 750000 cells initialized.
Poisson solver initialization time:  2.8728795051574707s
Poisson solver solution time:  0.09317612648010254s
Poisson solver solution time:  0.13088488578796387s
Poisson solver solution time:  0.12652373313903809s

I compiled PETSc with:

./configure \
        --COPTFLAGS="-O2" \
        --CXXOPTFLAGS="-O2" \
        --FOPTFLAGS="-O2" \
        --with-64-bit-indices=no \
        --with-debugging=no \
        --with-fortran-bindings=no \
        --with-shared-libraries \
        --with-cuda \
        --download-hypre \
        --download-metis \
        --download-mumps-avoid-mpi-in-place \
        --download-mumps \
        --download-ptscotch \
        --download-scalapack \
        --download-spai \
        --download-suitesparse \
        --download-superlu \
        --download-cusparse \
        --with-scalar-type=real \
        --with-precision=double \

Unfortunately, I was not able to test superlu_dist, since its installation failed in my case.

My general question is: How can I speed up this code further? I need a speedup of at least a factor of 3 for my application. Any help here is appreciated!

More detailed questions in this context are:

1.) How can I enable GPU acceleration?

Using

./configure [...] --with-cuda [...]

when installing PETSc did not significantly alter the performance of the individual calls of solve(). Is there a not-too-complicated way to benefit from GPU acceleration within the individual solve() steps?

2.) Is rebuilding the vector within solve() necessary?

In Efficiently solving Poisson eqn repeatedly - #3 by dokken it is claimed that it should not be necessary to recreate/reassemble the vector b within solve(). Unfortunately, my test cases fail if I do not recreate and reassemble b within solve() – i.e. the Poisson solver gives wrong results, although it does not crash. At the same time, profiling shows that recreating and reassembling b is a very time-consuming step.

3.) Do you know a suitable iterative method?

I have already played around with iterative methods, but although they sped up the preconditioner setup, the solve() calls did not become significantly faster. Do you perhaps have an idea for a suitable iterative method (together with an associated solver) that I should try?

4.) Do you have more ideas for compiler flags?

I have already tested compiler flags for the compilation of the forms, along the lines of JIT options and visualization using Pandas — FEniCSx tutorial, but this only gave a marginal speed-up. Are there further compiler flags that I should try?
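
For reference, the kind of thing I mean is passing JIT options when compiling the forms, roughly as follows (the concrete flags are just an example taken from that tutorial, not necessarily the ones I used):

# Example JIT options along the lines of the FEniCSx tutorial; the flags
# shown here are illustrative.
jit_options = {"cffi_extra_compile_args": ["-Ofast", "-march=native"],
               "cffi_libraries": ["m"]}
bilinear_form = fem.form(a, jit_options=jit_options)
linear_form = fem.form(L, jit_options=jit_options)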

Thank you very much for your help!

P.S. Funnily enough, the performance got much better when I once accidentally forgot to set the preconditioner via

solver.getPC().setType(PETSc.PC.Type.CHOLESKY)

The solution time then improves by a factor of 3. But, again, the Poisson solver then gives wrong results in my test cases.

You need to zero out the entries in the vector before each solve (assemble_vector adds into the existing entries, so without zeroing the contributions accumulate); see for instance Diffusion of a Gaussian function — FEniCSx tutorial, where b is zeroed within the temporal loop.
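
In terms of your MWE, a minimal sketch of the per-solve pattern (reusing the b created once outside solve(); in parallel you additionally need ghost updates) would be:

# Zero the existing vector instead of creating a new one each call;
# assemble_vector adds into the current entries, so skipping the zeroing
# makes the contributions accumulate and the results go wrong.
with b.localForm() as loc_b:
    loc_b.set(0)
assemble_vector(b, linear_form)
apply_lifting(b, [bilinear_form], [our_bcs])
set_bc(b, our_bcs)
solver.solve(b, PHI.x.petsc_vec)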

See the FEniCS performance tests for the Poisson problem.

On another note, are you running your code in serial or in parallel? Have you tested what speedup you get with 1, 2 and 4 processes? At least assembly scales very well with the number of processes.

Thanks for your answer!

1.) I modified the solve() function to:

def solve():
    '''
    Solve-function to be called repeatedly.
    '''
    start = time.time()
    with b.localForm() as loc_b:
        loc_b.set(0)
    assemble_vector(b, linear_form)

    # Apply Dirichlet boundary condition to the vector
    apply_lifting(b, [bilinear_form], [our_bcs])
    set_bc(b, our_bcs)

    # Solve linear problem
    solver.solve(b, PHI.x.petsc_vec)

    end = time.time()
    print("Poisson solver solution time: ", str(end-start)+"s")

Unfortunately, this did not lead to a significant speedup, since apparently the assembly stage is much more costly than the vector-creation stage.

2.) Following GitHub - FEniCS/performance-test: Mini App for FEniCSx performance testing I tried:

#A.convert(mat_type=petsc4py.PETSc.Mat.Type.AIJCUSPARSE)

b = create_vector(linear_form)
assemble_vector(b, linear_form)
b.setType(petsc4py.PETSc.Vec.Type.CUDA)

PHI.x.petsc_vec.setType(petsc4py.PETSc.Vec.Type.CUDA)
solver = PETSc.KSP().create(dfx_mesh.comm)
solver.setOperators(A)
solver.setType(PETSc.KSP.Type.CG)
solver.setTolerances(rtol=1e-8)
solver.getPC().setType(PETSc.PC.Type.HYPRE)
solver.getPC().setHYPREType("boomeramg")
PETSc.Options().setValue("-pc_hypre_boomeramg_strong_threshold",0.7)
PETSc.Options().setValue("-pc_hypre_boomeramg_agg_nl",4)
PETSc.Options().setValue("-pc_hypre_boomeramg_agg_num_paths",2)

but unfortunately, this led to

terminate called after throwing an instance of 'thrust::THRUST_200500_700_NS::system::system_error'

[...]

Aborted (core dumped)

But nevertheless, this seems promising and I will check this out further!

3.) I don't really know how to measure the performance adequately when using MPI.

But mpirun -n 1 python3 foo.py gives:

Mesh with 750000 cells initialized.
Poisson solver initialization time:  5.785945653915405s
Poisson solver solution time:  0.11495041847229004s
Poisson solver solution time:  0.11482715606689453s
Poisson solver solution time:  0.11506056785583496s

while mpirun -n 2 python3 foo.py gives:

Mesh with 380711 cells initialized.
Mesh with 380462 cells initialized.
Poisson solver initialization time:  3.725405693054199s
Poisson solver initialization time:  3.725227117538452s
Poisson solver solution time:  0.06821203231811523s
Poisson solver solution time:  0.06821179389953613s
Poisson solver solution time:  0.06825900077819824s
Poisson solver solution time:  0.06825900077819824s
Poisson solver solution time:  0.06731867790222168s
Poisson solver solution time:  0.06731796264648438s

while mpirun -n 4 python3 foo.py unfortunately took so long that I had to kill the process. At least mpirun -n 2 hence seems to give a speedup.

4.) NB: I have also tried DG1 elements for the function space, because I thought this should be easier to parallelize. That led to a significant speedup:

Mesh with 750000 cells initialized.
Poisson solver initialization time:  1.8930561542510986s
Poisson solver solution time:  0.023633241653442383s
Poisson solver solution time:  0.02355194091796875s
Poisson solver solution time:  0.023795366287231445s
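
For clarity, the change was simply the element of the solution space:

# Switching the solution space from continuous Lagrange P1 to DG1
function_space = fem.functionspace(dfx_mesh, ("DG", 1))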

Unfortunately, DG1 is currently not an option for my application.

Here is a slightly modified version of your code (your rho-update does not do what you think it does: rho = np.zeros(number_cells) rebinds the name rho to a plain NumPy array instead of updating the dolfinx Function, so the right-hand side never changes; you need to write into rho.x.array).

import ufl
import numpy as np

from petsc4py import PETSc
from mpi4py import MPI

from dolfinx import fem, mesh, default_scalar_type
from dolfinx.fem.petsc import (
    assemble_vector,
    assemble_matrix,
    create_vector,
    apply_lifting,
    set_bc,
)
from dolfinx.mesh import (
    create_unit_cube,
    CellType,
)

import time


# Create mesh
dfx_mesh = create_unit_cube(MPI.COMM_WORLD, 50, 50, 50, CellType.tetrahedron)
c_to_v = dfx_mesh.topology.connectivity(3, 0)
number_cells = c_to_v.num_nodes
print("Mesh with " + str(number_cells) + " cells initialized.")

start = time.perf_counter()
# Initialize function spaces
function_space = fem.functionspace(dfx_mesh, ("Lagrange", 1))
charge_density_space = fem.functionspace(dfx_mesh, ("DG", 0))

# Create boundary condition

boundary_value = 0
uD = default_scalar_type(boundary_value)
tdim = dfx_mesh.topology.dim
fdim = tdim - 1
dfx_mesh.topology.create_entities(fdim)
dfx_mesh.topology.create_connectivity(fdim, tdim)
boundary_facets = mesh.exterior_facet_indices(dfx_mesh.topology)
boundary_dofs = fem.locate_dofs_topological(function_space, fdim, boundary_facets)

our_bcs = [fem.dirichletbc(uD, boundary_dofs, function_space)]

# Create test and trial function
phi = ufl.TrialFunction(function_space)
v = ufl.TestFunction(function_space)

# Define solution variable
PHI = fem.Function(function_space)
PHI.name = "PHI"

# Create charge density
rho = fem.Function(charge_density_space)
rho.name = "rho"

# Set initial charge distribution
rho.x.array[:] = np.zeros(number_cells)[:]

# Create forms
a = ufl.dot(ufl.grad(phi), ufl.grad(v)) * ufl.dx
L = rho * v * ufl.dx

# Compile forms
bilinear_form = fem.form(a)
linear_form = fem.form(L)

# Assemble matrices and vectors
A = assemble_matrix(bilinear_form, bcs=our_bcs)
A.assemble()

b = create_vector(linear_form)

# Define solver
petsc_options = {
    "ksp_type": "cg",
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
    "pc_hypre_boomeramg_strong_threshold": 0.7,
    "pc_hypre_boomeramg_agg_nl": 4,
    "pc_hypre_boomeramg_agg_num_paths": 2,
    "ksp_rtol": 1e-8,
    "ksp_error_if_not_converged": True,
    "ksp_monitor": None,
}

opts = PETSc.Options()
for key, value in petsc_options.items():
    opts[key] = value


solver = PETSc.KSP().create(dfx_mesh.comm)
solver.setOperators(A)
solver.setFromOptions()
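# Attach the constant vector as a near-nullspace of A; algebraic multigrid
# preconditioners can use this information when building coarse levels.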
nsp = PETSc.NullSpace().create(constant=True)
A.setNearNullSpace(nsp)
solver.setUp()

end = time.perf_counter()
print("Poisson solver initialization time: ", str(end - start) + "s")

b = create_vector(linear_form)


def solve():
    """
    Solve-function to be called repeatedly.
    """

    start = time.perf_counter()
    with b.localForm() as loc_b:
        loc_b.set(0)
    assemble_vector(b, linear_form)
    # Apply Dirichlet boundary condition to the vector
    apply_lifting(b, [bilinear_form], [our_bcs])
    b.ghostUpdate(addv=PETSc.InsertMode.ADD_VALUES, mode=PETSc.ScatterMode.REVERSE)
    set_bc(b, our_bcs)
    b.ghostUpdate(addv=PETSc.InsertMode.INSERT, mode=PETSc.ScatterMode.FORWARD)
    # Solve linear problem
    solver.solve(b, PHI.x.petsc_vec)
    PHI.x.petsc_vec.ghostUpdate(
        addv=PETSc.InsertMode.INSERT, mode=PETSc.ScatterMode.FORWARD
    )
    end = time.perf_counter()
    print("Poisson solver solution time: ", str(end - start) + "s")


rho.x.array[:] = np.random.random(number_cells)
solve()

rho.x.array[:] = np.random.random(number_cells)
solve()

rho.x.array[:] = np.random.random(number_cells)
solve()

with the output:

Serial:

Poisson solver initialization time:  1.5285748639998928s
Poisson solver solution time:  0.30014192700014064s
Poisson solver solution time:  0.29618113399988033s
Poisson solver solution time:  0.28891980200000944s

2 Processes:

Mesh with 382823 cells initialized.
Poisson solver initialization time:  1.0471213470000293s
Poisson solver solution time:  0.1842988289999994s
Poisson solver solution time:  0.19156044199985445s
Poisson solver solution time:  0.19565346200010936s

4 processes:

Mesh with 190104 cells initialized.
Poisson solver initialization time:  0.7649741039999753s
Poisson solver solution time:  0.19498046800003976s
Poisson solver solution time:  0.19116438899982313s
Poisson solver solution time:  0.1942980929998157s

You observe that parallel communication makes 4 processes less efficient than 2.
Changing to, for instance, lu with mumps, i.e.

petsc_options = {
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
    "ksp_error_if_not_converged": True,
    "ksp_monitor": None,
}
opts = PETSc.Options()
for key, value in petsc_options.items():
    opts[key] = value


solver = PETSc.KSP().create(dfx_mesh.comm)
solver.setOperators(A)
solver.setFromOptions()
solver.setUp()

(as the matrix is factorized once up front, so each solve only needs a forward/backward substitution instead of an iteration)
I get:
serial:

Poisson solver initialization time:  9.758978084000091s
Poisson solver solution time:  0.19574231299998246s
Poisson solver solution time:  0.199016833000087s
Poisson solver solution time:  0.19563366099987434s

2 processes

Poisson solver initialization time:  8.43661053400001s
Poisson solver solution time:  0.12953514999981053s
Poisson solver solution time:  0.1228931680000187s
Poisson solver solution time:  0.12527213899988965s

4 processes

Mesh with 195566 cells initialized.
Poisson solver initialization time:  6.685400431000062s
Poisson solver solution time:  0.09559633699996084s
Poisson solver solution time:  0.09497269700000288s
Poisson solver solution time:  0.09188262299994676s

i.e. since your problem is fairly small, I would advise using mumps, as you can then cache the factorization after the first solve (if you are going to run this many times).
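
A small sketch of what that amounts to (nothing extra is strictly required: with "ksp_type": "preonly" and an unchanged A, the factorization computed in the first solve is reused automatically):

# The matrix is factorized once (in solver.setUp() / the first solve);
# later solver.solve() calls only do forward/backward substitution as long
# as the operator A is unchanged. Optionally, the reuse can be made explicit:
solver.getPC().setReusePreconditioner(True)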

Hi,

thanks for your help, which is really appreciated!

1.) Unfortunately, using boomeramg still triggers the same error as before, i.e.:

terminate called after throwing an instance of 'thrust::THRUST_200500_700_NS::system::system_error'

[...]

Aborted (core dumped)

But, since your performance results are not so promising, I won’t follow this approach further.

2.) Thanks for your updated code! I got the following results (within my docker container):

# mpirun --allow-run-as-root -n 1 python3 foo_new.py 

Mesh with 750000 cells initialized.
Poisson solver initialization time:  6.771883388981223s
  0 KSP Residual norm 1.381845389936e-03
  1 KSP Residual norm 1.313602668835e-16
Poisson solver solution time:  0.12583965808153152s
  0 KSP Residual norm 1.382014488128e-03
  1 KSP Residual norm 1.311000429556e-16
Poisson solver solution time:  0.12275488814339042s
  0 KSP Residual norm 1.381736467241e-03
  1 KSP Residual norm 1.311564258291e-16
Poisson solver solution time:  0.1240297188051045s

# mpirun --allow-run-as-root -n 2 python3 foo_new.py

Mesh with 380462 cells initialized.
Mesh with 380711 cells initialized.
Poisson solver initialization time:  4.019601537846029s
Poisson solver initialization time:  4.019710090942681s
  0 KSP Residual norm 1.380327200707e-03
  1 KSP Residual norm 1.284710263900e-16
Poisson solver solution time:  0.07814178476110101s
Poisson solver solution time:  0.07809854531660676s
  0 KSP Residual norm 1.381972992565e-03
  1 KSP Residual norm 1.286314138240e-16
Poisson solver solution time:  0.06840593414381146s
Poisson solver solution time:  0.06840664520859718s
  0 KSP Residual norm 1.381562255136e-03
  1 KSP Residual norm 1.281695657896e-16
Poisson solver solution time:  0.06873816484585404s
Poisson solver solution time:  0.06873790360987186s

# mpirun --allow-run-as-root -n 4 python3 foo_new.py

Mesh with 194653 cells initialized.
Mesh with 193329 cells initialized.
Mesh with 194647 cells initialized.
Mesh with 189068 cells initialized.
Poisson solver initialization time:  10.935429033823311s
Poisson solver initialization time:  10.935371587052941s
Poisson solver initialization time:  10.93540011998266s
Poisson solver initialization time:  10.935379961971194s
  0 KSP Residual norm 1.380460719801e-03
  1 KSP Residual norm 1.291803197819e-16
Poisson solver solution time:  0.7106715491972864s
Poisson solver solution time:  0.7202352080494165s
Poisson solver solution time:  0.7353924340568483s
Poisson solver solution time:  0.7354521546512842s
  0 KSP Residual norm 1.382064576410e-03
  1 KSP Residual norm 1.290200693368e-16
Poisson solver solution time:  0.3530288618057966s
Poisson solver solution time:  0.3759522899053991s
Poisson solver solution time:  0.353237580973655s
Poisson solver solution time:  0.36680434737354517s
  0 KSP Residual norm 1.381325415412e-03
  1 KSP Residual norm 1.290498919352e-16
Poisson solver solution time:  0.5511552807874978s
Poisson solver solution time:  0.5605625589378178s
Poisson solver solution time:  0.5610360531136394s
Poisson solver solution time:  0.5782771990634501s

Somehow, mpirun -n 4 sometimes hangs (before “Mesh with xxx cells initialized” is printed) and nothing happens anymore. When this happens, no Python script can be executed within my container afterwards, and I have to rebuild my Docker container. This is quite strange.

3.) One final question: In MATSOLVERSUPERLU_DIST — PETSc 3.23.4 documentation it is claimed that superlu_dist will use the GPU if PETSc is configured with --with-cuda. Unfortunately, I cannot test its performance, since I was not able to install superlu_dist. Do you think superlu_dist would lead to a performance gain in the solve phase? If so, could you maybe measure the performance once for me? That would help me a lot, because then I could evaluate whether putting time into getting superlu_dist to work is worthwhile. Thanks!
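
(I assume that trying it would just be a matter of swapping the factor solver type in the options, along these lines; untested on my side, since my superlu_dist build fails:)

# Hypothetical options for trying superlu_dist instead of mumps
petsc_options = {
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "superlu_dist",
    "ksp_error_if_not_converged": True,
}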

A huge thanks for your help!

I have one further question:

My initial example also worked for

function_space = fem.functionspace(dfx_mesh, ("DG", 1))

while your modified code then leads to:

Mesh with 750000 cells initialized.
Traceback (most recent call last):
  File "[---] line 93, in <module>
    solver.setUp()
  File "petsc4py/PETSc/KSP.pyx", line 1605, in petsc4py.PETSc.KSP.setUp
petsc4py.PETSc.Error: error code 76
[0] KSPSetUp() at /usr/local/petsc/src/ksp/ksp/interface/itfunc.c:415
[0] PCSetUp() at /usr/local/petsc/src/ksp/pc/interface/precon.c:1071
[0] PCSetUp_LU() at /usr/local/petsc/src/ksp/pc/impls/factor/lu/lu.c:121
[0] MatLUFactorNumeric() at /usr/local/petsc/src/mat/interface/matrix.c:3307
[0] MatFactorNumeric_MUMPS() at /usr/local/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:2018
[0] Error in external library
[0] MUMPS error in numerical factorization: INFOG(1)=-10, INFO(2)=7 (see users manual https://mumps-solver.org/index.php?page=doc "Error and warning diagnostics")

Since DG1 leads to a significant speedup, I am considering modifying some aspects of my original application so that DG1 elements become suitable, too. Hence I am wondering: do you have a clue why DG1 does not work within the scope of your script?

You cannot change the function space of the unknown to DG without altering the variational formulation.
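
For reference, a DG discretization of the Poisson problem typically uses an interior-penalty formulation instead of the plain grad-grad form. A rough sketch (only an illustration under the usual SIPG assumptions, not a drop-in replacement for your code; alpha is a penalty parameter and the homogeneous Dirichlet condition is imposed weakly):

alpha = 10.0  # penalty parameter (illustrative choice)
h = ufl.CellDiameter(dfx_mesh)
n = ufl.FacetNormal(dfx_mesh)
a_dg = (
    ufl.inner(ufl.grad(phi), ufl.grad(v)) * ufl.dx
    # interior-facet terms: consistency, symmetry, penalty
    - ufl.inner(ufl.avg(ufl.grad(phi)), ufl.jump(v, n)) * ufl.dS
    - ufl.inner(ufl.jump(phi, n), ufl.avg(ufl.grad(v))) * ufl.dS
    + alpha / ufl.avg(h) * ufl.inner(ufl.jump(phi, n), ufl.jump(v, n)) * ufl.dS
    # weak (Nitsche-type) imposition of the homogeneous Dirichlet condition
    - ufl.inner(ufl.grad(phi), v * n) * ufl.ds
    - ufl.inner(phi * n, ufl.grad(v)) * ufl.ds
    + alpha / h * phi * v * ufl.ds
)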

I tested the code with the dolfinx docker images, where I had no issues. To me this seems like an issue with your custom build.

I rarely use superlu_dist, and do not have a GPU available for testing this.


Thanks! I guess all my questions are answered now. Thank you!