How to parallelize assembling and solving of a PDE

Hey there!

I have to solve a complex problem from dynamic thermoelasticity. A corresponding toy problem would be the solution of the wave equation (with Dirichlet boundary values and a time-dependent force term):

import dolfin as df
import numpy as np

# Final time, time-step size and mesh resolution
T = 0.5
dt = 0.01
nx = ny = 1000

# Time-dependent load: ramps up to 1 on the corner patch x, y <= 0.1
f = df.Expression('x[0]<=0.1 && x[1]<=0.1 ? fmin(10*t,1) : 0', t=0, degree=0)

mesh = df.UnitSquareMesh(nx, ny)
V = df.FunctionSpace(mesh, 'CG', 2)
bc = df.DirichletBC(V, df.Constant(0), df.CompiledSubDomain('on_boundary'))

# Solutions at the two previous time levels
u_m = df.Function(V)
u_mm = df.Function(V)

u = df.TrialFunction(V)
v = df.TestFunction(V)

# Bilinear and linear forms of one implicit time step (u_m, u_mm are the previous levels)
a = u*v*df.dx + dt**2/4*df.dot(df.grad(u), df.grad(v))*df.dx
L = (2*u_m - u_mm)*v*df.dx - dt**2/4*df.dot(df.grad(2*u_m + u_mm), df.grad(v))*df.dx + dt**2*f*v*df.dx

u = df.Function(V)

t = 0
N = np.int64(T/dt)

# The system matrix does not change in time, so it is assembled once
A, b = df.assemble_system(a, L, bc)

for l in range(N):

    t += dt

    # Reassemble only the right-hand side and solve
    b = df.assemble(L)
    bc.apply(b)
    df.solve(A, u.vector(), b, 'gmres', 'icc')

    # Shift time levels and update the load
    u_mm.assign(u_m)
    u_m.assign(u)
    f.t = t

In my original problem, assembling the right-hand side takes roughly as long as solving the linear system. Currently, the code effectively runs on a single core, which is just too time-consuming. Hence, I would like to benefit from the 16 cores of my machine.
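(To see where the time goes, the two phases can be timed separately with dolfin's built-in Timer; a minimal sketch, the variable names are only for illustration:)

t_asm = df.Timer("assemble rhs")
b = df.assemble(L)
bc.apply(b)
time_assemble = t_asm.stop()      # elapsed wall time in seconds

t_slv = df.Timer("solve")
df.solve(A, u.vector(), b, 'gmres', 'icc')
time_solve = t_slv.stop()

print("assemble: %.3f s, solve: %.3f s" % (time_assemble, time_solve))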

Now, replacing the solve command line by

df.solve(A,u.vector(),b,'mumps')

allows the script to be run via mpirun. This reduces the time for assembling by a factor of 10, but solving becomes slower by a factor of 2.

Hence, my question is: Is it possible to benefit from multicore processing when solving a linear system, and if not, is it possible to restrict parallelization to the assembling process?


You should switch from a direct solver to an iterative solver if you would like to solve large systems efficiently; see for instance the links in: Fast computation of a large system of PDEs,
which mentions the usage of multigrid. Note that you should not treat a multigrid solver as a black box; its parameters and options should be adjusted to obtain optimal performance.
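As a concrete starting point, a sketch of how this looks in legacy dolfin (2019.x); the solver/preconditioner names and the values below are only illustrative, and 'hypre_amg' requires PETSc to have been built with hypre:

import dolfin as df

df.list_krylov_solver_methods()              # print what this build offers
df.list_krylov_solver_preconditioners()

solver = df.PETScKrylovSolver("cg", "hypre_amg")   # CG + hypre BoomerAMG
solver.parameters["relative_tolerance"] = 1e-8
solver.parameters["maximum_iterations"] = 500
solver.parameters["monitor_convergence"] = True    # watch the iteration counts

# Multigrid-specific knobs go through PETSc's option database, e.g.
# (the value is only a starting point, not a recommendation):
df.PETScOptions.set("pc_hypre_boomeramg_strong_threshold", 0.5)
solver.ksp().setFromOptions()   # make the underlying KSP pick the option up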


Hi Dokken, thank you for your quick answer. I must admit that I am new to parallelizing FEM computations.

I added a toy problem above and, as you can see now, I originally used an iterative solver. However, running this script with the mpirun command gives an error message; now I see that the preconditioner icc is the reason for that.

So, indeed, replacing the solve command line by

df.solve(A,u.vector(),b,'gmres')

and running the script via mpirun reduces the time for solving by a factor of 2.

Do you have any suggestions for further improvements? In particular, do you recommend the use of mpirun? Isn’t it possible to invoke parallel computation within the Python script itself in order to have better control over which parts to parallelize?

More generally, can you recommend any documentation on parallelizing FEniCS? (I found nothing, in fact.)


~90% of FEniCS works in parallel out of the box with mpirun, which is the suggested way of parallelizing code. There are a few bits and pieces, such as MultiMesh, that do not have parallel support.


I am completely new to mpirun. Wouldn’t it make more sense to invoke MPI within the Python script itself in order to have better control over which parts to parallelize? For instance, plots and prints are issued multiple times with mpirun…

You can easily use a combination of those. This is specific to the problem and code you would like to execute.

Oh, ok, so this is possible? With mpi4py, for instance?

Sorry for the stupid questions, but this stuff is completely new to me and I found it hard to find documentation on parallel fenics…

Please see the mpi4py tutorial
along with Parallel multiple tasks in class-based program

More explicit usage of mpi4py is done in dolfinx; see for instance: https://github.com/FEniCS/dolfinx/blob/master/python/demo/gmsh/demo_gmsh.py
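For the specific annoyance of repeated plots/prints: dolfin already runs on top of an mpi4py communicator, so you can guard rank-specific work yourself. A small sketch (legacy dolfin; mesh size and names are arbitrary):

import dolfin as df

comm = df.MPI.comm_world                 # the mpi4py communicator dolfin uses
rank = comm.rank

mesh = df.UnitSquareMesh(comm, 64, 64)   # the mesh is partitioned over all ranks
V = df.FunctionSpace(mesh, "CG", 1)

# Each process owns only a slice of the degrees of freedom
r0, r1 = V.dofmap().ownership_range()
print("rank %d owns dofs [%d, %d) of %d" % (rank, r0, r1, V.dim()))

# Anything that should happen only once (prints, plots, writing summary files, ...)
if rank == 0:
    print("running on %d processes" % comm.size)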


Thank you, dokken! :blush:

In your initial example you combine gmres with icc, which is a little bizarre considering you’ve chosen a generalised Krylov solver with a preconditioner that expects a Hermitian operator. Furthermore, PETSc’s implementation of incomplete Cholesky factorisation isn’t supported in parallel.

As @dokken says: direct factorisation (e.g. via mumps) will not scale optimally, but it’s my experience for 2D problems (\sim \mathcal{O}(n^{3/2})) that it scales rather well up to around 10M DoFs with fewer than 64 processes. A scalable solver for your (at a glance) symmetric problem would be conjugate gradient with some multigrid preconditioner (e.g. hypre’s BoomerAMG).
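For the toy script above that amounts to a one-line change (a sketch; 'hypre_amg' is the name dolfin uses for BoomerAMG and requires a PETSc build with hypre). A fuller version that also reuses the solver object follows below.

# CG + BoomerAMG instead of GMRES/ICC (needs PETSc built with hypre)
df.solve(A, u.vector(), b, 'cg', 'hypre_amg')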

Furthermore, DOLFIN’s solve() function is syntactic sugar for simple cases. It creates and destroys a PETSc KSP each time you solve the system, so your preconditioner/LU factorisation is not stored.

I recommend constructing a PETScKrylovSolver, setting the operator initially and then solving the system each time step. Typically PETSc detects whether the operator has changed and will only recompute the LU factorisation or preconditioner accordingly; however, you can force this with PETScKrylovSolver::set_reuse_preconditioner().
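A sketch of that pattern applied to the time loop of the toy script (legacy dolfin; the cg/hypre_amg choice is from above and the tolerance is only illustrative):

# Build the Krylov solver once, outside the time loop
solver = df.PETScKrylovSolver('cg', 'hypre_amg')
solver.set_operator(A)                    # A does not change between time steps
solver.set_reuse_preconditioner(True)     # keep the preconditioner across solves
solver.parameters['relative_tolerance'] = 1e-8

for l in range(N):
    t += dt
    b = df.assemble(L)                    # only the right-hand side changes
    bc.apply(b)
    solver.solve(u.vector(), b)           # operator and preconditioner are reused
    u_mm.assign(u_m)
    u_m.assign(u)
    f.t = t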


Hi Nate! Thank you for your suggestions which will surely turn out to be very helpful.

In fact, I chose icc with gmres since this combination simply yields the shortest computation time in my specific problem.

Reusing the same solver in each time step sounds very reasonable, but has had practically no impact on the computation time so far. I am not sure whether it is reusing the preconditioner or not.

Unfortunately, I am really having a hard time finding any up-to-date information on the whole topic (for FEniCS 2019.2). For instance, it seems to me that the Python solver interface has changed a lot (the interface described here doesn’t exist anymore). I can’t even figure out how to read and write the solver parameters. In particular, I can’t find any method to “set_reuse_preconditioner”.

A link to some example or to some interface documentation would be great.

solve is a horrendous function which hides a lot of the crucial steps of computing a FEM solution.

If you want to do anything beyond the absolute basics, I highly encourage anyone to avoid solve, LinearVariationalSolver and NonlinearVariationalSolver. Use DOLFIN to assemble the linear algebra problem and solve it using PETSc or another linear algebra package du jour. See for example the undocumented elasticity demo.
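For completeness, a minimal sketch of that hand-off, using the forms a, L, bc and the Function u from the toy script above (petsc4py ships with standard FEniCS installations; the cg/hypre choice is only an example):

import dolfin as df
from petsc4py import PETSc

# Assemble with DOLFIN as before ...
A, b = df.assemble_system(a, L, bc)

# ... then hand the underlying PETSc objects to a KSP you configure yourself
A_p = df.as_backend_type(A).mat()            # petsc4py Mat
b_p = df.as_backend_type(b).vec()            # petsc4py Vec
x_p = df.as_backend_type(u.vector()).vec()

ksp = PETSc.KSP().create(df.MPI.comm_world)
ksp.setOperators(A_p)
ksp.setType('cg')
ksp.getPC().setType('hypre')                 # BoomerAMG, if PETSc was built with hypre
ksp.setTolerances(rtol=1e-8)
ksp.setFromOptions()                         # honour -ksp_.../-pc_... command-line options
ksp.solve(b_p, x_p)

u.vector().apply('insert')                   # synchronise ghost values after the solve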

The python binding for PETScKrylovSolver::set_reuse_preconditioner exists as expected:

In [1]: from dolfin import *                                                    

In [2]: PETScKrylovSolver.set_reuse_preconditioner?                             
Call signature:  PETScKrylovSolver.set_reuse_preconditioner(*args, **kwargs)
Type:            instancemethod
String form:     <instancemethod set_reuse_preconditioner at 0x7f628d372f40>
Docstring:       set_reuse_preconditioner(self: dolfin.cpp.la.PETScKrylovSolver, arg0: bool) -> None
Class docstring:
instancemethod(function)

Bind a function to a class.

Nate, thank you for pointing these things out to me. This really helped a lot in reducing the computation time.


Let me summarize my results:

  1. Employing an iterative solver on the one hand and creating a PETScKrylovSolver object to be reused in each time step (including the set_reuse_preconditioner(True) option) on the other hand really gave me considerable speedup. In fact, even though my problem is symmetric and positive definite, gmres with ilu gives the best results (without mpi).

  2. WITHOUT mpirun, the solution involves more or less one of the 16 available threads (8 cores plus hyperthreading). WITH mpirun, those 16 threads are running at almost 100%, but the speedup is nevertheless quite modest, about 1.5. A closer analysis shows that assembling is much faster now, but solving is SLOWER.

So, does anyone have an idea why solving is becoming slower with mpirun than without? I am employing gmres with the default preconditioner (whatever that is). The default preconditioner turned out to give the best result of those available (ilu and icc seem not to work with mpirun).

EDIT: The PETSc docs point out that one reason might be that the number of iterations increases in parallel. However, monitoring the convergence shows that this is not the case here.
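In case it helps with the diagnosis: assuming the solution is computed with a PETScKrylovSolver object (called solver here, as in the sketches above), you can ask PETSc directly what the "default" preconditioner resolves to, and compare a serial run with an mpirun one:

# Inspect the solver PETSc actually built (run with and without mpirun and compare)
ksp = solver.ksp()                 # underlying petsc4py KSP
pc = ksp.getPC()
if df.MPI.comm_world.rank == 0:
    print("KSP type:", ksp.getType())
    print("PC type: ", pc.getType())
ksp.view()                         # detailed, collective description of the KSP and PC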