Not getting better performance in parallel

Hi all,

The following is a simplified code that solves a reaction-diffusion system. My main code is actually a reaction-diffusion-advection model, but I think I can get my point across with this toy problem. It is time dependent, but the code below focuses on only one time step.

from dolfin import *
import numpy as np
import time
import logging

#MPI parameters
comm = MPI.comm_world
num_processes = comm.Get_size()
this_process = comm.Get_rank()
###############################################################

#Solver configs
class Problem(NonlinearProblem):
    def __init__(self, J, F, bcs):
        self.bilinear_form = J
        self.linear_form = F
        self.bcs = bcs
        NonlinearProblem.__init__(self)

    def F(self, b, x):
        assemble(self.linear_form, tensor=b)
        for bc in self.bcs:
            bc.apply(b, x)

    def J(self, A, x):
        assemble(self.bilinear_form, tensor=A)
        for bc in self.bcs:
            bc.apply(A)
###############################################################
###############################################################
class CustomSolver1(NewtonSolver):
    def __init__(self):
        NewtonSolver.__init__(self, mesh.mpi_comm(),
                              PETScKrylovSolver(), PETScFactory.instance())

    def solver_setup(self, A, P, problem, iteration):
        self.linear_solver().set_operator(A)
        PETScOptions.set("ksp_max_it", 1000)
        PETScOptions.set("ksp_initial_guess_nonzero", "true")
        PETScOptions.set("ksp_type", "gmres")
        PETScOptions.set("ksp_monitor")
        PETScOptions.set("pc_type", "hypre")
        PETScOptions.set("pc_hypre_type", "euclid")

        self.linear_solver().set_from_options()
###############################################################

#Mesh
mesh = UnitSquareMesh(200,200)
###############################################################

# Auxiliary function spaces (V is only used below to report a dof count)
V = VectorFunctionSpace(mesh,'P',1)
S1 = FunctionSpace(mesh,'P',1)
###############################################################

#Defining Mixed Function Space
P1 = FiniteElement('P', triangle,1)
P2 = FiniteElement('P', triangle,2)
element1 = MixedElement([P1,P1,P1,P1])
Mixed_Space1 = FunctionSpace(mesh, element1)
#######################################################################

#Functions
U = Function(Mixed_Space1)
U_n = Function(Mixed_Space1)
Tn, Th, Tc, Tr = split(U)
Tn_n, Th_n, Tc_n, Tr_n = split(U_n)
v1, v2, v3, v4 = TestFunctions(Mixed_Space1)
#######################################################################

#Dimension of the space
d = V.dim()
###############################################################

#PDE Parameters dimensional
D_Tn, D_Th, D_Tc, D_Tr = 4.32e-6, 4.32e-6, 4.32e-6, 4.32e-6
kappa_Tn,kappa_Th, kappa_Tc, kappa_Tr = 2, 2, 2, 2
##############################################################

#Initial conditions
Tn_0 = Expression('Init1', degree=0, Init1=3)
Th_0 = Expression('Init2', degree=0, Init2=2)
Tc_0 = Expression('Init3', degree=0, Init3=1)
Tr_0 = Expression('Init4', degree=0, Init4=1)
###############################################################

#Interpolating ICs
Tn0 = interpolate(Tn_0, Mixed_Space1.sub(0).collapse())
Th0 = interpolate(Th_0, Mixed_Space1.sub(1).collapse())
Tc0 = interpolate(Tc_0, Mixed_Space1.sub(2).collapse())
Tr0 = interpolate(Tr_0, Mixed_Space1.sub(3).collapse())
###############################################################

#Assigning ICs
assign(U_n.sub(0),Tn0)
assign(U_n.sub(1),Th0)
assign(U_n.sub(2),Tc0)
assign(U_n.sub(3),Tr0)
##############################################################

# Create XDMF files for visualization output
vtkfile_1 = XDMFFile(MPI.comm_world,"reaction_system/Tn.xdmf")
vtkfile_1.parameters["flush_output"] = True
vtkfile_2 = XDMFFile(MPI.comm_world,"reaction_system/Th.xdmf")
vtkfile_2.parameters["flush_output"] = True
vtkfile_3 = XDMFFile(MPI.comm_world,"reaction_system/Tc.xdmf")
vtkfile_3.parameters["flush_output"] = True
vtkfile_4 = XDMFFile(MPI.comm_world,"reaction_system/Tr.xdmf")
vtkfile_4.parameters["flush_output"] = True
##############################################################

k = Constant(0.01)
t=0.0

F1 = ((Tn-Tn_n)/k)*v1*dx+D_Tn*dot(grad(Tn),grad(v1))*dx-Constant(1)*v1*dx+Constant(1)*Tn*v1*dx+Constant(0.1)*Th*Tn*v1*dx+Constant(0.2)*Th*Tn*v1*dx+Constant(0.01)*Tn*v1*dx\
+ ((Th-Th_n)/k)*v2*dx+D_Th*dot(grad(Th),grad(v2))*dx+(Constant(0.5)*Tr+Constant(1))*Th*v2*dx\
+ ((Tc-Tc_n)/k)*v3*dx+D_Tc*dot(grad(Tc),grad(v3))*dx-Constant(0.1)*Th*Tn*v3*dx+(Constant(0.2)*Tr+Constant(0.01))*Tc*v3*dx\
+ ((Tr-Tr_n)/k)*v4*dx+D_Tr*dot(grad(Tr),grad(v4))*dx-Constant(0.2)*Th*Tn*v4*dx+Constant(0.7)*Tr*v4*dx
#######################################################################



#Saving
Tn0.rename('Tn_','Tn_')
Th0.rename('Th_','Th_')
Tc0.rename('Tc_','Tc_')
Tr0.rename('Tr_','Tr_')
vtkfile_1.write(Tn0,t)
vtkfile_2.write(Th0,t)
vtkfile_3.write(Tc0,t)
vtkfile_4.write(Tr0,t)
##############################################################

#######################################################################
J1 = derivative(F1,U)
problem1 = Problem(J1,F1,[])
custom_solver1 = CustomSolver1()
custom_solver1.solve(problem1, U.vector())
##############################################################


Tn_, Th_, Tc_, Tr_= U.split()
##############################################################


#Renaming for animation plots
Tn_.rename('Tn_','Tn_')
Th_.rename('Th_','Th_')
Tc_.rename('Tc_','Tc_')
Tr_.rename('Tr_','Tr_')
#######################################################################

#######################################################################
vtkfile_1.write(Tn_,t)
vtkfile_2.write(Th_,t)
vtkfile_3.write(Tc_,t)
vtkfile_4.write(Tr_,t)
#######################################################################

print('Number of processes used',num_processes)
list_timings(TimingClear.clear, [TimingType.wall])

Now, when I run this in serial I get this:

[MPI_AVG] Summary of timings | reps wall avg wall tot

Apply (PETScMatrix) | 3 0.0016835 0.0050504
Apply (PETScVector) | 37 1.2478e-05 0.0004617
Assemble cells | 7 0.23717 1.6602
Build sparsity | 2 0.061506 0.12301
Compute SCOTCH graph re-ordering | 7 0.0037814 0.02647
Compute connectivity 1-2 | 1 0.0073968 0.0073968
Compute entities dim = 1 | 1 0.039101 0.039101
Delete sparsity | 2 3.9e-06 7.8e-06
DistributedMeshTools: reorder vertex values | 4 0.0036773 0.014709
Init dof vector | 6 0.0028735 0.017241
Init dofmap | 7 0.041515 0.29061
Init dofmap from UFC dofmap | 7 0.009452 0.066164
Init tensor | 2 0.0048962 0.0097925
Number distributed mesh entities | 7 2.5286e-06 1.77e-05
PETSc Krylov solver | 3 0.82722 2.4817
SCOTCH: call SCOTCH_graphBuild | 7 4.1171e-05 0.0002882
SCOTCH: call SCOTCH_graphOrder | 7 0.0028291 0.019804

But when I run it with “mpirun -n 8” I get:

Process 0: [MPI_AVG] Summary of timings | reps wall avg wall tot

Apply (PETScMatrix) | 3 0.010986 0.032957
Apply (PETScVector) | 37 0.00030367 0.011236
Assemble cells | 7 0.048845 0.34191
Build LocalMeshData from local Mesh | 1 0.014725 0.014725
Build distributed mesh from local mesh data | 1 0.11427 0.11427
Build local part of distributed mesh (from local mesh data) | 1 0.0048257 0.0048257
Build sparsity | 2 0.035817 0.071635
Compute SCOTCH graph re-ordering | 7 0.00056693 0.0039685
Compute connectivity 1-2 | 1 0.0010896 0.0010896
Compute entities dim = 1 | 1 0.0078631 0.0078631
Compute graph partition (SCOTCH) | 1 0.058645 0.058645
Compute local part of mesh dual graph | 1 0.0074662 0.0074662
Compute mesh entity ownership | 1 0.0025741 0.0025741
Compute non-local part of mesh dual graph | 1 0.00052979 0.00052979
Delete sparsity | 2 5.1437e-06 1.0287e-05
Distribute cells | 1 0.0041209 0.0041209
Distribute mesh (cells and vertices) | 1 0.011897 0.011897
Distribute vertices | 1 0.0052324 0.0052324
DistributedMeshTools: reorder vertex values | 12 0.002893 0.034716
Extract partition boundaries from SCOTCH graph | 1 0.00020005 0.00020005
Get SCOTCH graph data | 1 4.6375e-06 4.6375e-06
Init dof vector | 6 0.0016481 0.0098885
Init dofmap | 7 0.010739 0.07517
Init dofmap from UFC dofmap | 7 0.0024653 0.017257
Init tensor | 2 0.0021027 0.0042053
Number distributed mesh entities | 8 0.0017322 0.013857
Number mesh entities for distributed mesh (for specified vertex ids) | 1 0.011717 0.011717
PETSc Krylov solver | 3 3.0889 9.2666
SCOTCH: call SCOTCH_dgraphBuild | 1 4.915e-05 4.915e-05
SCOTCH: call SCOTCH_dgraphHalo | 1 0.0004909 0.0004909
SCOTCH: call SCOTCH_dgraphPart | 1 0.057151 0.057151
SCOTCH: call SCOTCH_graphBuild | 7 1.2593e-05 8.815e-05
SCOTCH: call SCOTCH_graphOrder | 7 0.00036042 0.0025229

My computer has 8 logical processors and I am using “FEniCS/2019.2.0.dev0”. I do not understand why the PETSc Krylov solver is slower in parallel. I believe I am using an appropriate solver and preconditioner. Could you please help me with this?

Best,

The problem is that you are running too small a problem to see any benefit from using multiple processes. Since parallel computing requires communication, it is a balancing act to find how many dofs each process should own.
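
As a quick sanity check, you can print how many dofs each process actually owns. This is only a minimal sketch that reuses Mixed_Space1 and U from your script:

total_dofs = Mixed_Space1.dim()        # global number of dofs
local_dofs = U.vector().local_size()   # dofs owned by this rank
print("rank", MPI.rank(MPI.comm_world), "owns", local_dofs, "of", total_dofs, "dofs")

With the 200x200 mesh and four P1 fields this gives roughly 160,000 dofs in total, i.e. only about 20,000 per process on 8 processes, which is very little work per process compared to the communication involved.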

For instance, if you increase the number of dofs in your problem by using a 500x500 unit square mesh (about 1 million dofs in your mixed space), the average time over 10 Krylov solves is 9.96 seconds in serial.
Using 2 processes it drops to about 9.09 seconds.
However, increasing to 3 processes the time goes back up to 9.6 seconds, and with 4 processes it is 9.5 seconds.

So let's do the same with N=1000 (~4 million dofs):

  • Serial: 39.657
  • 2 processes: 29.05
  • 3 processes: 25.4
  • 4 processes: 24.33

Attached are the full timings for this case, for serial and for 2 processes:

Serial:

[MPI_AVG] Summary of timings                 |  reps    wall avg    wall tot
----------------------------------------------------------------------------
Apply (PETScMatrix)                          |     3    0.026543    0.079629
Apply (PETScVector)                          |    37  7.6436e-06  0.00028281
Assemble cells                               |     7      3.3737      23.616
Build sparsity                               |     2     0.92588      1.8518
Compute SCOTCH graph re-ordering             |     7     0.14492      1.0145
Compute connectivity 1-2                     |     1     0.12209     0.12209
Compute entities dim = 1                     |     1     0.73784     0.73784
Delete sparsity                              |     2  1.1605e-06   2.321e-06
DistributedMeshTools: reorder vertex values  |     4    0.055076      0.2203
Init dof vector                              |     6    0.033172     0.19903
Init dofmap                                  |     7     0.76472       5.353
Init dofmap from UFC dofmap                  |     7     0.14358      1.0051
Init tensor                                  |     2    0.089218     0.17844
Number distributed mesh entities             |     7  5.2843e-07   3.699e-06
PETSc Krylov solver                          |     3      13.185      39.554
SCOTCH: call SCOTCH_graphBuild               |     7  0.00049532   0.0034673
SCOTCH: call SCOTCH_graphOrder               |     7      0.1243     0.87011

2 Processes

Process 0: [MPI_AVG] Summary of timings               |  reps    wall avg    wall tot
-----------------------------------------------------------------------------------------------------
Apply (PETScMatrix)                                                   |     3     0.02795     0.08385
Apply (PETScVector)                                                   |    37    0.002464    0.091168
Assemble cells                                                        |     7        1.82       12.74
Build LocalMeshData from local Mesh                                   |     1     0.17548     0.17548
Build distributed mesh from local mesh data                           |     1      2.6436      2.6436
Build local part of distributed mesh (from local mesh data)           |     1     0.27132     0.27132
Build sparsity                                                        |     2     0.94024      1.8805
Compute SCOTCH graph re-ordering                                      |     7    0.048955     0.34269
Compute connectivity 1-2                                              |     1    0.055547    0.055547
Compute entities dim = 1                                              |     1     0.37567     0.37567
Compute graph partition (SCOTCH)                                      |     1     0.29117     0.29117
Compute local part of mesh dual graph                                 |     1     0.30689     0.30689
Compute mesh entity ownership                                         |     1    0.072827    0.072827
Compute non-local part of mesh dual graph                             |     1   0.0032135   0.0032135
Delete sparsity                                                       |     2   1.338e-06   2.676e-06
Distribute cells                                                      |     1    0.099223    0.099223
Distribute mesh (cells and vertices)                                  |     1      0.7537      0.7537
Distribute vertices                                                   |     1     0.25513     0.25513
DistributedMeshTools: reorder vertex values                           |    12    0.033846     0.40615
Extract partition boundaries from SCOTCH graph                        |     1   0.0048294   0.0048294
Get SCOTCH graph data                                                 |     1   1.216e-06   1.216e-06
Init dof vector                                                       |     6    0.020221     0.12132
Init dofmap                                                           |     7     0.32877      2.3014
Init dofmap from UFC dofmap                                           |     7    0.071951     0.50365
Init tensor                                                           |     2    0.054538     0.10908
Number distributed mesh entities                                      |     8    0.065094     0.52075
Number mesh entities for distributed mesh (for specified vertex ids)  |     1     0.51871     0.51871
PETSc Krylov solver                                                   |     3      9.6925      29.077
SCOTCH: call SCOTCH_dgraphBuild                                       |     1  0.00071204  0.00071204
SCOTCH: call SCOTCH_dgraphHalo                                        |     1    0.006222    0.006222
SCOTCH: call SCOTCH_dgraphPart                                        |     1     0.27858     0.27858
SCOTCH: call SCOTCH_graphBuild                                        |     7  0.00026688   0.0018682
SCOTCH: call SCOTCH_graphOrder                                        |     7    0.040955     0.28669

Here you can observe that “Assemble cells” is almost perfectly halved, from 23.616 to 12.74 seconds, while the PETSc Krylov solver goes from 39 to 29 seconds, which could likely be improved further by tuning the solver.
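
For reference, a scaling study like this can be scripted along the following lines. This is only a sketch: the resolution N, the repeat loop, and the timing code are mine, while everything else is taken from your script:

import time

N = 1000                         # mesh resolution (500 gives ~1M dofs, 1000 gives ~4M)
mesh = UnitSquareMesh(N, N)      # replaces the 200x200 mesh; the rest of the script is unchanged

t0 = time.time()
for _ in range(10):              # repeat the solve to get a stable average
    U.vector().zero()            # reset the initial guess so every solve does real work
    custom_solver1.solve(problem1, U.vector())
wall = time.time() - t0

wall_max = MPI.max(MPI.comm_world, wall)   # the slowest rank limits parallel performance
if MPI.rank(MPI.comm_world) == 0:
    print("Average solve time:", wall_max / 10)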

Thanks for the reply. I have two follow-up questions. Does increasing the degree of the vector space elements have the same effect as refining the mesh here? And what about increasing the number of equations in the system?

There will be different effects for each of the changes you propose. However, all of them should make the problem better suited to parallel computation, since increasing the degree of the function space, like refining the mesh or adding equations, increases the number of dofs (the size of the matrix).
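
For example, you can compare the resulting problem sizes directly. This is a small sketch reusing the mesh and element definitions from your script:

P1 = FiniteElement('P', triangle, 1)
P2 = FiniteElement('P', triangle, 2)

print(FunctionSpace(mesh, MixedElement([P1, P1, P1, P1])).dim())  # your current setup
print(FunctionSpace(mesh, MixedElement([P2, P2, P2, P2])).dim())  # higher degree: ~4x the dofs
print(FunctionSpace(mesh, MixedElement([P1]*8)).dim())            # twice as many equations: 2x the dofs

Both routes increase the matrix size, but they lead to different sparsity patterns, so the solver behaviour will not be identical.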

As @dokken said, this problem is very small. There is a large overhead in constructing the incomplete LU factorisation, and it also won't scale particularly well, unlike the $\sim \mathcal{O}(n)$ scaling of BoomerAMG. If you reuse the factorisation as an approximate preconditioner, you'll probably see better results. Don't forget to reconstruct the preconditioner if the number of Krylov iterations starts growing and getting out of hand.
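
For reference, trying BoomerAMG instead of Euclid only requires changing the hypre option in your custom solver. The following is a sketch mirroring your CustomSolver1, with the monitoring and initial-guess options left out for brevity:

#Same structure as CustomSolver1, but preconditioned with hypre BoomerAMG
class CustomSolverAMG(NewtonSolver):
    def __init__(self):
        NewtonSolver.__init__(self, mesh.mpi_comm(),
                              PETScKrylovSolver(), PETScFactory.instance())

    def solver_setup(self, A, P, problem, iteration):
        self.linear_solver().set_operator(A)
        PETScOptions.set("ksp_type", "gmres")
        PETScOptions.set("ksp_max_it", 1000)
        PETScOptions.set("pc_type", "hypre")
        PETScOptions.set("pc_hypre_type", "boomeramg")  # instead of "euclid"
        self.linear_solver().set_from_options()

It is used exactly like CustomSolver1, i.e. CustomSolverAMG().solve(problem1, U.vector()).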

The dominant portions for me on 1 vs 4 processes:

np = 1
Assemble cells        |     7     0.18876      1.3213
Init dofmap           |     7     0.03306     0.23142
PETSc Krylov solver   |     3     0.28922     0.86766
total                 |     1      2.9261      2.9261
np = 4
Assemble cells        |     7    0.042934     0.30054
Init dofmap           |     7   0.0082954    0.058068
PETSc Krylov solver   |     3     0.29978     0.89933
total                 |     1        1.52        1.52

Also, employing hyperthreading (using logical rather than physical cores) is typically detrimental to performance. Still, you should be concerned when the solve time grows as much with an increasing number of cores as it does in your case.


Thank you, Nate, for the reply. The things you mentioned are all new to me, and I cannot find anywhere how to apply them. What lines would you add to my code to make sure all of your suggestions are included? For the bigger problem I have, the Krylov iterations do get out of hand, and I never thought I could fix that.

Dear @nate and @dokken, I figured out how to reuse the preconditioner. It is done by simply adding the line
PETScOptions.set("ksp_reuse_preconditioner", "true")

to my solver_setup. What I couldn't find was how to reconstruct the preconditioner in my code. It sounds somewhat counterintuitive to me: if I am reusing the preconditioner, why would I want to reconstruct it? Doesn't that mean I am no longer reusing it? Is this reconstruction suggested for time-evolving problems (my main code actually evolves in time)? Could you tell me what the syntax is and how I can include it in my code above? Thank you so much for your help.
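
For reference, this is where the option currently sits in my solver_setup, together with my guess at how one might force a rebuild at the start of each nonlinear solve. This is only a sketch based on the iteration argument, and I am not sure it is the intended approach:

class CustomSolver1(NewtonSolver):
    def __init__(self):
        NewtonSolver.__init__(self, mesh.mpi_comm(),
                              PETScKrylovSolver(), PETScFactory.instance())

    def solver_setup(self, A, P, problem, iteration):
        self.linear_solver().set_operator(A)
        PETScOptions.set("ksp_type", "gmres")
        PETScOptions.set("ksp_max_it", 1000)
        PETScOptions.set("pc_type", "hypre")
        PETScOptions.set("pc_hypre_type", "euclid")
        # Guess: rebuild the preconditioner on the first Newton iteration of
        # each solve (e.g. each time step) and reuse it afterwards.
        if iteration == 0:
            PETScOptions.set("ksp_reuse_preconditioner", "false")
        else:
            PETScOptions.set("ksp_reuse_preconditioner", "true")
        self.linear_solver().set_from_options()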