Multiprocessing FEniCSx making simulation slower

Hello, I am currently working on a code to simulate the incompressible Navier-Stokes equations. The definition of the problem can be found at this topic. Only the geometry has changed: there are three cylinders instead of one. The problem I face concerns multiprocessing. I should say that I tested the cylinder tutorial and multiprocessing works perfectly there, clearly speeding up the simulation.

With my code, multiprocessing does not throw an error, but it makes the computation several times slower, and I don’t know why. I am clearly missing something about the use of multiprocessing in FEniCSx, so I put my code below. It is a bit long, but it contains everything (parameters, mesh generation, etc.) and it is spaced out and well commented. The code can be launched directly with the command mpirun -np n_proc python my_script.py, where n_proc is the number of processes you want to run in parallel and my_script.py is the Python file into which you paste my code (written below), if you have time to try it.

My code is divided into 6 parts:

  1. All the imports required.

  2. A class FluidDynamics_env that represents my fluid environment; it will serve as a base to build a Gym environment later. It has three main methods:

    • an __init__ method that defines the parameters of my 2D flow simulation, along with the element types and the function definitions (mixed-element function, velocity-pressure).
    • an init_unsteady_ns method that defines the problem to solve, that is to say the discretization scheme (described at this topic).
    • a one_time_step method that, as the name suggests, calculates the solution one time step further. It is also where the solver is defined.
  3. The definition of classes for boundary conditions. Only Dirichlet conditions:

    • for velocity: no slip at the walls and on the cylinders (obstacles, which can rotate around their respective axes), and a Poiseuille velocity profile at the inlet.
    • for pressure: zero at the outlet.
  4. The mesh creation, well commented. Two lines are commented out; you can uncomment them to see the mesh in a GUI.

  5. Definition of the boundary conditions using classes of part 3.

  6. The parameters of my simulation (gathered in a dictionary named parameters), the instantiation of the FluidDynamics_env object (whose class is defined in part 2), and the flow calculation in a for loop; the calculation time is printed.

My script generates an XDMF file called solver_non_linear_test.xdmf. You can easily view the result in ParaView. I set the number of steps num_steps to 10 in part 6 of my code so that it runs fast. With one process it runs in about 3 seconds for me, but it takes 3 times longer when running with 10 processes.

So my question is: how can I fix this? The code is written just below, with the parts indicated.

############################################
#-------------- 1 Imports -----------------#
############################################
import gmsh
import os
import numpy as np
import matplotlib.pyplot as plt
import tqdm.autonotebook
from mpi4py import MPI
from petsc4py import PETSc
from dolfinx.cpp.mesh import to_type, cell_entity_type
from dolfinx.fem import (Constant, Function, FunctionSpace, VectorFunctionSpace, assemble_scalar, 
	dirichletbc, form, locate_dofs_topological, Expression)
from dolfinx.fem.petsc import (apply_lifting, assemble_matrix, assemble_vector, create_vector, create_matrix, set_bc, LinearProblem, NonlinearProblem)
from dolfinx.graph import create_adjacencylist
from dolfinx.geometry import BoundingBoxTree, compute_collisions, compute_colliding_cells
from dolfinx.io import (VTXWriter, distribute_entity_data, gmshio, XDMFFile)
from dolfinx.mesh import create_mesh, meshtags_from_entities
from ufl import (FacetNormal, FiniteElement, Identity, Measure, TestFunction, TestFunctions, TrialFunction, VectorElement, MixedElement, as_vector, div, dot, ds, dx, inner, lhs, grad, nabla_grad, rhs, sym, curl, derivative, split)
from dolfinx.nls.petsc import NewtonSolver
from os import path, mkdir
from dolfinx import log
import gym
import decimal
import sys
import math
import time

mesh_comm = MPI.COMM_WORLD

############################################
#--------- 2 Custom solver ----------------#
############################################
class FluidDynamics_env():
	"""
	Class that will be used to start the simulation by initializing everything (__init__ and init_unsteady_ns methods),
	each time step will be performed by one_time_step method.
	"""
	def __init__(self, parameters):
		"""
		Constructor method. We retrieve the parameters of the simulation, make them instance attributes,
		create a mixed function space for velocity and pressure, and specify the boundary conditions. The
		output file for the exported data is also opened here.
		
		Parameters
		----------
		parameters : dict
			dictionary that contains parameters of our simulation.
		"""
		
		# retrieve the parameters dictionary and its features
		self.parameters = parameters
		[self.dt, self.num_steps, self.nu, self.BDF, self.mesh, self.ft]\
		 = [parameters[key] for key in ["dt", "num_steps", "nu", "BDF", "mesh", "ft"]]

		# time initialization (beware of python's addition!)
		d = decimal.Decimal(str(self.dt))
		self.n_decimal = abs(d.as_tuple().exponent)
		self.t = 0.0

		# time at which the simulation will stop
		self.T = int(self.dt * self.num_steps)

		# creating the space on which the solution will be defined, here it is Taylor-Hood element P2-P1.
		velocity_element = VectorElement("Lagrange", self.mesh.ufl_cell(), 2)
		pressure_element = FiniteElement("Lagrange", self.mesh.ufl_cell(), 1)
		W = FunctionSpace(self.mesh, MixedElement([velocity_element, pressure_element]))
		self.velocity_element = velocity_element
		self.pressure_element = pressure_element
		self.W = W

		# boundary conditions
		bcs = []
		fdim = self.mesh.topology.dim - 1
		for bc in parameters["bcs"]:
			i = {"velocity":0, "pressure":1}[bc[0]]
			boundary_facets = self.ft.find(bc[2])
			boundary_dofs = locate_dofs_topological((W.sub(i), W.sub(i).collapse()[0]), self.mesh.topology.dim-1, boundary_facets)
			user_defined_bc = Function(W.sub(i).collapse()[0])
			user_defined_bc.interpolate(bc[1])
			bcs.append(dirichletbc(user_defined_bc, boundary_dofs, W.sub(i)))
		self.bcs = bcs

		# relaxation constant (see discretization scheme)
		self.relax = Constant(self.mesh, PETSc.ScalarType(1e-6))

		# initialization of the number of steps
		self.n_steps = 0


		# writing solutions to file
		self.xdmf = XDMFFile(mesh_comm, "solver_non_linear_test.xdmf", "w")
		self.xdmf.write_mesh(self.mesh)

	# definition of the assign function (not yet implemented in dolfinx)
	def custom_assign(self, copy_func, original_func):
		"""
		Function to copy the nodal values from a function to another.

		Parameters
		----------
		copy_func : Function
			function that will receive the nodal values of original_func

		original_func : Function
			function from which nodal values are copied into copy_func
		"""
		copy_func.x.array[:] = original_func.x.array


	def write_function_to_file(self, file, func, *names):
		"""
		Method to write the solution. Since w lives in a mixed function space, we name each sub-function
		(velocity and pressure) and write them to a file for visualization in ParaView.

		Parameters
		----------
		file : XDMFFile
			file object to which the solutions will be written
		func : Function
			function to write in file
		names : list of str
			names of the functions (in our case "velocity" and "pressure")
		"""
		if len(names) > 1:
			for i, name in enumerate(names):
				sub_func = func.sub(i)
				sub_func.name = name
				file.write_function(sub_func, self.t)
		else:
			func.name = names[0]
			file.write_function(func, self.t)

	def init_unsteady_ns(self):
		"""
		Initialization of the unsteady, incompressible Navier-Stokes problem. The temporal scheme (BDF, for
		Backward Differentiation Formula) is defined (BDF1 is used for the first time step, then BDF2).
		Initial conditions are set and the discretization scheme F (without the BDF terms) is defined.
		"""

		# initiate finite element test functions, v for the velocity field and q for the pressure
		v, q = TestFunctions(self.W)

		# initiate solution vector, it is a function of a mixed function space, the first one is for velocity field
		# and the second for pressure field (u on V and p on Q respectively, with W:=V*Q)
		self.w = Function(self.W)
		u, p = split(self.w)
		
		# only to visualize first step in file as it is self.w that will be written
		set_bc(self.w.vector, self.bcs)
		
		# write solution on step 0 (initial solution given by boundary conditions and zero everywhere else)
		if self.n_steps == 0:
			self.write_function_to_file(self.xdmf, self.w, "velocity", "pressure")

		# definition of the temporal scheme (here a BDF). A little trick is used to create an adaptive temporal
		# scheme, meaning that at step 1 BDF1 is used, and after that BDF2 is used. To better understand, see
		# the one_time_step method.
		if self.BDF >= 1: # BDF1
			self.w_prev1 = Function(self.W)
			u_prev1, p_prev1 = split(self.w_prev1)
			temporal_scheme_list = [inner(u - u_prev1, v)]
			assign_list = ['self.custom_assign(self.w_prev1, self.w)']
		if self.BDF >= 2: #BDF2
			self.w_prev2 = Function(self.W)
			u_prev2, p_prev2 = split(self.w_prev2)
			temporal_scheme_list.append(inner(Constant(self.mesh, PETSc.ScalarType(1.5)) * u\
										  - Constant(self.mesh, PETSc.ScalarType(2.0)) * u_prev1\
										  + Constant(self.mesh, PETSc.ScalarType(0.5)) * u_prev2, v))
			assign_list.append('self.custom_assign(self.w_prev2, self.w_prev1)')
		self.temporal_scheme_list = temporal_scheme_list
		assign_list.reverse()
		self.assign_list = assign_list

		# definition of the discretization scheme (without BDF)
		self.F = (inner(dot(grad(u), u_prev1), v) + inner(dot(grad(u_prev1), u), v) - inner(dot(grad(u_prev1), u_prev1), v)\
			+ self.nu * inner(grad(u), grad(v))\
			- p * div(v)\
			- q * div(u)\
			- self.relax * p * q) * self.dt * dx

	def one_time_step(self):
		"""
		Method that performs one time step forward (time step is self.dt)
		For the first time step, BDF1 is used as there is only one solution known: the initial condition.
		"""
		if self.n_steps < self.BDF:
			printr0("changing solver")
			if self.n_steps == 0:
				self.init_unsteady_ns()
				set_bc(self.w_prev1.vector, self.bcs)

			# defining the temporal scheme for the first steps of the simulation
			temporal_scheme = self.temporal_scheme_list[self.n_steps]

			# combine the BDF (temporal scheme) and F to form the complete discretization scheme
			F_BDF = temporal_scheme * dx + self.F

			# defining the problem
			problem = NonlinearProblem(F_BDF, self.w, bcs=self.bcs)
			self.solver = NewtonSolver(mesh_comm, problem)
			# ksp = self.solver.krylov_solver
			# opts = PETSc.Options()
			# option_prefix = ksp.getOptionsPrefix()
			# opts[f"{option_prefix}ksp_type"] = "preonly"
			# opts[f"{option_prefix}pc_type"] = "lu"
			# opts[f"{option_prefix}pc_factor_mat_solver_type"] = "mumps"
			# ksp.setFromOptions()

		# incrementing the time by one time step
		self.t = round(self.t + self.dt, self.n_decimal)

		# incrementing the number of steps
		self.n_steps += 1



		# now that w_prev1 and w_prev2 are defined (thanks to the previous assignment) as the solutions at
		# the 1 and 2 previous time steps, we can use the discretization scheme to solve the problem for the
		# unknown w!
		tic = time.perf_counter()
		n, converged = self.solver.solve(self.w)
		assert(converged)
		toc = time.perf_counter()
		printr0("solving took {} seconds".format(toc - tic))


		# now the assignation takes place (w_prev1 becomes w_prev2 and w becomes w_prev1)
		tic = time.perf_counter()
		for assignation in self.assign_list:
			eval(assignation)
		toc = time.perf_counter()
		printr0("assigning took {} seconds".format(toc - tic))

		# here the little trick previously seen in init_unsteady_ns is used. For the first time step BDF1 is used,
		# the unknown is w, and the known solution is w_prev1: the initial condition (velocity and pressure equal
		# zero), this is done by evaluating the first element of assign_list. For the other steps, BDF2 is used,
		# the unknown is w, and the two previously known solutions are w_prev1 and w_prev2.
		tic = time.perf_counter()
		self.write_function_to_file(self.xdmf, self.w, "velocity", "pressure")
		toc = time.perf_counter()
		printr0("write to file took {} seconds".format(toc - tic))
		


############################################
#---- 3 Classes for Boundary conditions ---#
############################################
class Inlet_Velocity():
	def __init__(self, U_0, H, gdim):
		"""
		Parameters
		----------
			U_0 : float
				max velocity of the Poiseuille profile
			H : float
				half-height of the channel (cross-stream direction)
			gdim : int
				dimension of the velocity field (here 2)
		"""
		self.U_0 = U_0
		self.H = H
		self.gdim = gdim

	def __call__(self, x):
		"""
		Parameters
		----------
		x : numpy.ndarray
			coordinates at which the boundary condition will be evaluated

		Returns
		-------
		numpy.ndarray
			velocity vector at coordinate x
		"""
		values = np.zeros((self.gdim, x.shape[1]), dtype=PETSc.ScalarType)
		values[0] = self.U_0*(self.H + x[1])*(self.H - x[1])/(self.H**2)
		return values


# Cylinder velocity profile
class Obstacle_Velocity():
	def __init__(self, Omega, center, gdim):
		"""
		Parameters
		----------
		center : list
			coordinates of the center of the cylinder
		gdim : int
			dimension of the velocity field (here 2)
		Omega : float
			rotation velocity
		"""
		self.center = center
		self.Omega = Omega
		self.gdim = gdim

	def __call__(self, x):
		"""
		Parameters
		----------
		x : numpy.ndarray
			coordinates at which the boundary condition will be evaluated

		Returns
		-------
		numpy.ndarray
			velocity vector at coordinate x
		"""
		values = np.zeros((self.gdim, x.shape[1]), dtype=PETSc.ScalarType)
		values[0] = - self.Omega * (x[1] - self.center[1])
		values[1] =   self.Omega * (x[0] - self.center[0]) 
		return values

# Constant condition
class Constant_Dirichlet():
	def __init__(self, *values):
		"""
		Parameters
		----------
		values : float or list of floats
			values of the field at the locations of the boundary condition
		"""
		self.values = values

	def __call__(self, x):
		"""
		Parameters
		----------
		x : numpy.ndarray
			coordinates at which the boundary condition will be evaluated

		Returns
		-------
		numpy.ndarray
			field vector at coordinate x
		"""
		values = np.zeros((len(self.values), x.shape[1]), dtype=PETSc.ScalarType)
		for i, value in enumerate(self.values):
			values[i] = value
		return values


############################################
#---------- 4 Mesh Creation ---------------#
############################################

wall_marker, inlet_marker, outlet_marker = range(1, 4)
obstacle_markers = list(range(4, 7))

### LOADING MESH AND BOUNDARY MARKERS

## Once the mesh has been generated with gmsh (below), it is converted, together with the corresponding
## facet markers, into DOLFINx objects. See https://jsdokken.com/src/tutorial_gmsh.html for explanations.
## Alternatively, the written .msh file could be read back in with:
# mesh, cell_tags, facet_tags = gmshio.read_from_msh("pinball_mesh.msh", mesh_comm, 0, gdim=gdim)

gdim = 2



def printr0(*message):
	"""
	Function to print a message only for the processor with rank 0

	Parameters
	----------
	message :
		the message to print
	"""
	if mesh_comm.rank == 0:
		print(*message)

# Timing how long the mesh generation takes
tic = time.perf_counter()

gmsh.initialize()

r = 0.5 # radius of the 3 cylinders
c_x, c_y = [-3*math.sqrt(3)*r/2, 0, 0], [0, 3*r/2,-3*r/2] # coordinates of the centers of the three cylinders (front, top and bottom)

model_rank = 0

if mesh_comm.rank == model_rank:
	gmsh.model.add("fluidic_pinball")

# target mesh size for any point
lc = 1

if mesh_comm.rank == model_rank:
	## 0 dim : Points definition
	# Rectangle delimiting the flow
	gmsh.model.geo.addPoint(-12*r, -12*r, 0, lc, 1)
	gmsh.model.geo.addPoint(40*r, -12*r, 0, lc, 2)
	gmsh.model.geo.addPoint(40*r, 12*r, 0, lc, 3)
	gmsh.model.geo.addPoint(-12*r, 12*r, 0, lc, 4)

	# Points to create the circles for the three cylinders (4 points for each cylinder)
	for i in range(3): 
		gmsh.model.geo.addPoint(c_x[i], c_y[i], 0, lc, 5 + 5*i) # center
		gmsh.model.geo.addPoint(c_x[i] + r, c_y[i], 0, lc, 6 + 5*i)
		gmsh.model.geo.addPoint(c_x[i], c_y[i] + r, 0, lc, 7 + 5*i)
		gmsh.model.geo.addPoint(c_x[i] - r, c_y[i], 0, lc, 8 + 5*i)
		gmsh.model.geo.addPoint(c_x[i], c_y[i] - r, 0, lc, 9 + 5*i)

	# Points to create the polygon (pentagon in this case) that will include the cylinders and the wake
	# In that pentagon, the mesh will be refined
	gmsh.model.geo.addPoint(c_x[0] - 2*r, c_y[0], 0, lc, 20)
	gmsh.model.geo.addPoint(c_x[1] - 2*r, c_y[1] + 2*r, 0, lc, 21)
	gmsh.model.geo.addPoint(40*r - 4*r, c_y[1] + 2*r, 0, lc, 22)
	gmsh.model.geo.addPoint(40*r - 4*r, c_y[2] - 2*r, 0, lc, 23)
	gmsh.model.geo.addPoint(c_x[2] - 2*r, c_y[2] - 2*r, 0, lc, 24)

	## 1 dim: Lines and Curves
	# Lines for the rectangle
	gmsh.model.geo.addLine(1, 2, 1)
	gmsh.model.geo.addLine(2, 3, 2)
	gmsh.model.geo.addLine(3, 4, 3)
	gmsh.model.geo.addLine(4, 1, 4)

	# Lines for the pentagon
	gmsh.model.geo.addLine(20, 21, 5)
	gmsh.model.geo.addLine(21, 22, 6)
	gmsh.model.geo.addLine(22, 23, 7)
	gmsh.model.geo.addLine(23, 24, 8)
	gmsh.model.geo.addLine(24, 20, 9)

	# Creation of the circles for the three cylinders (4 CircleArcs for each circle)
	for i in range(3):
		gmsh.model.geo.addCircleArc(6 + 5*i, 5 + 5*i, 7 + 5*i, tag=10 + 4*i)
		gmsh.model.geo.addCircleArc(7 + 5*i, 5 + 5*i, 8 + 5*i, tag=11 + 4*i)
		gmsh.model.geo.addCircleArc(8 + 5*i, 5 + 5*i, 9 + 5*i, tag=12 + 4*i)
		gmsh.model.geo.addCircleArc(9 + 5*i, 5 + 5*i, 6 + 5*i, tag=13 + 4*i)

	## 2 dim
	# Creation of the plane surface: rectangle minus pentagon
	gmsh.model.geo.addCurveLoop(list(range(1, 5)), 1)
	gmsh.model.geo.addCurveLoop(list(range(5, 10)), 2)
	gmsh.model.geo.addPlaneSurface([1, 2], 1)


# List of curve loops, starting with the curve loop of the polygon (tag 2)
curved_loop_list = [2]

if mesh_comm.rank == model_rank:
	# Creation of curves loops for the three cylinders
	for i in range(3):
		gmsh.model.geo.addCurveLoop([10 + 4*i, 11 + 4*i, 12 + 4*i, 13 + 4*i], 3 + i)
		curved_loop_list.append(3 + i)

	# Creation of the plane surface: polygon - cylinders
	gmsh.model.geo.addPlaneSurface(curved_loop_list, 2)

	gmsh.model.geo.synchronize()

	## Fields
	# Distance field to calculate distance from the three cylinders
	gmsh.model.mesh.field.add("Distance", 1)
	gmsh.model.mesh.field.setNumbers(1, "CurvesList", list(range(10, 22)))
	gmsh.model.mesh.field.setNumber(1, "Sampling", 100)

	# Threshold field to refine the mesh around the three cylinders
	gmsh.model.mesh.field.add("Threshold", 2)
	gmsh.model.mesh.field.setNumber(2, "InField", 1)
	gmsh.model.mesh.field.setNumber(2, "SizeMin", lc / 20)
	gmsh.model.mesh.field.setNumber(2, "SizeMax", lc)
	gmsh.model.mesh.field.setNumber(2, "DistMin", r/2)
	gmsh.model.mesh.field.setNumber(2, "DistMax", r)

	# Constant mesh size inside polygon
	gmsh.model.mesh.field.add("Constant", 3)
	gmsh.model.mesh.field.setNumber(3, "VIn", lc/5)
	gmsh.model.mesh.field.setNumber(3, "VOut", lc)
	gmsh.model.mesh.field.setNumbers(3, "SurfacesList", [2])

	# Final field as in t10.py, take the minimum mesh size for all Threshold fields.
	gmsh.model.mesh.field.add("Min", 5)
	gmsh.model.mesh.field.setNumbers(5, "FieldsList", [2, 3])
	gmsh.model.mesh.field.setAsBackgroundMesh(5)

	## Create Physical Groups
	gmsh.model.addPhysicalGroup(1, [1, 3], name="walls") # walls: up and down sides of the rectangle
	gmsh.model.addPhysicalGroup(1, [4], name="inlet") # inlet: left side of the rectangle
	gmsh.model.addPhysicalGroup(1, [2], name="outlet") # outlet: right side of the rectangle
	gmsh.model.addPhysicalGroup(1, list(range(10, 14)), name="front_cylinder") # front obstacle (left most)
	gmsh.model.addPhysicalGroup(1, list(range(14, 18)), name="top_cylinder") # top obstacle (the up most)
	gmsh.model.addPhysicalGroup(1, list(range(18, 22)), name="bottom_cylinder") # bottom cylinder (the down most)
	gmsh.model.addPhysicalGroup(2, [1, 2], name="control_volume") # Reunion of the two plane surfaces: where fluid will flow

	gmsh.model.geo.synchronize()

	# Generate and write to file
	gmsh.model.mesh.generate(gdim)
	gmsh.write("pinball_mesh.msh")
	gmsh.write("pinball_mesh.vtk")

	# Launch the GUI to see the results:
	# if '-nopopup' not in sys.argv:
	#     gmsh.fltk.run()


mesh, cell_tags, facet_tags = gmshio.model_to_mesh(gmsh.model, mesh_comm, model_rank, gdim=gdim)
mesh.name = "Pinball"
cell_tags.name = "Cell markers"
facet_tags.name = "Facet markers"

# printing how long the mesh creation took
toc = time.perf_counter()
printr0("Mesh generation took {} secondes !".format(toc-tic))


############################################
#---- 5 Boundary conditions definition ----#
############################################

# rotation velocities for the three cylinders
Omegas = [-12, 12, 12]

# inlet velocity profile (Poiseuille)
bcu_inlet = ["velocity", Inlet_Velocity(2, 12*r, gdim), inlet_marker]

# velocity is zero at the walls
bcu_walls = ["velocity", Constant_Dirichlet(0,0), wall_marker]

# no slip condition according to the rotation velocities
bcu_obstacles = []
for i in range(len(Omegas)):
	bcu_obstacles.append(["velocity", Obstacle_Velocity(Omegas[i], [c_x[i], c_y[i]], gdim), obstacle_markers[i]])

# pressure is zero at the outlet
bcp_outlet = ["pressure", Constant_Dirichlet(0), outlet_marker]


############################################
#--- 6 Parameters definition and launching --#
############################################
parameters = {
	"num_steps": 10, # number of steps (number of times the pde system will be solved)
	"dt": 0.1, # time step
	"nu": 1/200, # a parameter in the Navier-Stokes equations (the pde system to solve)
	"BDF": 2, # Backward Differentation Formula order
	"bcs": [bcu_inlet, bcu_walls, *bcu_obstacles, bcp_outlet], # boundary conditions
	"mesh": mesh,
	"ft": facet_tags,
	"c": cell_tags,
}

# Instantiation of a FluidDynamics_env object
simulation = FluidDynamics_env(parameters)




# calculating how long the simulation took
tic = time.perf_counter()
for i in range(parameters["num_steps"]):
	printr0("#***********#\n  Step number {} #**********#".format(i))

	# performing one step
	simulation.one_time_step()
	if i == 2:
		tic = time.perf_counter()

toc = time.perf_counter()
printr0("Simulation took {} secondes !".format(toc-tic))

gmsh.finalize()

I would be grateful if you could run my code. I don’t think you need to understand everything about it; with the broad outline of how it works (the 6 parts), I think it should be relatively easy to find out why multiprocessing fails to speed up the simulation (provided one has a better understanding than me of how multiprocessing works with FEniCSx).

Thanks and regards.

I would suggest you start by timing different operations, so that you can see which parts of the code take time, and then pinpoint what does not scale.

This will make it easier for others to give you suggestions on how to improve performance.

In particular:

You do a lot of operations inside the time loop that shouldn’t be performed there, like generating the nonlinear problem and the Newton solver, as they should not change over time. See:
https://jsdokken.com/dolfinx-tutorial/chapter2/ns_code2.html
There, very few variables are defined inside the time loop.
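
For illustration, here is a minimal sketch of that pattern (this is not the tutorial code, just its structure; F, w, w_prev1, bcs, mesh_comm, dt and num_steps stand for the corresponding objects in the code above):

# build the nonlinear problem and the Newton solver once, before the time loop
problem = NonlinearProblem(F, w, bcs=bcs)
solver = NewtonSolver(mesh_comm, problem)

t = 0.0
for step in range(num_steps):
	t += dt
	# the only expensive call inside the loop is the solve itself
	n_iter, converged = solver.solve(w)
	assert converged
	# shift the time history before the next step
	w_prev1.x.array[:] = w.x.array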

In my code, the nonlinear problem and the Newton solver are only generated for the first 2 steps, because the BDF order changes. From step 3 on, they do not change (as enforced by the if self.n_steps < self.BDF condition). Anyway, I will follow your advice and come back with more precise timings, I hope.

OK, so I did what you suggested, using the following scheme (I included it in my first post by editing it):

tic = time.perf_counter()
# Performing Operation Op
toc = time.perf_counter()
printr0("operation OP took {} seconds".format(toc - tic))

I also print when the solver is changed. The results are shown below:

With one proc:

TOP1$ mpirun -np 1 python test_pinball_parallel.py 
Warning : Unknown curve 22
Warning : Unknown curve 23
Warning : Unknown curve 24
Warning : Unknown curve 25
Info    : Meshing 1D...
Info    : [  0%] Meshing curve 1 (Line)
Info    : [ 10%] Meshing curve 2 (Line)
Info    : [ 10%] Meshing curve 3 (Line)
Info    : [ 20%] Meshing curve 4 (Line)
Info    : [ 20%] Meshing curve 5 (Line)
Info    : [ 30%] Meshing curve 6 (Line)
Info    : [ 30%] Meshing curve 7 (Line)
Info    : [ 40%] Meshing curve 8 (Line)
Info    : [ 40%] Meshing curve 9 (Line)
Info    : [ 50%] Meshing curve 10 (Circle)
Info    : [ 50%] Meshing curve 11 (Circle)
Info    : [ 60%] Meshing curve 12 (Circle)
Info    : [ 60%] Meshing curve 13 (Circle)
Info    : [ 70%] Meshing curve 14 (Circle)
Info    : [ 70%] Meshing curve 15 (Circle)
Info    : [ 80%] Meshing curve 16 (Circle)
Info    : [ 80%] Meshing curve 17 (Circle)
Info    : [ 90%] Meshing curve 18 (Circle)
Info    : [ 90%] Meshing curve 19 (Circle)
Info    : [100%] Meshing curve 20 (Circle)
Info    : [100%] Meshing curve 21 (Circle)
Info    : Done meshing 1D (Wall 0.00224896s, CPU 0.002252s)
Info    : Meshing 2D...
Info    : [  0%] Meshing surface 1 (Plane, Frontal-Delaunay)
Info    : [ 50%] Meshing surface 2 (Plane, Frontal-Delaunay)
Info    : Done meshing 2D (Wall 0.0960175s, CPU 0.095957s)
Info    : 5130 nodes 10512 elements
Warning : ------------------------------
Warning : Mesh generation error summary
Warning :     4 warnings
Warning :     0 errors
Warning : Check the full log for details
Warning : ------------------------------
Info    : Writing 'pinball_mesh.msh'...
Info    : Done writing 'pinball_mesh.msh'
Info    : Writing 'pinball_mesh.vtk'...
Info    : Done writing 'pinball_mesh.vtk'
Mesh generation took 0.154706476998399 seconds !

#*********** Step number 0 **********#
changing solver
solving took 0.4493800980053493 seconds
assigning took 0.00011367299885023385 seconds
write to file took 0.0036510049976641312 seconds

#*********** Step number 1 **********#
changing solver
solving took 0.44174166599987075 seconds
assigning took 0.00011493099736981094 seconds
write to file took 0.00605353299761191 seconds

#*********** Step number 2 **********#
solving took 0.2945928550034296 seconds
assigning took 0.00011299200559733436 seconds
write to file took 0.0038427510007750243 seconds

#*********** Step number 3 **********#
solving took 0.28893588399660075 seconds
assigning took 0.00011320899648126215 seconds
write to file took 0.005272765003610402 seconds

#*********** Step number 4 **********#
solving took 0.2833328859996982 seconds
assigning took 0.00011347099643899128 seconds
write to file took 0.004115570001886226 seconds

#*********** Step number 5 **********#
solving took 0.28557718800584553 seconds
assigning took 0.00011610399815253913 seconds
write to file took 0.0038458839990198612 seconds

#*********** Step number 6 **********#
solving took 0.2880920169991441 seconds
assigning took 0.00011410099978093058 seconds
write to file took 0.005538429002626799 seconds

#*********** Step number 7 **********#
solving took 0.2874544110018178 seconds
assigning took 0.00011205599730601534 seconds
write to file took 0.0038910889998078346 seconds

#*********** Step number 8 **********#
solving took 0.28381079799873987 seconds
assigning took 0.00012047500058542937 seconds
write to file took 0.0039026529993861914 seconds

#*********** Step number 9 **********#
solving took 0.28751488299894845 seconds
assigning took 0.00011340500350343063 seconds
write to file took 0.005601992001174949 seconds
Simulation took 2.037940680005704 secondes !

With 10 procs:

TOP1$ mpirun -np 10 python test_pinball_parallel.py 
Warning : Unknown curve 22
Warning : Unknown curve 23
Warning : Unknown curve 24
Warning : Unknown curve 25
Info    : Meshing 1D...
Info    : [  0%] Meshing curve 1 (Line)
Info    : [ 10%] Meshing curve 2 (Line)
Info    : [ 10%] Meshing curve 3 (Line)
Info    : [ 20%] Meshing curve 4 (Line)
Info    : [ 20%] Meshing curve 5 (Line)
Info    : [ 30%] Meshing curve 6 (Line)
Info    : [ 30%] Meshing curve 7 (Line)
Info    : [ 40%] Meshing curve 8 (Line)
Info    : [ 40%] Meshing curve 9 (Line)
Info    : [ 50%] Meshing curve 10 (Circle)
Info    : [ 50%] Meshing curve 11 (Circle)
Info    : [ 60%] Meshing curve 12 (Circle)
Info    : [ 60%] Meshing curve 13 (Circle)
Info    : [ 70%] Meshing curve 14 (Circle)
Info    : [ 70%] Meshing curve 15 (Circle)
Info    : [ 80%] Meshing curve 16 (Circle)
Info    : [ 80%] Meshing curve 17 (Circle)
Info    : [ 90%] Meshing curve 18 (Circle)
Info    : [ 90%] Meshing curve 19 (Circle)
Info    : [100%] Meshing curve 20 (Circle)
Info    : [100%] Meshing curve 21 (Circle)
Info    : Done meshing 1D (Wall 0.00211931s, CPU 0.002122s)
Info    : Meshing 2D...
Info    : [  0%] Meshing surface 1 (Plane, Frontal-Delaunay)
Info    : [ 50%] Meshing surface 2 (Plane, Frontal-Delaunay)
Info    : Done meshing 2D (Wall 0.0999824s, CPU 0.096669s)
Info    : 5130 nodes 10512 elements
Warning : ------------------------------
Warning : Mesh generation error summary
Warning :     4 warnings
Warning :     0 errors
Warning : Check the full log for details
Warning : ------------------------------
Info    : Writing 'pinball_mesh.msh'...
Info    : Done writing 'pinball_mesh.msh'
Info    : Writing 'pinball_mesh.vtk'...
Info    : Done writing 'pinball_mesh.vtk'
Mesh generation took 0.21981262899498688 secondes !

#*********** Step number 0 **********#
changing solver
solving took 0.9320117180031957 seconds
assigning took 7.355699926847592e-05 seconds
write to file took 0.0012539090021164156 seconds

#*********** Step number 1 **********#
changing solver
solving took 1.1791208009963157 seconds
assigning took 9.262900130124763e-05 seconds
write to file took 0.001420952998159919 seconds

#*********** Step number 2 **********#
solving took 0.7670692019964918 seconds
assigning took 9.635600144974887e-05 seconds
write to file took 0.0015197630054899491 seconds

#*********** Step number 3 **********#
solving took 0.767753072999767 seconds
assigning took 7.582100079162046e-05 seconds
write to file took 0.0016738880003686063 seconds

#*********** Step number 4 **********#
solving took 0.7144116120034596 seconds
assigning took 7.484799425583333e-05 seconds
write to file took 0.0015132079934119247 seconds

#*********** Step number 5 **********#
solving took 1.065860167997016 seconds
assigning took 7.135100167943165e-05 seconds
write to file took 0.0013955670001450926 seconds

#*********** Step number 6 **********#
solving took 0.9430404039958376 seconds
assigning took 9.045399929163978e-05 seconds
write to file took 0.002463064003677573 seconds

#*********** Step number 7 **********#
solving took 0.8550480589983636 seconds
assigning took 8.743999933358282e-05 seconds
write to file took 0.0017239599983440712 seconds

#*********** Step number 8 **********#
solving took 0.8422839449995081 seconds
assigning took 9.565900109009817e-05 seconds
write to file took 0.021277726998960134 seconds

#*********** Step number 9 **********#
solving took 0.9694747069952427 seconds
assigning took 9.408200276084244e-05 seconds
write to file took 0.001433602999895811 seconds
Simulation took 6.190277079003863 secondes !

As I interpret it, the problem is with the solving step, that is to say the line n, converged = self.solver.solve(self.w). It takes 3 times longer with 10 processes than with only one, and we see the same ratio in the total simulation time (about 2 s with 1 process vs 6 s with 10 processes).

Well, I tested the nonlinear Poisson tutorial, and it behaves like my code. So my code is correct; the thing is, I am missing something concerning multiprocessing with the nonlinear Newton solver.
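
(As a side note, the commented-out block in one_time_step is where the Newton solver’s inner Krylov solver could be configured through PETSc options; a minimal sketch of the same thing, assuming self.solver has just been created — whether this matters for the parallel timings here is not established:)

ksp = self.solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_type"] = "preonly"  # apply only the preconditioner, no Krylov iterations
opts[f"{option_prefix}pc_type"] = "lu"  # direct LU factorization
opts[f"{option_prefix}pc_factor_mat_solver_type"] = "mumps"  # MUMPS supports parallel factorization
ksp.setFromOptions()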

Running your code, I get the following results:
Serial

...
  Step number 8 #**********#
solving took 0.3430029360001754 seconds
assigning took 0.00025625400030548917 seconds
write to file took 0.0036635249998653308 seconds
#***********#
  Step number 9 #**********#
solving took 0.34020179100025416 seconds
assigning took 0.0003723680001712637 seconds
write to file took 0.0037930280000182393 seconds
Simulation took 2.44264921399963 secondes !

2 processes

#***********#
  Step number 8 #**********#
solving took 0.27043084999968414 seconds
assigning took 0.00011014599976988393 seconds
write to file took 0.0021438179996948747 seconds
#***********#
  Step number 9 #**********#
solving took 0.2770655029999034 seconds
assigning took 0.00013859999990017968 seconds
write to file took 0.002762994000022445 seconds
Simulation took 1.960063328999695 secondes !

4 processes

#***********#
  Step number 8 #**********#
solving took 0.16932072899999184 seconds
assigning took 0.00010603999999148073 seconds
write to file took 0.0021398339995357674 seconds
#***********#
  Step number 9 #**********#
solving took 0.16481978699994215 seconds
assigning took 0.00010450000081618782 seconds
write to file took 0.002268500000354834 seconds
Simulation took 1.1499761260001833 secondes !

8 processes

#***********#
  Step number 8 #**********#
solving took 0.2175434189994121 seconds
assigning took 0.00014048400043975562 seconds
write to file took 0.004266792000635178 seconds
#***********#
  Step number 9 #**********#
solving took 0.1883403330002693 seconds
assigning took 0.00013021899940213189 seconds
write to file took 0.003587139999581268 seconds
Simulation took 1.426377312999648 secondes !

As you can see, there is a sweet spot beyond which there is no point in using more processes.

This is something one should be aware of, as your problem is relatively small (only 45 000 dofs).
Using multiprocessing is not free: you have to communicate data between processes (as the mesh is distributed, so are the functions, vectors and matrices), and with very few dofs on each process (~6000 in the case of 8 processes), you spend more time communicating between processes than you gain by splitting the problem.
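
(As an aside, a quick way to see how the dofs are split across ranks is a few lines like the following sketch, reusing mesh_comm and the mixed space from the code above; this is not part of the original script:)

W = simulation.W  # the mixed velocity-pressure space from the code above
local_dofs = W.dofmap.index_map.size_local * W.dofmap.index_map_bs  # dofs owned by this rank
global_dofs = W.dofmap.index_map.size_global * W.dofmap.index_map_bs
print(f"rank {mesh_comm.rank}: {local_dofs} owned dofs out of {global_dofs}")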

I’ve illustrated this in various settings, see for instance:

OK, thanks for your response. Yes, I was aware of the “sweet spot” thing; as in machine learning and many other optimization tasks, one has to consider a trade-off.

I conclude that my code works for you (multiprocessing speeds things up, up to 8 processes), but not for me… I will try to find out what it is, but I am pretty sure it has to do with the Newton solver.

OK, problem solved: it was not a problem with my code but with my machine. I added export OMP_NUM_THREADS=1 to my .bashrc, as in this post, and now it works fine, with best performance at 8 processes and definitely faster than in serial.
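
(For anyone hitting the same thing, the fix amounts to something like the following, either in ~/.bashrc or directly before the launch command:)

export OMP_NUM_THREADS=1
mpirun -np 8 python test_pinball_parallel.py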

Thanks again for your help and quick responses, I am really grateful!