Multiprocessing FEniCSx making simulation slower

Running your code, I get the following results:
Serial

...
  Step number 8 #**********#
solving took 0.3430029360001754 seconds
assigning took 0.00025625400030548917 seconds
write to file took 0.0036635249998653308 seconds
#***********#
  Step number 9 #**********#
solving took 0.34020179100025416 seconds
assigning took 0.0003723680001712637 seconds
write to file took 0.0037930280000182393 seconds
Simulation took 2.44264921399963 secondes !

2 processes

#***********#
  Step number 8 #**********#
solving took 0.27043084999968414 seconds
assigning took 0.00011014599976988393 seconds
write to file took 0.0021438179996948747 seconds
#***********#
  Step number 9 #**********#
solving took 0.2770655029999034 seconds
assigning took 0.00013859999990017968 seconds
write to file took 0.002762994000022445 seconds
Simulation took 1.960063328999695 secondes !

4 processes

#***********#
  Step number 8 #**********#
solving took 0.16932072899999184 seconds
assigning took 0.00010603999999148073 seconds
write to file took 0.0021398339995357674 seconds
#***********#
  Step number 9 #**********#
solving took 0.16481978699994215 seconds
assigning took 0.00010450000081618782 seconds
write to file took 0.002268500000354834 seconds
Simulation took 1.1499761260001833 secondes !

8 processes

#***********#
  Step number 8 #**********#
solving took 0.2175434189994121 seconds
assigning took 0.00014048400043975562 seconds
write to file took 0.004266792000635178 seconds
#***********#
  Step number 9 #**********#
solving took 0.1883403330002693 seconds
assigning took 0.00013021899940213189 seconds
write to file took 0.003587139999581268 seconds
Simulation took 1.426377312999648 secondes !

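As you can see, there is a sweet spot (4 processes for this problem), beyond which adding more processes no longer reduces the run time: the 8-process run (1.43 s) is slower than the 4-process run (1.15 s).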
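To put numbers on that sweet spot, here is a minimal sketch that computes speedup and parallel efficiency from the total run times quoted above. The variable names are my own; only the timings (rounded to three decimals) come from the runs.

```python
# Total wall-clock times in seconds, copied from the "Simulation took ..." lines
# above and rounded. Keys are the number of MPI processes.
timings = {1: 2.443, 2: 1.960, 4: 1.150, 8: 1.426}

serial_time = timings[1]

# speedup = T(1) / T(p); efficiency = speedup / p (1.0 would be ideal scaling)
results = {}
for procs, elapsed in timings.items():
    speedup = serial_time / elapsed
    efficiency = speedup / procs
    results[procs] = (speedup, efficiency)
    print(f"{procs} process(es): speedup {speedup:.2f}, efficiency {efficiency:.0%}")

# The process count with the smallest total run time
best = min(timings, key=timings.get)
print(f"Fastest run: {best} processes")
```

Even at 4 processes the speedup is only about 2.1, roughly half of ideal scaling, and the efficiency keeps dropping as processes are added.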

This is something to be aware of, as your problem is relatively small (only 45 000 dofs).
Multiprocessing is not free: data has to be communicated between processes (the mesh is distributed, and so are the functions, vectors and matrices). With very few dofs on each process (~6000 in the case of 8 processes), you spend more time communicating with other processes than you gain by splitting the problem.
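As a rough illustration of how quickly the local problem shrinks, the sketch below estimates the number of dofs owned by each process. It assumes a perfectly even partition and ignores ghost dofs shared between processes, so the real numbers will differ somewhat; the function name is hypothetical.

```python
def dofs_per_process(total_dofs: int, num_procs: int) -> int:
    """Estimate owned dofs per process, assuming an even partition (no ghosts)."""
    return total_dofs // num_procs


total_dofs = 45_000  # approximate problem size quoted above

for procs in (1, 2, 4, 8):
    print(f"{procs} process(es): ~{dofs_per_process(total_dofs, procs)} dofs each")
```

With 8 processes each rank owns fewer than 6000 dofs, so there is little local work to amortize the cost of communication.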

I’ve illustrated this in various settings, see for instance: