The problem you are running is too small to benefit from multiple processes. Parallel computing requires communication between processes, so each process needs enough dofs for the computation to outweigh the communication cost. Finding that number is a balancing act.
For instance, if you increase the number of dofs in your problem by using a 500x500 unit square mesh (1 million dofs in your mixed space), the average time over 10 runs of the Krylov solver is 9.96 seconds in serial.
With 2 processes it drops to 9.09 seconds.
However, with 3 processes the time goes back up to 9.6 seconds, and with 4 processes it is 9.5 seconds.
So let's do the same with N=1000 (~4 million dofs), times in seconds:
- Serial: 39.657
- 2 processes: 29.05
- 3 processes: 25.4
- 4 processes: 24.33
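To make the scaling easier to judge, the numbers above can be turned into speedup and parallel efficiency. This is only a small sketch using the N=1000 timings quoted in this post; the helper name `speedup_and_efficiency` is made up for illustration and is not part of DOLFIN or PETSc.

```python
# Wall-clock times in seconds for N=1000, copied from the list above,
# keyed by the number of MPI processes.
timings = {1: 39.657, 2: 29.05, 3: 25.4, 4: 24.33}


def speedup_and_efficiency(timings):
    """Return {procs: (speedup, efficiency)} relative to the serial run.

    speedup    = T_serial / T_p
    efficiency = speedup / p   (1.0 would be perfect scaling)
    """
    serial = timings[1]
    return {
        procs: (serial / t, serial / (t * procs))
        for procs, t in sorted(timings.items())
    }


results = speedup_and_efficiency(timings)
for procs, (speedup, efficiency) in results.items():
    print(f"{procs} procs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency well below 100% on 3 and 4 processes suggests that communication, or a part of the run that does not scale, is still taking a large share of the time at this problem size.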
Here are the full timings for this case, for serial and for 2 processes:
Serial:
[MPI_AVG] Summary of timings | reps wall avg wall tot
----------------------------------------------------------------------------
Apply (PETScMatrix) | 3 0.026543 0.079629
Apply (PETScVector) | 37 7.6436e-06 0.00028281
Assemble cells | 7 3.3737 23.616
Build sparsity | 2 0.92588 1.8518
Compute SCOTCH graph re-ordering | 7 0.14492 1.0145
Compute connectivity 1-2 | 1 0.12209 0.12209
Compute entities dim = 1 | 1 0.73784 0.73784
Delete sparsity | 2 1.1605e-06 2.321e-06
DistributedMeshTools: reorder vertex values | 4 0.055076 0.2203
Init dof vector | 6 0.033172 0.19903
Init dofmap | 7 0.76472 5.353
Init dofmap from UFC dofmap | 7 0.14358 1.0051
Init tensor | 2 0.089218 0.17844
Number distributed mesh entities | 7 5.2843e-07 3.699e-06
PETSc Krylov solver | 3 13.185 39.554
SCOTCH: call SCOTCH_graphBuild | 7 0.00049532 0.0034673
SCOTCH: call SCOTCH_graphOrder | 7 0.1243 0.87011
2 processes:
Process 0: [MPI_AVG] Summary of timings | reps wall avg wall tot
-----------------------------------------------------------------------------------------------------
Apply (PETScMatrix) | 3 0.02795 0.08385
Apply (PETScVector) | 37 0.002464 0.091168
Assemble cells | 7 1.82 12.74
Build LocalMeshData from local Mesh | 1 0.17548 0.17548
Build distributed mesh from local mesh data | 1 2.6436 2.6436
Build local part of distributed mesh (from local mesh data) | 1 0.27132 0.27132
Build sparsity | 2 0.94024 1.8805
Compute SCOTCH graph re-ordering | 7 0.048955 0.34269
Compute connectivity 1-2 | 1 0.055547 0.055547
Compute entities dim = 1 | 1 0.37567 0.37567
Compute graph partition (SCOTCH) | 1 0.29117 0.29117
Compute local part of mesh dual graph | 1 0.30689 0.30689
Compute mesh entity ownership | 1 0.072827 0.072827
Compute non-local part of mesh dual graph | 1 0.0032135 0.0032135
Delete sparsity | 2 1.338e-06 2.676e-06
Distribute cells | 1 0.099223 0.099223
Distribute mesh (cells and vertices) | 1 0.7537 0.7537
Distribute vertices | 1 0.25513 0.25513
DistributedMeshTools: reorder vertex values | 12 0.033846 0.40615
Extract partition boundaries from SCOTCH graph | 1 0.0048294 0.0048294
Get SCOTCH graph data | 1 1.216e-06 1.216e-06
Init dof vector | 6 0.020221 0.12132
Init dofmap | 7 0.32877 2.3014
Init dofmap from UFC dofmap | 7 0.071951 0.50365
Init tensor | 2 0.054538 0.10908
Number distributed mesh entities | 8 0.065094 0.52075
Number mesh entities for distributed mesh (for specified vertex ids) | 1 0.51871 0.51871
PETSc Krylov solver | 3 9.6925 29.077
SCOTCH: call SCOTCH_dgraphBuild | 1 0.00071204 0.00071204
SCOTCH: call SCOTCH_dgraphHalo | 1 0.006222 0.006222
SCOTCH: call SCOTCH_dgraphPart | 1 0.27858 0.27858
SCOTCH: call SCOTCH_graphBuild | 7 0.00026688 0.0018682
SCOTCH: call SCOTCH_graphOrder | 7 0.040955 0.28669
Here you can observe that "Assemble cells" is almost perfectly halved, from 23.616 to 12.74 seconds, while "PETSc Krylov solver" only goes from 39.554 to 29.077 seconds. The solver scaling could likely be improved by tuning the solver.
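The same comparison can be done per stage. The sketch below uses a few "wall tot" values copied from the two tables above (the 2-process values are the ones reported for process 0); the function name `stage_speedups` is hypothetical and only for illustration.

```python
# "wall tot" in seconds, copied from the serial table above.
serial = {
    "Assemble cells": 23.616,
    "Build sparsity": 1.8518,
    "PETSc Krylov solver": 39.554,
}

# "wall tot" in seconds, copied from the 2-process table above.
two_procs = {
    "Assemble cells": 12.74,
    "Build sparsity": 1.8805,
    "PETSc Krylov solver": 29.077,
}


def stage_speedups(serial, parallel):
    """Speedup per stage, for the stages present in both runs."""
    return {
        name: serial[name] / parallel[name]
        for name in serial
        if name in parallel
    }


speedups = stage_speedups(serial, two_procs)
for name, s in speedups.items():
    # A value close to 2.0 means the stage scales almost perfectly on 2 processes.
    print(f"{name}: {s:.2f}x")
```

A stage whose speedup stays near 1.0 is one that did not get faster with the extra process, so that is where to look first.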