Understanding the timings of performance-tests

Hello,

I’m using GitHub - FEniCS/performance-test: Mini App for FEniCSx performance testing for my performance tests for the Google Summer of Code project detailed in Google Summer of Code

I’ve been trying to make sense of the timings table that is output. For example,

[MPI_MAX] Summary of timings                                                |  reps  wall avg  wall tot
-------------------------------------------------------------------------------------------------------
Build BoxMesh (tetrahedra)                                                  |     1  1.900000  1.900000
Build dofmap data                                                           |     2  0.060000  0.120000
Build sparsity                                                              |     1  0.050000  0.050000
Compute connectivity 2-0                                                    |     1  0.010000  0.010000
Compute dof reordering map                                                  |     2  0.005000  0.010000
Compute entities of dim = 2                                                 |     1  0.270000  0.270000
Compute graph partition (SCOTCH)                                            |     1  0.030000  0.030000
Compute local part of mesh dual graph                                       |     2  0.470000  0.940000
Compute local-to-local map                                                  |     1  0.010000  0.010000
Compute non-local part of mesh dual graph                                   |     1  0.000000  0.000000
Compute-local-to-global links for global/local adjacency list               |     1  0.010000  0.010000
Distribute AdjacencyList nodes to destination ranks                         |     1  0.070000  0.070000
Distribute row-wise data (scalable)                                         |     1  0.010000  0.010000
GPS: create_level_structure                                                 |     2  0.015000  0.030000
Gibbs-Poole-Stockmeyer ordering                                             |     1  0.090000  0.090000
Init MPI                                                                    |     1  0.080000  0.080000
Init PETSc                                                                  |     1  0.000000  0.000000
Init dofmap from element dofmap                                             |     2  0.050000  0.100000
Init logging                                                                |     1  0.000000  0.000000
PETSc Krylov solver                                                         |     1  0.960000  0.960000
SCOTCH: call SCOTCH_dgraphBuild                                             |     1  0.000000  0.000000
SCOTCH: call SCOTCH_dgraphPart                                              |     1  0.020000  0.020000
SparsityPattern::finalize                                                   |     1  0.080000  0.080000
Topology: create                                                            |     1  0.490000  0.490000
Topology: determine shared index ownership                                  |     1  0.010000  0.010000
Topology: determine vertex ownership groups (owned, undetermined, unowned)  |     1  0.080000  0.080000
ZZZ Assemble                                                                |     1  0.820000  0.820000
ZZZ Assemble matrix                                                         |     1  0.340000  0.340000
ZZZ Assemble vector                                                         |     1  0.090000  0.090000
ZZZ Create Mesh                                                             |     1  1.900000  1.900000
ZZZ Create RHS function                                                     |     1  0.240000  0.240000
ZZZ Create boundary conditions                                              |     1  0.010000  0.010000
ZZZ Create facets and facet->cell connectivity                              |     1  0.280000  0.280000
ZZZ FunctionSpace                                                           |     1  0.060000  0.060000
ZZZ Solve                                                                   |     1  0.960000  0.960000

I know that summing the columns is wrong because my wall total is a lot longer than what I’ve measured by hand the program takes to run (this is a single-core run.)

I would appreciate it if someone can detail what these numbers mean precisely – I can then also make a PR with this information to the documentation of performance-tests on GitHub.

Are you asking regarding the difference between wall time average and wall time total, or the meaning of the timers’ names?

@nate I believe the wall time average is the quotient of total by number of repetitions. I summed the total column but it did not produce the total time the program was running; it produced a much higher time interval, e.g. 10 seconds for a program that run for 1-2 seconds.

How should these events be put on a timeline?

I decided to go with the atexit time reported by loguru for an aggregate.

Apologies, I got distracted and this slipped my mind.

You can find the key source for the management of timers here: dolfinx/cpp/dolfinx/common/TimeLogger.cpp at main · FEniCS/dolfinx · GitHub

Specifically the difference between average and total as you described above: dolfinx/cpp/dolfinx/common/TimeLogger.cpp at ddc14f069263bdf98ec181d1027a5ad855f54071 · FEniCS/dolfinx · GitHub

There is a large number of timers employed in DOLFINx. These will overlap each other and not sequentially time a simulation. It’s typical to time the code with custom timers in the simulation script/main code itself, e.g.,

import dolfinx
# ...
total_timer = dolfinx.common.Timer("My total time")
# ...
total_timer.stop()
# ...
dolfinx.common.list_timings(...)

with any “subtimers” therein for further analysis.

You can see this in the performance tests, e.g., here. With results published here.

@nate Thanks for the response. I’m not solving problems with FEniCS, the project is based on running those performance tests you linked. If you could elaborate on the timings shown with that test, I could try to make a PR to include some additional documentation to performance-tests.

The salient timings for you are likely these:

These are called as described in the following tree

├──ZZZ Create Mesh
├──ZZZ Create facets and facet->cell connectivity
├──ZZZ FunctionSpace
├──ZZZ Assemble
│  ├──ZZZ Create boundary conditions
│  ├──ZZZ Create RHS function
│  ├──ZZZ Assemble matrix
│  └──ZZZ Assemble vector
├──ZZZ Solve 
├──ZZZ Output 

You should check this in the source code looking for anything with ZZZ and not simply take my word for it.

Summarising what these timings are measuring (there’s a lot more going on than what I’m writing below, but this will give you the gist):

  • ZZZ Create Mesh: Create the mesh to be used as the spatial discretisation of the domain in the FE problem
  • ZZZ Create facets and facet->cell connectivity Compute the topology connectivity of the mesh’s graph. I.e. compute the relationship between which cells are connected to each facet.
  • ZZZ FunctionSpace Create the function space in which the FE solution will be sought along with appropriate index maps for each degree of freedom and their relationship with the mesh.
  • ZZZ Assemble Encompassing timer for:
    • ZZZ Create boundary conditions: Find the mesh’s topological indices and corresponding degree of freedom indices on which to impose boundary data in a strong Dirichlet sense.
    • ZZZ Create RHS function: This is the step computing the function f or \vec{f} in the cases where -\nabla^2 u = f and -\nabla \cdot \mathbb{C} (u) = \vec{f} (Poisson and elasticity problems, respectively).
    • ZZZ Assemble matrix: Assemble the finite element matrix \mathrm{A} underlying finite element formulation, such that we seek to later solve \mathrm{A} \vec{x} = \vec{\mathrm{b}}
    • ZZZ Assemble vector: Likewise the time to assemble the right-hand-side vector \vec{\mathrm{b}}.
  • ZZZ Solve: The time to compute the solution of the linear system, \vec{x}. This is typically the dominant stage taking the greatest computational effort.
  • ZZZ Output: The time to postprocess and output any results to disk.
2 Likes

@nate Thank you that is very helpful. I will incorporate it into a documentation patch for performance-tests in the near future.