Speeding up Complex Electromagnetic simulations with dolfinx

Hi all,

I’m currently using fenics 2019.2.0 for solving electromagnetic wave PDEs. Since this version of FEniCS cannot handle complex numbers directly, for each variable I define two TrialFunctions corresponding to the real and imaginary parts of that variable.
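
For illustration, here is a minimal sketch of this real/imaginary splitting (not my actual weak form; the Helmholtz-type equation and the names below are just placeholders):

from dolfin import *

# Placeholder problem: -div(grad(u)) + i*k*u = f, split into real/imaginary parts
mesh = UnitSquareMesh(32, 32)
P3 = FiniteElement("CG", mesh.ufl_cell(), 3)
W = FunctionSpace(mesh, MixedElement([P3, P3]))  # (Re u, Im u)

(u_re, u_im) = TrialFunctions(W)
(v_re, v_im) = TestFunctions(W)

k = Constant(5.0)  # placeholder coefficient
f = Constant(1.0)  # placeholder (real) source

# The complex equation becomes a coupled 2x2 real block system
a = (inner(grad(u_re), grad(v_re)) + inner(grad(u_im), grad(v_im))
     - k * u_im * v_re + k * u_re * v_im) * dx
L = f * v_re * dx

w = Function(W)
solve(a == L, w)
re_u, im_u = w.split()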

As an example, at the moment I’m solving a 2D problem on a mesh with around 4000 elements and 2000 nodes. I have 18 different TrialFunctions (3 vector fields * 3 components * 2, where the factor of 2 comes from the real/imaginary split), and for each component I use cubic CG elements. In the end, I obtain about 300,000 DOFs. I’m using the MUMPS solver.

The simulation runs successfully, but I noticed a remarkable difference between fenics 2019.2.0 and a commercial software package in terms of the computation time needed to solve this problem (40 seconds with fenics 2019.2.0, 6 seconds with the commercial software). The number of mesh elements (4000) and the kind of elements used (cubic Lagrange) are the same in both cases, as is the solver (MUMPS).

I tried many ways to lower the simulation time in fenics 2019.2.0, but none of them was successful.

My question is: could dolfinx speed up the simulation thanks to its support for complex numbers? Could there be other strategies for speeding up the simulation (e.g. changing MUMPS parameters and/or using MUMPS in MPI+OpenMP hybrid mode)?

Thank you very much :smiley:


Such a large difference between solve times is bizarre if both packages are setting up and solving the exact same problem with the exact same methods and compiler/MPI configuration (particularly for MUMPS).

If you’re motivated, it may be worth pinpointing the slowest part of your simulation. You can investigate DOLFIN’s underlying timers by calling

    dolfin.list_timings(dolfin.TimingClear.clear, [dolfin.TimingType.wall])

And you can create your own timers to narrow down the expensive portion of your code with

    with dolfin.Timer(f"Timer name: {name}"):
        do_something()

I would also suggest checking your BLAS threading configuration. Sometimes MPI performance improves with the environment variable OMP_NUM_THREADS=1. Also check which BLAS implementation is installed and make sure an optimized BLAS such as OpenBLAS is accessible, not just the generic libblas.so.
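
As a quick way to check from Python (assuming NumPy in the same environment is linked against the BLAS you care about; the BLAS actually loaded by PETSc/MUMPS can differ, so ldd on the PETSc library remains the authoritative check):

import os
os.environ.setdefault("OMP_NUM_THREADS", "1")  # must be set before threaded libraries initialise

import numpy as np
np.show_config()  # prints which BLAS/LAPACK libraries NumPy was built against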


Hi Nate, thanks for your reply.

I can guarantee that the problem is the same in terms of the equations and discretization elements used. The meshes are very similar, meaning that: 1) the geometry is the same, 2) the number of elements is almost the same (a difference of at most ~50 elements), and 3) the spatial variation of the mesh size over the domain is very similar.

Beyond that, I can guarantee that I’m using MUMPS in both cases, but not much else, due to my limited experience with these “advanced” features of FEM solvers. I guess I’m missing some important difference in the settings of the two, which eventually causes this remarkable time difference.

I used the timer at the end of my .py file as you suggested (dolfin.list_timings(dolfin.TimingClear.clear, [dolfin.TimingType.wall])); this is what I got:

[MPI_AVG] Summary of timings                 |  reps    wall avg    wall tot
----------------------------------------------------------------------------
Apply (PETScMatrix)                          |    52   0.0011801    0.061363
Apply (PETScVector)                          |   202  1.8447e-06  0.00037264
Assemble cells                               |     4      2.7282      10.913
Assemble system                              |    37    0.033308      1.2324
Build sparsity                               |    76     0.06545      4.9742
Compute SCOTCH graph re-ordering             |    38    0.002401    0.091239
Compute connectivity 1-2                     |     1  0.00010825  0.00010825
Compute connectivity 2-2                     |     1   0.0001815   0.0001815
Compute entities dim = 1                     |     1   0.0010095   0.0010095
Delete sparsity                              |    76  7.2532e-07  5.5124e-05
DirichletBC apply                            |    12   0.0025537    0.030645
DirichletBC compute bc                       |    12  3.2735e-05  0.00039282
DirichletBC init facets                      |    12  6.8018e-06  8.1621e-05
DistributedMeshTools: reorder vertex values  |     1  0.00019627  0.00019627
Init dof vector                              |    38  0.00050781    0.019297
Init dofmap                                  |    38    0.011918     0.45288
Init dofmap from UFC dofmap                  |    38   0.0008773    0.033337
Init tensor                                  |    76  0.00082284    0.062536
LU solver                                    |    37    0.085155      3.1507
Number distributed mesh entities             |    98  7.7169e-07  7.5626e-05
PETSc Krylov solver                          |    38     0.58682      22.299
SCOTCH: call SCOTCH_graphBuild               |    38  9.5944e-06  0.00036459
SCOTCH: call SCOTCH_graphOrder               |    38   0.0018095    0.068761
Solving linear system                        |    37     0.08525      3.1543

If I understand correctly, the most expensive operations are Assemble cells and PETSc Krylov solver (although I do not understand why a Krylov solver is being used…).

Just to clarify, this is a snippet with the final steps for solving the problem:


lagr_el = FiniteElement('CG', mesh.ufl_cell(), 3)

# 3 vector fields x 3 components x (real, imaginary) = 18 scalar sub-elements
element = MixedElement([lagr_el] * 18)

V = FunctionSpace(mesh, element)

.
.
.

F = weak_form
a, L = lhs(F), rhs(F)

A = assemble(a, keep_diagonal=True)
b = assemble(L)

# bcs contains the boundary conditions
for bc in bcs:
    bc.apply(A, b)

A.ident_zeros()
sol = PETScLUSolver(method="mumps")

x = Function(V)
sol.solve(A, x.vector(), b)

rEr, iEr, rEz, iEz, rEp, iEp, \
rPr, iPr, rPz, iPz, rPp, iPp, \
rFr, iFr, rFz, iFz, rFp, iFp = x.split()

Thank you very much, I will try this approach and let you know :smiley:

From the tiny snippets of code you share here, it seems like you are solving the problem multiple times (37 times, to be exact). Is this what you expect? It would help if you could create a minimal example that we could use to reproduce the behavior you are experiencing.


Hi @dparsons, within the shell where I run the .py file, I executed set OMP_NUM_THREADS = 1, but unfortunately the calculation time does not seem to benefit from it.

It seems that I don’t have OpenBLAS installed… How do I make OpenBLAS accessible to fenics after installation?

Hi @dokken , thanks for your reply.

I don’t know why the problem is solved multiple times… I’m calling sol.solve() just once, as shown in the code. However, I’m using the project() function multiple times (before calling sol.solve()), so maybe that has an influence.

I will try to produce an MWE in which the problem appears to be “solved” multiple times.
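
(If I understand correctly, each project() call assembles and solves its own small linear system, and dolfin’s LU solves also go through PETSc’s KSP interface, which would explain both the reps count and the “PETSc Krylov solver” entry. A minimal sketch of what I mean, with placeholder names:)

from dolfin import *

mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "CG", 1)
f = Expression("x[0]*x[1]", degree=2)

# Each project() assembles a mass matrix and solves a linear system,
# so every call increments the reps of the solver-related timers.
for i in range(3):
    _ = project(f, V)

list_timings(TimingClear.clear, [TimingType.wall])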

Also make sure that the environment variable is exported to subprocesses, e.g.

export OMP_NUM_THREADS=1

or export OMP_NUM_THREADS (exact syntax may depend on your shell)

It depends on how your system is configured. What are you using? On a Debian or Ubuntu-based system you would use something like

sudo apt-get install libopenblas-dev

There are several alternative variants. By default libopenblas-dev will pull in libopenblas-pthread-dev, but there is also libopenblas-openmp-dev (and libopenblas-serial-dev, but you probably don’t want that).

And there are other BLAS alternatives: libatlas-base-dev, libblis-pthread-dev (or libblis-openmp-dev). If necessary you can choose between them with

sudo update-alternatives --config libblas.so-x86_64-linux-gnu
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu

My hypothesis here is that your commercial software controls a specific configuration of MUMPS and BLAS, possibly compiling them in as static libraries, while FEniCS uses whatever shared libraries are available. If the solver algorithms are the same, that might account for the difference (if it’s not simply improved by OMP_NUM_THREADS). Obviously, if your two programs are actually running different algorithms, then you’ll get different timings.


Contrary to what I said earlier, Ubuntu told me that OpenBLAS was already installed when I tried sudo apt-get install libopenblas-dev :sweat_smile: So I assume that wasn’t the problem (provided FEniCS is aware of the presence of OpenBLAS).

I tried with:

set OMP_NUM_THREADS=1
export OMP_NUM_THREADS=1
python3 em_simulation.py

But, again, the time is not improving, unfortunately.

A quick answer to @dokken:

I simplified the problem a bit by reducing the number of project() calls, and this seems to directly affect the number of reps. This is the new list of timings:

[MPI_AVG] Summary of timings      |  reps    wall avg    wall tot
-----------------------------------------------------------------
Apply (PETScMatrix)               |    27   0.0019024    0.051364
Apply (PETScVector)               |    77  1.5787e-06  0.00012156
Assemble cells                    |     3      3.6929      11.079
Assemble system                   |    12    0.044758      0.5371
Build sparsity                    |    26     0.18453      4.7977
Compute SCOTCH graph re-ordering  |    13   0.0013015    0.016919
Compute connectivity 1-2          |     1   0.0001873   0.0001873
Compute connectivity 2-2          |     1  0.00018368  0.00018368
Compute entities dim = 1          |     1   0.0014466   0.0014466
Delete sparsity                   |    26  5.8931e-07  1.5322e-05
DirichletBC apply                 |    12   0.0026973    0.032367
DirichletBC compute bc            |    12  3.2447e-05  0.00038936
DirichletBC init facets           |    12  6.6833e-06  8.0199e-05
Init dof vector                   |    13  0.00071159   0.0092507
Init dofmap                       |    13    0.007286    0.094718
Init dofmap from UFC dofmap       |    13  0.00051333   0.0066733
Init tensor                       |    26   0.0009936    0.025834
LU solver                         |    12    0.036339     0.43606
Number distributed mesh entities  |    23  1.1093e-06  2.5515e-05
PETSc Krylov solver               |    13      1.4923      19.399
SCOTCH: call SCOTCH_graphBuild    |    13  7.7072e-06  0.00010019
SCOTCH: call SCOTCH_graphOrder    |    13  0.00087092    0.011322
Solving linear system             |    12    0.036424     0.43709

In fact, I’m now calling project() 11 times plus sol.solve() once (I just wanted to add this information while preparing a proper MWE). Unfortunately, the total time is not affected by this change.

Oh, wait. That means you’re running fenics on a single process. OMP_NUM_THREADS=1 is intended to improve MPI parallel performance (mpirun -n 8 python3 em_simulation.py); it won’t help single-process jobs. Usually we use MPI to get speedup via parallel computation.

It’s actually the other way around if you’re trying to run fenics on a single process. In that case you might want multiple threads to speed up the BLAS subsystem (export OMP_NUM_THREADS=8).
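
A quick way to see whether the threaded BLAS is actually kicking in (assuming NumPy is linked against the same BLAS) is to time a large dense product with different thread counts:

import time
import numpy as np

A = np.random.rand(3000, 3000)
B = np.random.rand(3000, 3000)

t0 = time.perf_counter()
A @ B
print(f"matmul took {time.perf_counter() - t0:.2f} s")

# Run e.g.
#   OMP_NUM_THREADS=1 python3 blas_check.py
#   OMP_NUM_THREADS=8 python3 blas_check.py
# and compare the two times.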

If your commercial code is also running as a single process then perhaps it’s being clever with threads.

Without having a simple executable script, I don’t see how we can help much further with investigating your performance issues.

Hi all, sorry for the late reply. I wasn’t able to reduce the file I was working on very much, so I decided to switch to a simpler example which shows the same behavior. The MWE can be found at this link. The number of elements is purposely large to better show the time difference.

I’ve reached the best performance (approximately 45 seconds) on my PC by running:

set OMP_NUM_THREADS=8
export OMP_NUM_THREADS=8
python3 em_simulation.py

while the commercial software takes 30 seconds (there is no remarkable difference in the number of mesh elements, and both programs use quadratic discretization elements). The new list of timings is:

[MPI_AVG] Summary of timings      |  reps    wall avg    wall tot
-----------------------------------------------------------------
Apply (PETScMatrix)               |     1     0.10275     0.10275
Apply (PETScVector)               |     5  9.1398e-06  4.5699e-05
Assemble cells                    |     2      1.9116      3.8232
Assemble exterior facets          |     1     0.00831     0.00831
Build sparsity                    |     2     0.98752       1.975
Compute SCOTCH graph re-ordering  |     1     0.83521     0.83521
Compute connectivity 1-2          |     1    0.050942    0.050942
Compute connectivity 2-2          |     1    0.030628    0.030628
Compute entities dim = 1          |     1      0.2336      0.2336
Delete sparsity                   |     2    9.75e-07    1.95e-06
Init dof vector                   |     1    0.094641    0.094641
Init dofmap                       |     1      1.9994      1.9994
Init dofmap from UFC dofmap       |     1    0.071116    0.071116
Init tensor                       |     2     0.13739     0.27478
LU solver                         |     1      38.024      38.024
Number distributed mesh entities  |     2  0.00072921   0.0014584
PETSc Krylov solver               |     1      38.024      38.024
SCOTCH: call SCOTCH_graphBuild    |     1  0.00084793  0.00084793
SCOTCH: call SCOTCH_graphOrder    |     1     0.64771     0.64771

Sorry, I didn’t specify that I was not using MPI to run the simulation. Do you think it would speed up the simulation? I’ve already made some attempts, but they did not seem to improve the time (instead, the time increased, even after running set OMP_NUM_THREADS=1 and export OMP_NUM_THREADS=1).

Obviously, I’m open to any suggestion.

OS: Ubuntu 20.04.2 LTS
Processor: Intel® Core™ i7-10700F CPU @ 2.90GHz × 16
RAM: 64GB

I ran a slightly modified version of your code (em_simulation.py · GitHub)
using docker, with an increasing number of MPI processes.
The only thing I’ve changed is the use of MPI.comm_self to MPI.comm_world, to use mesh partitioning.
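
For reference, a minimal sketch of that change (with an illustrative mesh constructor, not the one in the MWE):

from dolfin import *

# Build the mesh on MPI.comm_world instead of MPI.comm_self,
# so that it gets partitioned across the MPI processes.
mesh = RectangleMesh(MPI.comm_world, Point(0, 0), Point(1, 1), 500, 500)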

The results are as following:

One process:

PETSc Krylov solver               |     1      51.086      51.086

Two processes:

PETSc Krylov solver               |     1      43.321      43.321

Three processes:

PETSc Krylov solver               |     1      37.894      37.894

Four processes:

PETSc Krylov solver               |     1      34.542      34.542

Five processes:

PETSc Krylov solver               |     1      33.736      33.736

Eight processes:

PETSc Krylov solver               |     1      35.197      35.197

As you can see, at some point communicating data across processes becomes more expensive than what it saves in solve time. But note that going from one MPI process to four, the solve runtime decreases from 51.086 to 34.542 seconds.

If you look at the Assemble cells and Build sparsity timings, you will see an even better relative speedup.


I’d recommend using DOLFINx, which has native support for complex numbers.
The algorithmic complexity of a sparse direct solver is at best O(n^2) in 3D, and since you use a mixed function space V = FunctionSpace(mesh, curl_el * curl_el), you double the number of dofs.
So I think you could get at least a 4x speedup in your solver just by using DOLFINx with complex support.
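
For a flavour of what that looks like, here is a minimal complex-mode sketch (not your problem; it requires a PETSc build with complex scalars, and module/function names such as fem.functionspace and dolfinx.fem.petsc.LinearProblem have moved between DOLFINx releases, so adjust for your version):

import numpy as np
from mpi4py import MPI
from petsc4py import PETSc
import ufl
from dolfinx import fem, mesh
from dolfinx.fem.petsc import LinearProblem

# Sanity check: this only works with a complex PETSc/DOLFINx build
assert np.issubdtype(PETSc.ScalarType, np.complexfloating)

msh = mesh.create_unit_square(MPI.COMM_WORLD, 64, 64)
V = fem.functionspace(msh, ("Lagrange", 3))

u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
k0 = 10.0                                        # illustrative wavenumber
f = fem.Constant(msh, PETSc.ScalarType(1 + 1j))  # complex right-hand side

a = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx - k0**2 * ufl.inner(u, v) * ufl.dx
L = ufl.inner(f, v) * ufl.dx

problem = LinearProblem(a, L, petsc_options={"ksp_type": "preonly",
                                             "pc_type": "lu",
                                             "pc_factor_mat_solver_type": "mumps"})
uh = problem.solve()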


Solving a \sim 6\times10^6 DoF sesquilinear problem in 38 seconds using MUMPS is pretty good. Perhaps the commercial software is using better assembly algorithms to push down on the 3.8 seconds in Assemble cells you’ve recorded. Old dolfin is notorious for inserting unnecessary zeros in mixed element formulations. dolfin-x has addressed many of these issues and, as @IgorBaratta states, adds support for the complex unit.

Also, further to @IgorBaratta’s comments, the \mathcal{O}(n^2) complexity is for the FE discretisation of the 3D Poisson operator, and it is \mathcal{O}(n^{\frac{3}{2}}) in 2D. The sparsity pattern of edge-based DoFs messes with this though, and I don’t know the exact algorithmic complexity to expect, but it is likely similar.


Thanks @dokken! I wasn’t able to run my initial simulation with MPI before, but thanks to your snippet I can now do it. From my first attempts, I’m seeing an impressive improvement in simulation time, and, even though it is still larger than the time taken by the commercial software, I’m quite satisfied with this result. Just to give some numbers, on my personal laptop I get 22 seconds (FEniCS) vs 12 seconds (commercial), so I’m expecting something around 12 seconds (FEniCS) vs 6 seconds (commercial) on the workstation, which is almost 4 times faster than my initial time (40 seconds).

For further improvements, I will follow the suggestion from @IgorBaratta and @nate and switch to dolfinx (which was my initial candidate for improving the simulation time), even though this step will probably require a significant amount of time due to the differences between the stable version of fenics and dolfinx.

For dolfinx, I would suggest having a look at:
