I was trying to test the weak scalability of a parallel computation, using the conjugate gradient method as the linear solver and hypre_amg as the preconditioner, on a simple 2D Poisson equation.
What I did was solve the Poisson equation on a 2500 × 2500 unit mesh using 1 core, then on a 5000 × 5000 mesh using 4 cores, then on a 10000 × 10000 mesh using 16 cores. I only measure the time taken to solve Ax = b in each case. Here is the link to the code.
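Roughly, the setup looks like the following minimal sketch (using the legacy DOLFIN Python API; the mesh size and tolerance are placeholders, and this is not the exact linked code):

```python
from dolfin import *
import time

# Minimal weak-scaling sketch (legacy DOLFIN API assumed; not the exact linked code).
# n = 2500 on 1 core, 5000 on 4 cores, 10000 on 16 cores, so DOFs per core stay fixed.
n = 2500

mesh = UnitSquareMesh(n, n)
V = FunctionSpace(mesh, "P", 1)

u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v)) * dx
L = Constant(1.0) * v * dx
bc = DirichletBC(V, Constant(0.0), "on_boundary")

A, b = assemble_system(a, L, bc)

solver = KrylovSolver("cg", "hypre_amg")
solver.set_operator(A)
solver.parameters["relative_tolerance"] = 1e-8

uh = Function(V)

# Time only the Ax = b solve, as described above.
MPI.barrier(MPI.comm_world)
t0 = time.perf_counter()
num_iterations = solver.solve(uh.vector(), b)
MPI.barrier(MPI.comm_world)
elapsed = time.perf_counter() - t0

if MPI.rank(MPI.comm_world) == 0:
    print("iterations: %d, solve time: %.3f s" % (num_iterations, elapsed))
```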
Issue 1: Since I am using hypre_amg and the problem size grows in proportion to the number of cores, I was expecting the three cases to take a similar amount of time. However, this is not the case.
Issue 2: I also tried solving on the 5000 × 5000 unit mesh using only one core, to compare with using 4 cores, and found that the number of linear iterations is not the same in the two cases. Is this usual? Shouldn't the number of iterations be the same with hypre_amg in the two cases?
Could anyone give me some hints please? Since I am not very familiar with parallel computing, please point out if I said something incorrect in the question.
Hypre AMG should not be treated as a black box; it has various options that can be tweaked. Since the iteration count is not constant across the cases, the solve time will not be constant either, which is the source of your first issue.
You should also note that multigrid solvers should be supplied with a near-nullspace. See for instance:
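For example, here is a rough sketch of overriding a few BoomerAMG defaults through PETSc options (legacy DOLFIN assumed; the option values below are only illustrative, not tuned recommendations, following the pattern where global PETScOptions set before the solver is created are picked up):

```python
from dolfin import PETScOptions, PETScKrylovSolver

# Illustrative BoomerAMG options (standard PETSc/hypre option names); the values
# are placeholders to show the mechanism, not tuned recommendations.
PETScOptions.set("pc_hypre_boomeramg_strong_threshold", 0.5)
PETScOptions.set("pc_hypre_boomeramg_coarsen_type", "HMIS")
PETScOptions.set("pc_hypre_boomeramg_agg_nl", 2)

# Options set before the solver is created should be picked up when PETSc
# configures the preconditioner.
solver = PETScKrylovSolver("cg", "hypre_amg")
solver.parameters["monitor_convergence"] = True
# solver.set_operator(A); solver.solve(u.vector(), b) as usual
```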
for a setup that has a constant number of iterations with hypre-amg.
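For a scalar Poisson problem the near-nullspace is just the constant vector. A hedged sketch of attaching it, assuming legacy DOLFIN with the PETSc backend (variable names here are illustrative), could look like:

```python
from dolfin import *

# Sketch (assumption: legacy DOLFIN with the PETSc backend). Attach the constant
# vector, which spans the near-nullspace of the Laplacian, to the operator so
# that the AMG preconditioner can make use of it.
mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, "P", 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v)) * dx
L = Constant(1.0) * v * dx
bc = DirichletBC(V, Constant(0.0), "on_boundary")
A, b = assemble_system(a, L, bc)

# Build the near-nullspace basis (normalized constant mode)
constant_mode = interpolate(Constant(1.0), V).vector()
constant_mode *= 1.0 / constant_mode.norm("l2")
near_nullspace = VectorSpaceBasis([constant_mode])
as_backend_type(A).set_near_nullspace(near_nullspace)

solver = PETScKrylovSolver("cg", "hypre_amg")
solver.set_operator(A)
uh = Function(V)
solver.solve(uh.vector(), b)
```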
Nope. But I went ahead and tested the weak and strong scalability of the hypre_amg preconditioner in FEniCS anyway. The speedup is around 2x when you quadruple the number of cores, if I remember correctly. So it still speeds up the program, although not at the optimal rate I was expecting.
Perhaps it is because we are using it as a black box, as mentioned in the next reply by Jorgen, so the optimal speedup is not achieved. But I am not familiar with PETSc, which I guess allows more control over the solvers.
The performance tests for dolfin and dolfin-x provide a suite for benchmarking your hardware. I’ve used these to consistently demonstrate good scaling. Hopefully they can help you pin down where your system, compilation or formulation is experiencing a bottleneck.
If possible, I recommend you benchmark with a native compilation against your system's MPI before testing with Python or a container.
Thanks Nate. I have tested the FEniCS setup on our university cluster, and the scaling of the performance test is less than ideal. I tried to look up representative performance figures from here for comparison; however, the page was empty.
Is there an option to use Singularity to reproduce an ideal setup?
Thanks for the link! Glad to know that dolfin-x is working well.
I ran the performance test with dolfin to test weak scaling of a Poisson equation. The partitioning scheme used was ParMETIS (as SCOTCH wasn't available). The test was compiled natively.
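For reference, switching the partitioner in legacy DOLFIN goes through the global parameter system (a small sketch; the parameter name assumes the legacy API):

```python
from dolfin import parameters, UnitSquareMesh

# Select the mesh partitioner before any mesh is created/distributed.
# "SCOTCH" is the default; "ParMETIS" is used here since SCOTCH was unavailable.
parameters["mesh_partitioner"] = "ParMETIS"

mesh = UnitSquareMesh(100, 100)  # partitioned with ParMETIS when run under MPI
```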