Testing weak scalability of parallel solver: cg + hypre_amg

I was trying to test the weak scalability of a parallel solve, using the conjugate gradient (CG) method as the linear solver and hypre_amg as the preconditioner, on a simple 2D Poisson equation.

So what I did was solve the Poisson equation on a 2500 x 2500 unit mesh using 1 core, then on a 5000 x 5000 mesh using 4 cores, and then on a 10000 x 10000 mesh using 16 cores. I only measured the time taken to solve Ax = b in each case. Here is the link to the code.
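For concreteness, a minimal sketch of this kind of experiment in legacy DOLFIN (Python) could look roughly like the following; the mesh resolution, right-hand side and solver tolerance are illustrative placeholders, not the original code behind the link.

# Minimal weak-scaling sketch (legacy DOLFIN): 2D Poisson, CG + hypre_amg,
# timing only the solve of Ax = b. Values below are illustrative.
from dolfin import *
from mpi4py import MPI as pyMPI

comm = pyMPI.COMM_WORLD
n = 2500  # e.g. 2500 on 1 core, 5000 on 4 cores, 10000 on 16 cores

mesh = UnitSquareMesh(n, n)
V = FunctionSpace(mesh, "Lagrange", 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v)) * dx
L = Constant(1.0) * v * dx
bc = DirichletBC(V, Constant(0.0), "on_boundary")

# Assemble outside the timed region so that only Ax = b is measured
A, b = assemble_system(a, L, bc)
uh = Function(V)

solver = KrylovSolver("cg", "hypre_amg")
solver.parameters["relative_tolerance"] = 1e-8

comm.barrier()
t0 = pyMPI.Wtime()
num_iter = solver.solve(A, uh.vector(), b)
comm.barrier()
t1 = pyMPI.Wtime()

if comm.rank == 0:
    print("iterations: %d, solve time: %.3f s" % (num_iter, t1 - t0))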

Here is the result that I got

Issue 1: Since I am using hypre_amg, I was expecting the three cases to take a similar amount of time. However, this is not the case.

Issue 2: I also tried solving on the 5000 x 5000 unit mesh using only one core, as a comparison to using 4 cores, and found that the number of linear iterations is not the same in the two cases. Is this usual? Shouldn't the number of iterations be the same with hypre_amg in the two cases?

Could anyone give me some hints please? Since I am not very familiar with parallel computing, please point out if I said something incorrect in the question.

Hi @georgexxu, did you manage to find out why? I have the same results with a slightly different setup.

Hypre AMG should not be treated as a black box; it has various options that can be tweaked. The non-constant iteration count is your first issue.
You should also note that multigrid solvers should be supplied with a near-nullspace. See for instance:


for a setup that has a constant number of iterations with hypre AMG.
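As a rough illustration of the near-nullspace point for the scalar Poisson case (a sketch following the pattern of the legacy DOLFIN elasticity demo, not the setup linked above), one could attach the constant vector to the assembled operator before solving:

# Sketch: attach a constant near-nullspace to the operator for BoomerAMG
# (legacy DOLFIN with the PETSc backend); mesh size is arbitrary.
from dolfin import *

mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, "Lagrange", 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v)) * dx
L = Constant(1.0) * v * dx
bc = DirichletBC(V, Constant(0.0), "on_boundary")
A, b = assemble_system(a, L, bc)

# For a scalar problem the near-nullspace is just the (normalised) constant
null_vec = Function(V).vector()
null_vec[:] = 1.0
null_vec *= 1.0 / null_vec.norm("l2")
null_space = VectorSpaceBasis([null_vec])
as_backend_type(A).set_near_nullspace(null_space)

solver = KrylovSolver("cg", "hypre_amg")
solver.set_operator(A)
uh = Function(V)
num_iter = solver.solve(uh.vector(), b)
print("CG iterations:", num_iter)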

Nope. But I still went ahead and tested the weak and strong scalability of the hypre_amg preconditioner in FEniCS. The speedup is around 2x if you quadruple the number of cores, if I remember correctly. So it is still speeding up the program, although not at the optimal rate I was expecting.
Perhaps it is because we are using it as a black box, as mentioned in the next reply by Jorgen, so the optimal speedup is not achieved. But I am not familiar with PETSc, which I guess allows more control over the solvers.
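For what it's worth, some PETSc-level control is also available from legacy DOLFIN through the global options database. A sketch (the option values are only examples, and it assumes a DOLFIN version where PETScKrylovSolver exposes set_from_options):

# Sketch: tune BoomerAMG via PETSc options instead of using hypre_amg as a
# black box. Option values here are only examples, not recommendations.
from dolfin import PETScOptions, PETScKrylovSolver

PETScOptions.set("pc_hypre_boomeramg_strong_threshold", 0.5)
PETScOptions.set("pc_hypre_boomeramg_coarsen_type", "HMIS")
PETScOptions.set("pc_hypre_boomeramg_agg_nl", 1)

solver = PETScKrylovSolver("cg", "hypre_amg")
solver.set_from_options()  # apply the options set above (if available)
# ... then solver.set_operator(A) and solver.solve(x, b) as usual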

The performance tests for dolfin and dolfin-x provide a suite for benchmarking your hardware. I’ve used these to consistently demonstrate good scaling. Hopefully they can help you pin down where your system, compilation or formulation is experiencing a bottleneck.

If possible, I recommend you benchmark with a native compilation against your system’s MPI before testing with python or a container.


Thanks Nate. I have tested the FEniCS setup on our university cluster, and the scaling of the performance test is less than ideal. I tried to look up the figures for representative performance from here for comparison; however, the page was empty.

Are there any options for using Singularity to reproduce an ideal setup?

What is your setup and what iteration numbers do you get? Which setup in particular did you test?

You can consider the performance data for dolfinx here:
https://fenics.github.io/performance-test-results/index.html


Thanks for the link! Glad to know that dolfin-x is working well.

I ran the performance test with dolfin to test weak scaling of a Poisson equation. The partitioning scheme used is ParMETIS (as SCOTCH wasn't available). The test was compiled natively.

mpirun -np 40 ./dolfin-scaling-test \
--problem_type poisson \
--scaling_type weak \
--ndofs 500000 \
--petsc.log_view \
--petsc.ksp_view \
--petsc.ksp_type cg \
--petsc.ksp_rtol 1.0e-8 \
--petsc.pc_type hypre \
--petsc.pc_hypre_type boomeramg \
--petsc.pc_hypre_boomeramg_strong_threshold 0.5 \
--petsc.options_left

Each node on the cluster has 40 cores, and I’ve tested it first for intra-node performance.

Cores   Iterations   Assembly (s)   Solve (s)
 1           6          3.259          4.233
 4          10          5.123         12.495
 8          11          5.603         17.498
16          11          5.843         21.311
24          12          6.938         25.999
32          12          7.414         30.154
40          12          7.673         33.703
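For reference, the solve times above can be turned into weak-scaling efficiencies (time on 1 core divided by time on N cores) with a few lines of plain Python; the numbers below are simply copied from the table:

# Weak-scaling efficiency of the solve phase, using the times from the table
solve_times = {1: 4.233, 4: 12.495, 8: 17.498, 16: 21.311,
               24: 25.999, 32: 30.154, 40: 33.703}
t1 = solve_times[1]
for cores in sorted(solve_times):
    t = solve_times[cores]
    print("%2d cores: solve %7.3f s, efficiency %.2f" % (cores, t, t1 / t))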