Nonlinear Poisson as a parallelisation example

Consider the nonlinear Poisson tutorial example as an MWE for using NewtonSolver.

I increased the number of nodes by making the mesh 1000x1000. The code works well in parallel as-is, even if I run it as a Python script from the command line without mpirun.

If I turn off the special solver options related to algebraic multigrid, I don’t see it performing so well. What is going on here?

If I then take the problem and run it with e.g. mpirun -n 10 python ..., the performance is always worse than running it directly. Is there any general guidance on taking a NonlinearProblem and solving it with NewtonSolver so that one sees significant speedups from parallelisation?

My own systems are often derived variationally from an energy functional (so highly nonlinear), and I rely heavily on the nonlinear machinery. I notice that the Krylov solves are parallel and quick, but something runs on a single processor for a very long time before them. I checked whether solver.A.assemble() was taking a long time, but it isn’t.

Any guidance here would be appreciated. I thought the nonlinear Poisson example might be a good talking point, because someone has already set it up to run in parallel.

I’m confused by your statement “The code works well in parallel as-is, even if I run it as a Python script from the command line without mpirun”. If you don’t use mpirun, how can you say it works well in parallel?

Regarding the algebraic multigrid preconditioner:
Iterative solvers rarely work out of the box, and more often than not require specialised preconditioning to work effectively and scale efficiently. Simply turning off the preconditioner (presumably while still using the iterative solver?) should not be expected to work well.

Regarding nonlinear solvers and MPI, this should certainly not be an issue. Parallelisation is one of FEniCSx’s core design principles. Are you actually running on a platform that has 10 cores?

In an earlier study (Precondition appears not to happen - #3 by Stein) I found that you have to set the KSP options prefix to an empty string when manipulating the Krylov solver within the NewtonSolver. Might that be related?
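
Something along these lines (a minimal sketch, untested here, following the option-setting pattern of the tutorial code quoted further down in the thread):

ksp = solver.krylov_solver
ksp.setOptionsPrefix("")  # clear the "nls_solve_" prefix so un-prefixed options are picked up
opts = PETSc.Options()
opts["ksp_type"] = "gmres"
opts["pc_type"] = "hypre"
ksp.setFromOptions()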

Thanks for sharing your thoughts. To answer: that is just based on observing what happens when running e.g. from a Jupyter notebook or from the command line without mpirun. See below:

Clearly PETSc/hypre is doing something interesting in the background here.

I’ll have a look at your post, but I don’t think there is anomalous behaviour here. I am asking the question in the sense of: “What happens in the background if I don’t give any PETSc options? Why won’t that respond well to parallelisation with mpirun?”

I tried with and without mpirun, for a) no options, b) just specifying gmres. It is always worse with mpirun. I don’t think AMG is a prerequisite to MPI working effectively, so I am trying to ask my question independently of preconditioner considerations.

I am confused why your program is already using 10 CPUs when you’re not using mpirun. If that is indeed the case, then clearly mpirun can’t benefit you further.

Is this an OMP_NUM_THREADS thing? I thought OMP was no longer used in DOLFINx. Are you using legacy FEniCS? I could be completely off here.

For (a): no options means a direct solver, so be sure to specify a direct solver that actually parallelises, e.g. superlu_dist or preferably MUMPS.
For (b): well, my point is that GMRES without preconditioning will behave unpredictably and is not a valid study case.
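
For reference, selecting MUMPS (or superlu_dist) through the PETSc options could look roughly like this, following the tutorial’s option-setting pattern (a sketch, not tested here):

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_type"] = "preonly"
opts[f"{option_prefix}pc_type"] = "lu"
opts[f"{option_prefix}pc_factor_mat_solver_type"] = "mumps"  # or "superlu_dist"
ksp.setFromOptions()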

Perhaps you could try it for me with the tutorial file. I got FEniCSx 0.9 via conda-forge, so it should be a relatively representative case. All I did was make the mesh larger. If I run it without mpirun it runs in parallel anyway. And that is exactly the point of my question: how do I control this? It runs in parallel anyway!

a) A direct solver, I didn’t realise. Is that documented anywhere? Where would you otherwise find that out? Actually, it reports “PETSc Krylov solver starting to solve…”, so perhaps there is some automation when it realises the problem is a little too big.
b) I’m no expert, but I don’t think the conditioning of this problem is so bad that it would be unpredictable without any preconditioning. It’s just the Poisson equation. Also, it solves fine without one.

Here is a graph with GMRES explicitly selected, but without the AMG/hypre options. I ran it without mpirun. You can see the Krylov solves happen in parallel; they are the little bumps. It took just over 1 minute 14 seconds (ignore the time scale on the graph, it is wrong for some reason).

Here is a version with mpirun -n 2 python .... It immediately maxes out everything and solves with the same residuals in each Newton iteration. That was 4 minutes.

Is this a bug maybe?

It’s difficult to give guidance without knowing exactly how your system is configured. What happens when you run this problem in one of the provided containers rather than your system configuration?

If I run it without mpirun it runs in parallel anyway.

This is bizarre.

For (a): no options means a direct solver, so be sure to specify a direct solver that actually parallelises, e.g. superlu_dist or preferably MUMPS.

I’m not sure this is true. Setting no PETSc options typically defaults to GMRES with block Jacobi, applying an incomplete LU factorisation to each block. I could see how this would perform poorly with the Gateaux derivative of a nonlinear elliptic operator. Turning on -ksp_view will elucidate, or even -ksp_monitor, which should yield one iteration per solve when a direct solver is applied as the preconditioner.
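
Using the tutorial’s option-setting pattern, enabling those could look something like this (a sketch; the option prefix is the one that shows up in the ksp_view output below):

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_view"] = None     # print the KSP/PC configuration
opts[f"{option_prefix}ksp_monitor"] = None  # print the residual at every Krylov iteration
ksp.setFromOptions()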

b) I’m no expert, but I don’t think the conditioning of this problem is so bad that it would be unpredictable without any preconditioning. It’s just the Poisson equation. Also, it solves fine without one.

Sadly, it’s not just the Poisson problem; it’s the Gateaux derivative thereof. It’s non-symmetric and, depending on the coefficient q(u_h), could be “advection dominated”. This is problematic for some algebraic multigrid preconditioners.
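
To spell that out (a sketch using the demo’s residual $F(u_h; v) = \int_\Omega q(u_h)\,\nabla u_h \cdot \nabla v \,\mathrm{d}x - \int_\Omega f v \,\mathrm{d}x$ with $q(u) = 1 + u^2$): the Jacobian assembled in each Newton step is the Gateaux derivative

$J(u_h; \delta u, v) = \int_\Omega q(u_h)\,\nabla \delta u \cdot \nabla v \,\mathrm{d}x + \int_\Omega q'(u_h)\,\delta u\,\nabla u_h \cdot \nabla v \,\mathrm{d}x,$

and it is the second, first-order term that breaks the symmetry and can dominate when $\nabla u_h$ is large relative to $q(u_h)$.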

Hmm, if (in the demo referenced in the OP) I make this code edit:

solver = NewtonSolver(MPI.COMM_WORLD, problem)
solver.convergence_criterion = "incremental"
solver.rtol = 1e-6
solver.report = True

# We can modify the linear solver in each Newton iteration by accessing the underlying `PETSc` object.

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_view"] = None
# opts[f"{option_prefix}ksp_type"] = "gmres"
# opts[f"{option_prefix}ksp_rtol"] = 1.0e-8
# opts[f"{option_prefix}pc_type"] = "hypre"
# opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"
# opts[f"{option_prefix}pc_hypre_boomeramg_max_iter"] = 1
# opts[f"{option_prefix}pc_hypre_boomeramg_cycle_type"] = "v"
ksp.setFromOptions()

Then I get this output:

2025-01-23 08:03:46.320 (   1.862s) [main            ]              petsc.cpp:700   INFO| PETSc Krylov solver starting to solve system.
KSP Object: (nls_solve_) 1 MPI process
  type: preonly
  maximum iterations=10000, initial guess is zero
  tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using NONE norm type for convergence test
PC Object: (nls_solve_) 1 MPI process
  type: lu
    out-of-place factorization
    tolerance for zero pivot 2.22045e-14
    matrix ordering: nd
    factor fill ratio given 5., needed 2.85283
      Factored matrix follows:
        Mat Object: (nls_solve_) 1 MPI process
          type: seqaij
          rows=121, cols=121
          package used to perform factorization: petsc
          total: nonzeros=2171, allocated nonzeros=2171
            not using I-node routines
  linear system matrix = precond matrix:
  Mat Object: 1 MPI process
    type: seqaij
    rows=121, cols=121
    total: nonzeros=761, allocated nonzeros=761
    total number of mallocs used during MatSetValues calls=0
      not using I-node routines

So KSP type='preonly' and PC type='lu' and MAT type=seqaij. Am I misinterpreting things?

EDIT:
If I run with mpirun, then the MAT type defaults to MUMPS, so that appears to be all right. But @mg-tub you might want to test whether this is also true on your system.


OK, thanks for your replies. I think the results are all interesting. I tested this again on my home machine, which is also a Debian one, and I installed via conda-forge on both. The behaviour is the same.

Firstly, regarding the default solver choice: PETSc lumps even the non-Krylov solvers into KSP and calls them Krylov solvers! So even though I see

[info] PETSc Krylov solver starting to solve system.

actually with ksp_view turned on I get (edited)

type: preonly

like Stein. So it is a direct solver, even when the problem is large. That behaviour is the same on my system as in the Docker image. It also switches to MUMPS if I use mpirun.

I am running Debian testing/sid and have installed via conda-forge.

What happens when you run this problem in one of the provided containers rather than your system configuration?

I was surprised to see that the Docker image behaves as one would expect. Both systems I tested used conda-forge, so I guess something has been tweaked in the PETSc packages there, leading to rather unexpected and, when mpirun is applied, erroneous behaviour.
So that explains the source of my problem somewhat, but I guess this is affecting a lot of users. Do you suspect it is the FEniCS packaging on conda-forge? Should I report it there? @francesco-ballarin, I think you are managing that, right?

Point taken. It’s slightly more fun than watching paint dry, but probably because the coefficient used in the example is nicely convex, it solves with GMRES without preconditioning. Interestingly, it takes more than twice as many Newton iterations, which I’d guess shows it is only just satisfying the tolerances each time, whereas the preconditioned version probably overshoots them significantly.

Turning on -ksp_view will elucidate

Indeed it did. It was also interesting to learn that the GMRES restart number is 30, which I had been looking for everywhere, and to see this line:

maximum iterations=10000, initial guess is zero
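
As a side note on the restart number mentioned above: it can presumably be changed through the same options mechanism used in the tutorial, something like the following (untested here):

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_gmres_restart"] = 100  # default is 30, as reported above
ksp.setFromOptions()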

Wouldn’t we see significant speed-ups by using the previous solution of the Newton iteration as an initial guess? If I were implementing this myself, I would have done so by default.


Oh no I wouldn’t: the Krylov solve is for the Newton increment, not for the solution itself, so a zero initial guess makes sense. Time for coffee.

Just to conclude things here: it seems the conda-forge packages for v0.9 (at least when installed on Debian testing and Debian sid systems) will run in parallel without using mpirun.

Trying to speed them up further with mpirun then leads to erroneous behaviour.

If anyone would like to provide info for other systems to confirm or contradict that, it would be appreciated. I observed normal behaviour with the Docker image.

Running the slightly adapted code below (with timings printed)

# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.16.5
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# # Implementation
#
# Author: Jørgen S. Dokken
#
# ## Test problem
# To solve a test problem, we need to choose the right hand side $f$, the coefficient $q(u)$, and the boundary $u_D$. Previously, we have worked with manufactured solutions that can  be reproduced without approximation errors. This is more difficult in nonlinear problems, and the algebra is more tedious. However, we will utilize the UFL differentiation capabilities to obtain a manufactured solution.
#
# For this problem, we will choose $q(u) = 1 + u^2$ and define a two dimensional manufactured solution that is linear in $x$ and $y$:

# +
import ufl
import numpy

from mpi4py import MPI
from petsc4py import PETSc

from dolfinx import mesh, fem, log
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver
from time import perf_counter

start = perf_counter()


def q(u):
    return 1 + u**2


N = 1000

domain = mesh.create_unit_square(MPI.COMM_WORLD, N, N)
x = ufl.SpatialCoordinate(domain)
u_ufl = 1 + x[0] + 2 * x[1]
f = -ufl.div(q(u_ufl) * ufl.grad(u_ufl))
# -

# Note that since `x` is a 2D vector, the first component (index 0) represents $x$, while the second component (index 1) represents $y$. The resulting function `f` can be directly used in variational formulations in DOLFINx.
#
# As we now have defined our source term and an exact solution, we can create the appropriate function space and boundary conditions.
# Note that as we have already defined the exact solution, we only have to convert it to a Python function that can be evaluated in the interpolation function. We do this by employing the Python `eval` and `lambda`-functions.

V = fem.functionspace(domain, ("Lagrange", 1))


def u_exact(x):
    return eval(str(u_ufl))


u_D = fem.Function(V)
u_D.interpolate(u_exact)
fdim = domain.topology.dim - 1
boundary_facets = mesh.locate_entities_boundary(
    domain, fdim, lambda x: numpy.full(x.shape[1], True, dtype=bool)
)
bc = fem.dirichletbc(u_D, fem.locate_dofs_topological(V, fdim, boundary_facets))

# We are now ready to define the variational formulation. Note that as the problem is nonlinear, we have to replace the `TrialFunction` with a `Function`, which serves as the unknown of our problem.

uh = fem.Function(V)
v = ufl.TestFunction(V)
F = q(uh) * ufl.dot(ufl.grad(uh), ufl.grad(v)) * ufl.dx - f * v * ufl.dx

# ## Newton's method
# The next step is to define the non-linear problem. As it is non-linear we will use [Newtons method](https://en.wikipedia.org/wiki/Newton%27s_method).
# For details about how to implement a Newton solver, see [Custom Newton solvers](../chapter4/newton-solver.ipynb).
# Newton's method requires methods for evaluating the residual `F` (including application of boundary conditions), as well as a method for computing the Jacobian matrix. DOLFINx provides the function `NonlinearProblem` that implements these methods. In addition to the boundary conditions, you can supply the variational form for the Jacobian (computed if not supplied), and form and jit parameters, see the [JIT parameters section](../chapter4/compiler_parameters.ipynb).

problem = NonlinearProblem(F, uh, bcs=[bc])

# Next, we use the DOLFINx Newton solver. We can set the convergence criteria for the solver by changing the absolute tolerance (`atol`), relative tolerance (`rtol`) or the convergence criterion (`residual` or `incremental`).

solver = NewtonSolver(MPI.COMM_WORLD, problem)
solver.convergence_criterion = "incremental"
solver.rtol = 1e-6
solver.report = True

# We can modify the linear solver in each Newton iteration by accessing the underlying `PETSc` object.

ksp = solver.krylov_solver
opts = PETSc.Options()
option_prefix = ksp.getOptionsPrefix()
opts[f"{option_prefix}ksp_type"] = "gmres"
opts[f"{option_prefix}ksp_rtol"] = 1.0e-8
opts[f"{option_prefix}pc_type"] = "hypre"
opts[f"{option_prefix}pc_hypre_type"] = "boomeramg"
opts[f"{option_prefix}pc_hypre_boomeramg_max_iter"] = 1
opts[f"{option_prefix}pc_hypre_boomeramg_cycle_type"] = "v"
ksp.setFromOptions()

# We are now ready to solve the non-linear problem. We assert that the solver has converged and print the number of iterations.

log.set_log_level(log.LogLevel.INFO)
n, converged = solver.solve(uh)
assert converged
print(f"Number of interations: {n:d}")

# We observe that the solver converges after $8$ iterations.
# If we think of the problem in terms of finite differences on a uniform mesh, $\mathcal{P}_1$ elements mimic standard second-order finite differences, which compute the derivative of a linear or quadratic function exactly. Here $\nabla u$ is a constant vector, which is multiplied by $1+u^2$, giving a second-order polynomial in $x$ and $y$, which the finite difference operator would compute exactly. We can therefore, even with $\mathcal{P}_1$ elements, expect the manufactured solution to be reproduced by the numerical method. However, if we had chosen a nonlinearity, such as $1+u^4$, this would not be the case, and we would need to verify convergence rates.

# +
# Compute L2 error and error at nodes
V_ex = fem.functionspace(domain, ("Lagrange", 2))
u_ex = fem.Function(V_ex)
u_ex.interpolate(u_exact)
error_local = fem.assemble_scalar(fem.form((uh - u_ex) ** 2 * ufl.dx))
error_L2 = numpy.sqrt(domain.comm.allreduce(error_local, op=MPI.SUM))
if domain.comm.rank == 0:
    print(f"L2-error: {error_L2:.2e}")

# Compute values at mesh vertices
error_max = domain.comm.allreduce(
    numpy.max(numpy.abs(uh.x.array - u_D.x.array)), op=MPI.MAX
)
if domain.comm.rank == 0:
    print(f"Error_max: {error_max:.2e}")
end = perf_counter()
if MPI.COMM_WORLD.rank == 0:
    print(f"{MPI.COMM_WORLD.size} -Time: {end - start:.2f}")

executed on Ubuntu 22.04 with conda-forge (Python 3.12), and the following env export:

name: test_scaling
channels:
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - attr=2.5.1=h166bdaf_1
  - binutils_impl_linux-64=2.43=h4bf12b8_2
  - binutils_linux-64=2.43=h4852527_2
  - blis=0.9.0=h4ab18f5_2
  - bzip2=1.0.8=h4bc722e_7
  - c-ares=1.34.4=hb9d3cd8_0
  - c-blosc2=2.15.2=h3122c55_1
  - ca-certificates=2024.12.14=hbcca054_0
  - cffi=1.16.0=py312hf06ca03_0
  - fenics-basix=0.9.0=py312h9c9c0ab_2
  - fenics-basix-nanobind-abi=0.2.1.13=h6c05e69_2
  - fenics-dolfinx=0.9.0=py312hef1a67e_108
  - fenics-ffcx=0.9.0=pyh2e48890_0
  - fenics-libbasix=0.9.0=h7cb7ce6_2
  - fenics-libdolfinx=0.9.0=hb85e8c2_108
  - fenics-ufcx=0.9.0=hb7f7608_0
  - fenics-ufl=2024.2.0=pyhd8ed1ab_1
  - fftw=3.3.10=mpi_mpich_hbcf76dd_10
  - fmt=11.0.2=h434a139_0
  - gcc_impl_linux-64=13.3.0=hfea6d02_1
  - gcc_linux-64=13.3.0=hc28eda2_7
  - hdf5=1.14.3=mpi_mpich_h7f58efa_9
  - hypre=2.32.0=mpi_mpich_h2e71eac_1
  - icu=75.1=he02047a_0
  - kahip=3.18=h7d9e1f9_0
  - kernel-headers_linux-64=3.10.0=he073ed8_18
  - keyutils=1.6.1=h166bdaf_0
  - krb5=1.21.3=h659f571_0
  - ld_impl_linux-64=2.43=h712a8e2_2
  - libadios2=2.10.2=mpi_mpich_hd47ee72_1
  - libaec=1.1.3=h59595ed_0
  - libamd=3.3.3=ss783_h889e182
  - libblas=3.9.0=26_linux64_blis
  - libboost=1.86.0=h6c02f8c_3
  - libboost-devel=1.86.0=h1a2810e_3
  - libboost-headers=1.86.0=ha770c72_3
  - libbtf=2.3.2=ss783_h2377355
  - libcamd=3.3.3=ss783_h2377355
  - libcap=2.71=h39aace5_0
  - libcblas=3.9.0=26_linux64_blis
  - libccolamd=3.3.4=ss783_h2377355
  - libcholmod=5.3.0=ss783_h3fa60b6
  - libcolamd=3.3.4=ss783_h2377355
  - libcurl=8.11.1=h332b0f4_0
  - libedit=3.1.20240808=pl5321h7949ede_0
  - libev=4.33=hd590300_2
  - libexpat=2.6.4=h5888daf_0
  - libfabric=2.0.0=ha770c72_1
  - libfabric1=2.0.0=h14e6f36_1
  - libffi=3.4.2=h7f98852_5
  - libgcc=14.2.0=h77fa898_1
  - libgcc-devel_linux-64=13.3.0=h84ea5a7_101
  - libgcc-ng=14.2.0=h69a702a_1
  - libgcrypt-lib=1.11.0=hb9d3cd8_2
  - libgfortran=14.2.0=h69a702a_1
  - libgfortran-ng=14.2.0=h69a702a_1
  - libgfortran5=14.2.0=hd5240d6_1
  - libgomp=14.2.0=h77fa898_1
  - libgpg-error=1.51=hbd13f7d_1
  - libhwloc=2.11.2=default_h0d58e46_1001
  - libiconv=1.17=hd590300_2
  - libklu=2.3.5=ss783_hfbdfdfc
  - liblapack=3.9.0=8_h3b12eaf_netlib
  - liblzma=5.6.3=hb9d3cd8_1
  - libnghttp2=1.64.0=h161d5f1_0
  - libnl=3.11.0=hb9d3cd8_0
  - libnsl=2.0.1=hd590300_0
  - libpng=1.6.45=h943b412_0
  - libptscotch=7.0.6=h4c3caac_1
  - libsanitizer=13.3.0=heb74ff8_1
  - libscotch=7.0.6=hea33c07_1
  - libsodium=1.0.20=h4ab18f5_0
  - libspqr=4.3.4=ss783_hae1ff0d
  - libsqlite=3.48.0=hee588c1_1
  - libssh2=1.11.1=hf672d98_0
  - libstdcxx=14.2.0=hc0a3c3a_1
  - libstdcxx-ng=14.2.0=h4852527_1
  - libsuitesparseconfig=7.8.3=ss783_h83006af
  - libsystemd0=257.2=h3dc2cb9_0
  - libudev1=257.2=h9a4d06a_0
  - libumfpack=6.3.5=ss783_hd4f9ce1
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libxml2=2.13.5=h8d12d68_1
  - libzlib=1.3.1=hb9d3cd8_2
  - lz4-c=1.10.0=h5888daf_1
  - metis=5.1.0=hd0bcaf9_1007
  - mpi=1.0.1=mpich
  - mpi4py=4.0.1=py312h0a6c937_1
  - mpich=4.2.3=h1a8bee6_104
  - mumps-include=5.7.3=ha770c72_6
  - mumps-mpi=5.7.3=h2e1f7a5_6
  - ncurses=6.5=h2d0b736_2
  - numpy=2.2.2=py312h72c5963_0
  - openssl=3.4.0=h7b32b05_1
  - parmetis=4.0.3=hc7bef4e_1007
  - petsc=3.22.2=real_h91a077e_103
  - petsc4py=3.22.2=py312h84d7c54_0
  - pip=24.3.1=pyh8b19718_2
  - pkg-config=0.29.2=h4bc722e_1009
  - pugixml=1.14=h59595ed_0
  - pycparser=2.22=pyh29332c3_1
  - python=3.12.8=h9e4cc4f_1_cpython
  - python_abi=3.12=5_cp312
  - rdma-core=55.0=h5888daf_0
  - readline=8.2=h8228510_1
  - scalapack=2.2.0=h7e29ba8_4
  - setuptools=75.8.0=pyhff2d567_0
  - slepc=3.22.2=real_h754b140_300
  - slepc4py=3.22.2=py312h377abe1_0
  - spdlog=1.14.1=hed91bc2_1
  - superlu=5.2.2=h00795ac_0
  - superlu_dist=9.1.0=h0804ebd_0
  - sysroot_linux-64=2.17=h0157908_18
  - tk=8.6.13=noxft_h4845f30_101
  - tzdata=2025a=h78e105d_0
  - ucx=1.18.0=h53fb5aa_0
  - wheel=0.45.1=pyhd8ed1ab_1
  - yaml=0.2.5=h7f98852_2
  - zeromq=4.3.5=h3b0a872_7
  - zfp=0.5.5=h9c3ff4c_8
  - zlib-ng=2.2.3=h7955e40_0
  - zstd=1.5.6=ha6fb4c9_0

yields

# Serial
L2-error: 6.67e-16
Error_max: 4.44e-15
1 -Time: 16.94
#  2 proc
L2-error: 6.05e-16
Error_max: 4.88e-15
2 -Time: 12.98

# 4 proc
L2-error: 6.51e-16
Error_max: 5.33e-15
4 -Time: 9.46

# 8 proc
Number of iterations: 8
L2-error: 6.00e-16
Error_max: 4.88e-15
8 -Time: 10.16

which clearly indicates that there is a sweet spot in terms of partitioning at around 4 processes. I’ve already commented on this in several other places (including recently in MPI acceleration with FEniCSx - #3 by dokken).

Thanks, I’ll test the same. I’m aware that using more processes doesn’t always speed things up, but I was referring to truly weird behaviour: mpirun -n 2 ... uses all processors.

That must be something with the given installation on the given system, and one would need at least a conda env export to be able to say anything about what could be wrong.

I tried it with a new environment and get the same behaviour. It seems like PETSc is choosing to do things in parallel by itself. Could you have a look at your processor activity while running it in serial mode? The hypre AMG preconditioner certainly loads all of my cores. If I turn it off and just use GMRES with the default PC, I get the same behaviour as above, i.e. in serial mode it is indeed serial, with little blips of fully parallel activity that correspond to the solves; possibly the preconditioner setup is parallelised?

I wonder if other libraries like deal.II are installing dependencies that might activate this behaviour. Or is everything in the environment self-contained, including PETSc libraries?

name: fenx25.9
channels:
- conda-forge
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_kmp_llvm
- alsa-lib=1.2.13=hb9d3cd8_0
- anyio=4.8.0=pyhd8ed1ab_0
- argon2-cffi=23.1.0=pyhd8ed1ab_1
- argon2-cffi-bindings=21.2.0=py312h66e93f0_5
- arrow=1.3.0=pyhd8ed1ab_1
- asttokens=3.0.0=pyhd8ed1ab_1
- async-lru=2.0.4=pyhd8ed1ab_1
- attr=2.5.1=h166bdaf_1
- attrs=24.3.0=pyh71513ae_0
- babel=2.16.0=pyhd8ed1ab_1
- beautifulsoup4=4.12.3=pyha770c72_1
- binutils_impl_linux-64=2.43=h4bf12b8_2
- binutils_linux-64=2.43=h4852527_2
- bleach=6.2.0=pyh29332c3_4
- bleach-with-css=6.2.0=h82add2a_4
- brotli=1.1.0=hb9d3cd8_2
- brotli-bin=1.1.0=hb9d3cd8_2
- brotli-python=1.1.0=py312h2ec8cdc_2
- bzip2=1.0.8=h4bc722e_7
- c-ares=1.34.4=hb9d3cd8_0
- c-blosc2=2.15.2=h3122c55_1
- ca-certificates=2024.12.14=hbcca054_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cairo=1.18.2=h3394656_1
- certifi=2024.12.14=pyhd8ed1ab_0
- cffi=1.16.0=py312hf06ca03_0
- charset-normalizer=3.4.1=pyhd8ed1ab_0
- comm=0.2.2=pyhd8ed1ab_1
- contourpy=1.3.1=py312h68727a3_0
- cycler=0.12.1=pyhd8ed1ab_1
- cyrus-sasl=2.1.27=h54b06d7_7
- dbus=1.13.6=h5008d03_3
- debugpy=1.8.12=py312h2ec8cdc_0
- decorator=5.1.1=pyhd8ed1ab_1
- defusedxml=0.7.1=pyhd8ed1ab_0
- double-conversion=3.3.0=h59595ed_0
- entrypoints=0.4=pyhd8ed1ab_1
- exceptiongroup=1.2.2=pyhd8ed1ab_1
- executing=2.1.0=pyhd8ed1ab_1
- expat=2.6.4=h5888daf_0
- fenics-basix=0.9.0=py312h9c9c0ab_2
- fenics-basix-nanobind-abi=0.2.1.13=h6c05e69_2
- fenics-dolfinx=0.9.0=py312hef1a67e_108
- fenics-ffcx=0.9.0=pyh2e48890_0
- fenics-libbasix=0.9.0=h7cb7ce6_2
- fenics-libdolfinx=0.9.0=hb85e8c2_108
- fenics-ufcx=0.9.0=hb7f7608_0
- fenics-ufl=2024.2.0=pyhd8ed1ab_1
- fftw=3.3.10=mpi_mpich_hbcf76dd_10
- fmt=11.0.2=h434a139_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_3
- fontconfig=2.15.0=h7e30c49_1
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- fonttools=4.55.4=py312h178313f_0
- fqdn=1.5.1=pyhd8ed1ab_1
- freetype=2.12.1=h267a509_2
- gcc_impl_linux-64=13.3.0=hfea6d02_1
- gcc_linux-64=13.3.0=hc28eda2_7
- graphite2=1.3.13=h59595ed_1003
- h11=0.14.0=pyhd8ed1ab_1
- h2=4.1.0=pyhd8ed1ab_1
- harfbuzz=10.2.0=h4bba637_0
- hdf5=1.14.3=mpi_mpich_h7f58efa_9
- hpack=4.1.0=pyhd8ed1ab_0
- httpcore=1.0.7=pyh29332c3_1
- httpx=0.28.1=pyhd8ed1ab_0
- hyperframe=6.1.0=pyhd8ed1ab_0
- hypre=2.32.0=mpi_mpich_h2e71eac_1
- icu=75.1=he02047a_0
- idna=3.10=pyhd8ed1ab_1
- importlib-metadata=8.6.1=pyha770c72_0
- importlib_resources=6.5.2=pyhd8ed1ab_0
- ipykernel=6.29.5=pyh3099207_0
- ipython=8.31.0=pyh707e725_0
- ipywidgets=8.1.5=pyhd8ed1ab_1
- isoduration=20.11.0=pyhd8ed1ab_1
- jedi=0.19.2=pyhd8ed1ab_1
- jinja2=3.1.5=pyhd8ed1ab_0
- json5=0.10.0=pyhd8ed1ab_1
- jsonpointer=3.0.0=py312h7900ff3_1
- jsonschema=4.23.0=pyhd8ed1ab_1
- jsonschema-specifications=2024.10.1=pyhd8ed1ab_1
- jsonschema-with-format-nongpl=4.23.0=hd8ed1ab_1
- jupyter=1.1.1=pyhd8ed1ab_1
- jupyter-lsp=2.2.5=pyhd8ed1ab_1
- jupyter_client=8.6.3=pyhd8ed1ab_1
- jupyter_console=6.6.3=pyhd8ed1ab_1
- jupyter_core=5.7.2=pyh31011fe_1
- jupyter_events=0.11.0=pyhd8ed1ab_0
- jupyter_server=2.15.0=pyhd8ed1ab_0
- jupyter_server_terminals=0.5.3=pyhd8ed1ab_1
- jupyterlab=4.3.4=pyhd8ed1ab_0
- jupyterlab_pygments=0.3.0=pyhd8ed1ab_2
- jupyterlab_server=2.27.3=pyhd8ed1ab_1
- jupyterlab_widgets=3.0.13=pyhd8ed1ab_1
- kahip=3.18=h7d9e1f9_0
- kernel-headers_linux-64=3.10.0=he073ed8_18
- keyutils=1.6.1=h166bdaf_0
- kiwisolver=1.4.8=py312h84d6215_0
- krb5=1.21.3=h659f571_0
- lcms2=2.16=hb7c19ff_0
- ld_impl_linux-64=2.43=h712a8e2_2
- lerc=4.0.0=h27087fc_0
- libadios2=2.10.2=mpi_mpich_hd47ee72_1
- libaec=1.1.3=h59595ed_0
- libamd=3.3.3=ss783_h889e182
- libblas=3.9.0=26_linux64_openblas
- libboost=1.86.0=h6c02f8c_3
- libboost-devel=1.86.0=h1a2810e_3
- libboost-headers=1.86.0=ha770c72_3
- libbrotlicommon=1.1.0=hb9d3cd8_2
- libbrotlidec=1.1.0=hb9d3cd8_2
- libbrotlienc=1.1.0=hb9d3cd8_2
- libbtf=2.3.2=ss783_h2377355
- libcamd=3.3.3=ss783_h2377355
- libcap=2.71=h39aace5_0
- libcblas=3.9.0=26_linux64_openblas
- libccolamd=3.3.4=ss783_h2377355
- libcholmod=5.3.0=ss783_h3fa60b6
- libclang-cpp19.1=19.1.7=default_hb5137d0_0
- libclang13=19.1.7=default_h9c6a7e4_0
- libcolamd=3.3.4=ss783_h2377355
- libcups=2.3.3=h4637d8d_4
- libcurl=8.11.1=h332b0f4_0
- libdeflate=1.23=h4ddbbb0_0
- libdrm=2.4.124=hb9d3cd8_0
- libedit=3.1.20240808=pl5321h7949ede_0
- libegl=1.7.0=ha4b6fd6_2
- libev=4.33=hd590300_2
- libexpat=2.6.4=h5888daf_0
- libfabric=2.0.0=ha770c72_1
- libfabric1=2.0.0=h14e6f36_1
- libffi=3.4.2=h7f98852_5
- libgcc=14.2.0=h77fa898_1
- libgcc-devel_linux-64=13.3.0=h84ea5a7_101
- libgcc-ng=14.2.0=h69a702a_1
- libgcrypt-lib=1.11.0=hb9d3cd8_2
- libgfortran=14.2.0=h69a702a_1
- libgfortran-ng=14.2.0=h69a702a_1
- libgfortran5=14.2.0=hd5240d6_1
- libgl=1.7.0=ha4b6fd6_2
- libglib=2.82.2=h2ff4ddf_1
- libglvnd=1.7.0=ha4b6fd6_2
- libglx=1.7.0=ha4b6fd6_2
- libgomp=14.2.0=h77fa898_1
- libgpg-error=1.51=hbd13f7d_1
- libhwloc=2.11.2=default_h0d58e46_1001
- libiconv=1.17=hd590300_2
- libjpeg-turbo=3.0.0=hd590300_1
- libklu=2.3.5=ss783_hfbdfdfc
- liblapack=3.9.0=26_linux64_openblas
- libllvm19=19.1.7=ha7bfdaf_0
- liblzma=5.6.3=hb9d3cd8_1
- libnghttp2=1.64.0=h161d5f1_0
- libnl=3.11.0=hb9d3cd8_0
- libnsl=2.0.1=hd590300_0
- libntlm=1.8=hb9d3cd8_0
- libopenblas=0.3.28=openmp_hd680484_1
- libopengl=1.7.0=ha4b6fd6_2
- libpciaccess=0.18=hd590300_0
- libpng=1.6.45=h943b412_0
- libpq=17.2=h3b95a9b_1
- libptscotch=7.0.6=h4c3caac_1
- libsanitizer=13.3.0=heb74ff8_1
- libscotch=7.0.6=hea33c07_1
- libsodium=1.0.20=h4ab18f5_0
- libspqr=4.3.4=ss783_hae1ff0d
- libsqlite=3.48.0=hee588c1_1
- libssh2=1.11.1=hf672d98_0
- libstdcxx=14.2.0=hc0a3c3a_1
- libstdcxx-ng=14.2.0=h4852527_1
- libsuitesparseconfig=7.8.3=ss783_h83006af
- libsystemd0=257.2=h3dc2cb9_0
- libtiff=4.7.0=hd9ff511_3
- libudev1=257.2=h9a4d06a_0
- libumfpack=6.3.5=ss783_hd4f9ce1
- libuuid=2.38.1=h0b41bf4_0
- libwebp-base=1.5.0=h851e524_0
- libxcb=1.17.0=h8a09558_0
- libxcrypt=4.4.36=hd590300_1
- libxkbcommon=1.7.0=h2c5496b_1
- libxml2=2.13.5=h8d12d68_1
- libxslt=1.1.39=h76b75d6_0
- libzlib=1.3.1=hb9d3cd8_2
- llvm-openmp=19.1.7=h024ca30_0
- lz4-c=1.10.0=h5888daf_1
- markupsafe=3.0.2=py312h178313f_1
- matplotlib=3.10.0=py312h7900ff3_0
- matplotlib-base=3.10.0=py312hd3ec401_0
- matplotlib-inline=0.1.7=pyhd8ed1ab_1
- metis=5.1.0=hd0bcaf9_1007
- mistune=3.1.0=pyhd8ed1ab_0
- mpi=1.0.1=mpich
- mpi4py=4.0.1=py312h0a6c937_1
- mpich=4.2.3=h1a8bee6_104
- mumps-include=5.7.3=ha770c72_6
- mumps-mpi=5.7.3=h2e1f7a5_6
- munkres=1.1.4=pyh9f0ad1d_0
- mysql-common=9.0.1=h266115a_4
- mysql-libs=9.0.1=he0572af_4
- nbclient=0.10.2=pyhd8ed1ab_0
- nbconvert-core=7.16.5=pyhd8ed1ab_1
- nbformat=5.10.4=pyhd8ed1ab_1
- ncurses=6.5=h2d0b736_2
- nest-asyncio=1.6.0=pyhd8ed1ab_1
- notebook=7.3.2=pyhd8ed1ab_0
- notebook-shim=0.2.4=pyhd8ed1ab_1
- numpy=2.2.2=py312h72c5963_0
- openjpeg=2.5.3=h5fbd93e_0
- openldap=2.6.9=he970967_0
- openssl=3.4.0=h7b32b05_1
- overrides=7.7.0=pyhd8ed1ab_1
- packaging=24.2=pyhd8ed1ab_2
- pandocfilters=1.5.0=pyhd8ed1ab_0
- parmetis=4.0.3=hc7bef4e_1007
- parso=0.8.4=pyhd8ed1ab_1
- pcre2=10.44=hba22ea6_2
- petsc=3.22.2=real_h91a077e_103
- petsc4py=3.22.2=py312h84d7c54_0
- pexpect=4.9.0=pyhd8ed1ab_1
- pickleshare=0.7.5=pyhd8ed1ab_1004
- pillow=11.1.0=py312h80c1187_0
- pip=24.3.1=pyh8b19718_2
- pixman=0.44.2=h29eaf8c_0
- pkg-config=0.29.2=h4bc722e_1009
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_2
- platformdirs=4.3.6=pyhd8ed1ab_1
- prometheus_client=0.21.1=pyhd8ed1ab_0
- prompt-toolkit=3.0.50=pyha770c72_0
- prompt_toolkit=3.0.50=hd8ed1ab_0
- psutil=6.1.1=py312h66e93f0_0
- pthread-stubs=0.4=hb9d3cd8_1002
- ptyprocess=0.7.0=pyhd8ed1ab_1
- pugixml=1.14=h59595ed_0
- pure_eval=0.2.3=pyhd8ed1ab_1
- pycparser=2.22=pyh29332c3_1
- pygments=2.19.1=pyhd8ed1ab_0
- pyparsing=3.2.1=pyhd8ed1ab_0
- pyside6=6.8.1=py312h91f0f75_0
- pysocks=1.7.1=pyha55dd90_7
- python=3.12.8=h9e4cc4f_1_cpython
- python-dateutil=2.9.0.post0=pyhff2d567_1
- python-fastjsonschema=2.21.1=pyhd8ed1ab_0
- python-json-logger=2.0.7=pyhd8ed1ab_0
- python_abi=3.12=5_cp312
- pytz=2024.2=pyhd8ed1ab_1
- pyyaml=6.0.2=py312h178313f_2
- pyzmq=26.2.0=py312hbf22597_3
- qhull=2020.2=h434a139_5
- qt6-main=6.8.1=h588cce1_2
- rdma-core=55.0=h5888daf_0
- readline=8.2=h8228510_1
- referencing=0.36.1=pyhd8ed1ab_0
- requests=2.32.3=pyhd8ed1ab_1
- rfc3339-validator=0.1.4=pyhd8ed1ab_1
- rfc3986-validator=0.1.1=pyh9f0ad1d_0
- rpds-py=0.22.3=py312h12e396e_0
- scalapack=2.2.0=h7e29ba8_4
- scipy=1.15.1=py312h180e4f1_0
- send2trash=1.8.3=pyh0d859eb_1
- setuptools=75.8.0=pyhff2d567_0
- six=1.17.0=pyhd8ed1ab_0
- slepc=3.22.2=real_h754b140_300
- slepc4py=3.22.2=py312h377abe1_0
- sniffio=1.3.1=pyhd8ed1ab_1
- soupsieve=2.5=pyhd8ed1ab_1
- spdlog=1.14.1=hed91bc2_1
- stack_data=0.6.3=pyhd8ed1ab_1
- superlu=5.2.2=h00795ac_0
- superlu_dist=9.1.0=h0804ebd_0
- sysroot_linux-64=2.17=h0157908_18
- terminado=0.18.1=pyh0d859eb_0
- tinycss2=1.4.0=pyhd8ed1ab_0
- tk=8.6.13=noxft_h4845f30_101
- tomli=2.2.1=pyhd8ed1ab_1
- tornado=6.4.2=py312h66e93f0_0
- traitlets=5.14.3=pyhd8ed1ab_1
- types-python-dateutil=2.9.0.20241206=pyhd8ed1ab_0
- typing-extensions=4.12.2=hd8ed1ab_1
- typing_extensions=4.12.2=pyha770c72_1
- typing_utils=0.1.0=pyhd8ed1ab_1
- tzdata=2025a=h78e105d_0
- ucx=1.18.0=h53fb5aa_0
- unicodedata2=16.0.0=py312h66e93f0_0
- uri-template=1.3.0=pyhd8ed1ab_1
- urllib3=2.3.0=pyhd8ed1ab_0
- wayland=1.23.1=h3e06ad9_0
- wcwidth=0.2.13=pyhd8ed1ab_1
- webcolors=24.11.1=pyhd8ed1ab_0
- webencodings=0.5.1=pyhd8ed1ab_3
- websocket-client=1.8.0=pyhd8ed1ab_1
- wheel=0.45.1=pyhd8ed1ab_1
- widgetsnbextension=4.0.13=pyhd8ed1ab_1
- xcb-util=0.4.1=hb711507_2
- xcb-util-cursor=0.1.5=hb9d3cd8_0
- xcb-util-image=0.4.0=hb711507_2
- xcb-util-keysyms=0.4.1=hb711507_0
- xcb-util-renderutil=0.3.10=hb711507_0
- xcb-util-wm=0.4.2=hb711507_0
- xkeyboard-config=2.43=hb9d3cd8_0
- xorg-libice=1.1.2=hb9d3cd8_0
- xorg-libsm=1.2.5=he73a12e_0
- xorg-libx11=1.8.10=h4f16b4b_1
- xorg-libxau=1.0.12=hb9d3cd8_0
- xorg-libxcomposite=0.4.6=hb9d3cd8_2
- xorg-libxcursor=1.2.3=hb9d3cd8_0
- xorg-libxdamage=1.1.6=hb9d3cd8_0
- xorg-libxdmcp=1.1.5=hb9d3cd8_0
- xorg-libxext=1.3.6=hb9d3cd8_0
- xorg-libxfixes=6.0.1=hb9d3cd8_0
- xorg-libxi=1.8.2=hb9d3cd8_0
- xorg-libxrandr=1.5.4=hb9d3cd8_0
- xorg-libxrender=0.9.12=hb9d3cd8_0
- xorg-libxtst=1.2.5=hb9d3cd8_3
- xorg-libxxf86vm=1.1.6=hb9d3cd8_0
- yaml=0.2.5=h7f98852_2
- zeromq=4.3.5=h3b0a872_7
- zfp=0.5.5=h9c3ff4c_8
- zipp=3.21.0=pyhd8ed1ab_1
- zlib-ng=2.2.3=h7955e40_0
- zstandard=0.23.0=py312hef9b889_1
- zstd=1.5.6=ha6fb4c9_0

For me it only runs on a single process (when inspecting the processor activity).

Could you set OMP_NUM_THREADS=1 as an environment variable to ensure that there is no threading?
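
For instance (a sketch; the variable presumably has to be set before PETSc and its BLAS back end initialise, so either export it in the shell before launching Python, or set it at the very top of the script):

import os

# Must come before importing petsc4py/dolfinx so the threading libraries pick it up
os.environ["OMP_NUM_THREADS"] = "1"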

Conda creates isolated environments, so there shouldn’t be any cross-environment issues.

It seems like an issue with the particular conda-build of PETSc for your platform.

Setting that solves it. That variable wasn’t set before. Interestingly, using your script I still see an improvement at 8 cores, but I have 12 in total. Maybe you were maxing yours out.

The behaviour is then improved for GMRES without AMG too. I can control the way it solves with mpirun.

I am running Miniforge, so conda-forge is the default channel for me. I wouldn’t have thought they would do that. After all, you are also installing all your dependencies via conda-forge, right?

Thanks for pointing this out! This actually highlights an issue in DOLFINx. These settings are being set behind the scenes here insidiously:

This overwrites the default behaviour of PETSc and should be documented or removed.


Yes, I use mamba and conda-forge channels.

I was running other things on my computer while checking this, so that might influence the result.