How to best utilize MPI and CUDA

What is the recommended approach when you have one CUDA graphics card and a multi-core processor? Is the idea of distributing DOLFINx across multiple processes with MPI while using PETSc's CUDA-supported solvers on the same GPU reasonable, or will sharing the same GPU be a problem (due to communication, memory, or efficiency)? Or is it pointless and would it be worse than a single process using the one CUDA card?
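
For concreteness, the kind of setup I have in mind looks roughly like the sketch below: a host-assembled operator converted to PETSc's cuSPARSE format and solved on the GPU. The type names ("cuda", "aijcusparse") are real PETSc types, but this is only an illustration of one possible approach, and it requires a CUDA-enabled PETSc build.

    # Minimal petsc4py sketch: assemble a 1D Laplacian on the host, move it
    # to the GPU as a cuSPARSE matrix, and solve with CG + Jacobi.
    # Requires PETSc configured with CUDA support.
    from petsc4py import PETSc

    n = 1000
    A = PETSc.Mat().createAIJ([n, n], nnz=(3, 1))
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)
        if i > 0:
            A.setValue(i, i - 1, -1.0)
        if i < n - 1:
            A.setValue(i, i + 1, -1.0)
    A.assemble()

    A = A.convert("aijcusparse")  # move the operator to the GPU
    x, b = A.createVecs()         # vectors inherit a CUDA-compatible type
    b.set(1.0)

    ksp = PETSc.KSP().create(A.getComm())
    ksp.setOperators(A)
    ksp.setType("cg")
    ksp.getPC().setType("jacobi")  # a GPU-friendly preconditioner
    ksp.solve(b, x)
    print("iterations:", ksp.getIterationNumber())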


Are you aware of this package: GitHub - bpachev/cuda-dolfinx (GPU acceleration extension for FEniCSx)?

I’ve not yet experimented with it myself, but if I remember correctly, the author showed at FEniCS 2025 that one CUDA card significantly outperforms a full node of MPI processes, hinting that trying to combine the two may be futile.

But I am really no expert. @bpachev, willing to comment?


@usiu5555 The referenced package (of which I’m the author) extends DOLFINx to enable GPU-accelerated assembly, in addition to using PETSc’s CUDA-supported solvers. It turns out that in most cases assembly is actually sped up more by the GPU than the linear solve is. The speedup factor depends on the number of DOFs, the element order, and the GPU, but it is fairly common to see one GPU card attain performance equivalent to several hundred CPU cores. My package also supports multi-GPU acceleration, in case the problem is too large to fit on a single card.
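
Roughly, usage looks like the sketch below: you build your mesh and UFL form with DOLFINx as usual, then hand the form to a CUDA assembler. The cudolfinx names here are illustrative and may not match the current API exactly; treat the repository README as the authoritative example.

    # Illustrative sketch of GPU-side assembly with cuda-dolfinx. The
    # cudolfinx names (CUDAAssembler, form, assemble_matrix) are assumed
    # and may differ between versions -- see the repository README.
    from mpi4py import MPI
    import ufl
    from dolfinx import fem, mesh
    import cudolfinx as cufem

    msh = mesh.create_unit_square(MPI.COMM_WORLD, 64, 64)
    V = fem.functionspace(msh, ("Lagrange", 1))
    u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
    a = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx

    asm = cufem.CUDAAssembler()         # assumed: one assembler per process
    cuda_form = cufem.form(a)           # assumed: compile the form for the GPU
    A = asm.assemble_matrix(cuda_form)  # assumed: assembly runs on the device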

It is possible for multiple PETSc MPI processes to share the same GPU. This adds overhead, which can become substantial as the number of processes increases. It is also possible to use PETSc’s CUDA solvers with regular DOLFINx, but you incur many copies of the matrices and vectors between device and host memory. Using cuda-dolfinx with one MPI process per GPU solves both problems :).
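
A common way to get that one-process-per-GPU mapping is to pin each rank to a device before anything initializes CUDA, for example by setting CUDA_VISIBLE_DEVICES from the launcher's local-rank variable. A minimal sketch follows; the environment variable name depends on your MPI launcher (OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, SLURM_LOCALID under Slurm).

    # Pin each MPI rank to one GPU *before* importing mpi4py/petsc4py,
    # so every rank sees exactly one device (as device 0).
    import os

    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK",
                                    os.environ.get("SLURM_LOCALID", "0")))
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    from mpi4py import MPI       # import after setting the device mask
    from petsc4py import PETSc   # PETSc now sees a single GPU per rank

    print(f"rank {MPI.COMM_WORLD.rank} -> GPU {local_rank}")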

I’m happy to help answer any questions you may have about installing or using cuda-dolfinx.

All the best,

Benjamin


Dear Benjamin, thank you for the detailed answer. If I understood your explanations correctly:

  • one GPU + multiple nodes: can work with cuda-dolfinx for a small number of MPI processes; not recommended with standard DOLFINx due to frequent data transfers.
  • one GPU per node: works with both; cuda-dolfinx additionally extends GPU usage from the solver to the assembly operations.
  • multiple GPUs + one node: supported by cuda-dolfinx.

I will be dealing with the one GPU + multiple nodes case sometime early next year; I will try to install and use your package and will contact you if problems arise.

Best,
Szymon

By “one GPU + multiple nodes”, do you mean multiple machines, only one of which has a GPU card present? This will not work with CUDA PETSc solvers or cuDOLFINx. Each MPI process needs to be able to see a GPU (not necessarily unique).

Both PETSc and cuDOLFINx support using multiple GPUs (either on a single machine or across multiple machines). You can have more MPI processes than GPUs, but all of the MPI processes must be able to see a GPU, which requires being on the same node as a GPU.

All the best,

Benjamin


Sorry for the late reply; by multiple nodes I meant the case where all MPI processes have access to the same GPU, specifically a workstation with one GPU and a multicore processor (as far as my understanding goes, multiple cores and multiple machines look the same from MPI’s point of view). Thank you for the clarification.

Best,
Szymon
