How to best utilize MPI and CUDA

What is the recommended approach when you have one CUDA graphics card and a multi-core processor? Is the idea of distributing DOLFINx across multiple processes with MPI while using PETSc's CUDA-supported solvers on the same GPU reasonable, or will sharing the same GPU be a problem (due to communication, memory, or efficiency)? Or is it pointless and would it be worse than a single process using the one CUDA card?
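
For concreteness, the kind of setup I have in mind looks roughly like the sketch below: a host-assembled operator converted to PETSc's cuSPARSE format and solved on the GPU. The type names ("cuda", "aijcusparse") are real PETSc types, but this is only an illustration of one possible approach, and it requires a CUDA-enabled PETSc build.

    # Minimal petsc4py sketch: assemble a 1D Laplacian on the host, move it
    # to the GPU as a cuSPARSE matrix, and solve with CG + Jacobi.
    # Requires PETSc configured with CUDA support.
    from petsc4py import PETSc

    n = 1000
    A = PETSc.Mat().createAIJ([n, n], nnz=(3, 1))
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)
        if i > 0:
            A.setValue(i, i - 1, -1.0)
        if i < n - 1:
            A.setValue(i, i + 1, -1.0)
    A.assemble()

    A = A.convert("aijcusparse")  # move the operator to the GPU
    x, b = A.createVecs()         # vectors inherit a CUDA-compatible type
    b.set(1.0)

    ksp = PETSc.KSP().create(A.getComm())
    ksp.setOperators(A)
    ksp.setType("cg")
    ksp.getPC().setType("jacobi")  # a GPU-friendly preconditioner
    ksp.solve(b, x)
    print("iterations:", ksp.getIterationNumber())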


Are you aware of this package: GitHub - bpachev/cuda-dolfinx (GPU acceleration extension for FEniCSx)?

I’ve not yet experimented with it myself, but if I remember correctly, the author showed at FEniCS 2025 that one CUDA card significantly outperforms a full node of MPI processes, hinting that trying to combine the two may be futile.

But I am really no expert. @bpachev, willing to comment?


@usiu5555 The referenced package (of which I’m the author) extends DOLFINx to enable GPU-accelerated assembly, in addition to using PETSc’s CUDA-supported solvers. It turns out that in most cases assembly is actually sped up more by the GPU than the linear solve is. The speedup factor depends on the number of DOFs, the element order, and the GPU, but it is fairly common to see one GPU card attain performance equivalent to several hundred CPU cores. My package also supports multi-GPU acceleration, in case the problem is too large to fit on a single card.
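
Roughly, usage looks like the sketch below: you build your mesh and UFL form with DOLFINx as usual, then hand the form to a CUDA assembler. The cudolfinx names here are illustrative and may not match the current API exactly; treat the repository README as the authoritative example.

    # Illustrative sketch of GPU-side assembly with cuda-dolfinx. The
    # cudolfinx names (CUDAAssembler, form, assemble_matrix) are assumed
    # and may differ between versions -- see the repository README.
    from mpi4py import MPI
    import ufl
    from dolfinx import fem, mesh
    import cudolfinx as cufem

    msh = mesh.create_unit_square(MPI.COMM_WORLD, 64, 64)
    V = fem.functionspace(msh, ("Lagrange", 1))
    u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
    a = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx

    asm = cufem.CUDAAssembler()         # assumed: one assembler per process
    cuda_form = cufem.form(a)           # assumed: compile the form for the GPU
    A = asm.assemble_matrix(cuda_form)  # assumed: assembly runs on the device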

It is possible for multiple PETSc MPI processes to share the same GPU. This adds overhead, which can become substantial as the number of processes increases. It is also possible to use PETSc’s CUDA solvers with regular DOLFINx, but you incur many copies of the matrices and vectors between device and host memory. Using cuda-dolfinx with one MPI process per GPU solves both problems :).
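
A common way to get that one-process-per-GPU mapping is to pin each rank to a device before anything initializes CUDA, for example by setting CUDA_VISIBLE_DEVICES from the launcher's local-rank variable. A minimal sketch follows; the environment variable name depends on your MPI launcher (OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, SLURM_LOCALID under Slurm).

    # Pin each MPI rank to one GPU *before* importing mpi4py/petsc4py,
    # so every rank sees exactly one device (as device 0).
    import os

    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK",
                                    os.environ.get("SLURM_LOCALID", "0")))
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    from mpi4py import MPI       # import after setting the device mask
    from petsc4py import PETSc   # PETSc now sees a single GPU per rank

    print(f"rank {MPI.COMM_WORLD.rank} -> GPU {local_rank}")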

I’m happy to help answer any questions you may have about installing or using cuda-dolfinx.

All the best,

Benjamin


Dear Benjamin, thank you for the detailed answer. If I understood your explanations correctly:

  • one GPU + multiple nodes: can work with cuda-dolfinx for a small number of MPI processes; not recommended with standard DOLFINx due to frequent data transfers.
  • one GPU per node: works with both; cuda-dolfinx additionally extends GPU usage from the solver to the assembly operations.
  • multiple GPUs + one node: supported by cuda-dolfinx.

I will be dealing with the one GPU + multiple nodes case sometime early next year; I will try to install and use your package and will contact you if problems arise.

Best,
Szymon

By “one GPU + multiple nodes”, do you mean multiple machines, only one of which has a GPU card present? This will not work with CUDA PETSc solvers or cuDOLFINx. Each MPI process needs to be able to see a GPU (not necessarily unique).

Both PETSc and cuDOLFINx support using multiple GPUs (either on a single machine or across multiple machines). You can have more MPI processes than GPUs, but all of the MPI processes must be able to see a GPU, which requires being on the same node as a GPU.

All the best,

Benjamin


Sorry for the late reply; by multiple nodes I meant the case where all MPI processes have access to the same GPU, specifically a workstation with one GPU and a multicore processor (as far as my understanding goes, multiple cores and multiple machines look the same from MPI’s point of view). Thank you for the clarification.

Best,
Szymon
