Problems parallelizing a mesh generated using the Gmsh Python API

Hi all, I am having problems parallelizing my case and I don't know what happened. I can't go over 4 ranks, and if I print the DOFs on each rank, I can see that they are not distributed correctly, leaving some ranks with an empty mesh. See my MWE.

import os
import gmsh
from dolfinx import fem, io
from mpi4py import MPI


# --------- #
# CONSTANTS #
# --------- #

MPI_COMM = MPI.COMM_WORLD
CURRENT_FOLDER = os.path.dirname(__file__)
os.chdir(CURRENT_FOLDER)

height = 150  # m
diameter = 6.5  # m
thickness = 0.055 # 55mm

# ------- #
# MESHING #
# ------- #
FIXED_TAG = 1000
BEAM_SURFACE_TAG = 2000
VOLUME_TAG = 4000
ELEMENTS_ORDER = 2
VECTOR_SPACE_DEGREE = 2


def gmsh_tower(h: float, d: float) -> gmsh.model:
    gmsh.initialize()
    gmsh.option.setNumber("General.Terminal", 0)
    model = gmsh.model()

    # Recombine triangles into quads so the extrusion produces hexahedra
    gmsh.option.setNumber("Mesh.RecombinationAlgorithm", 2)
    gmsh.option.setNumber("Mesh.RecombineAll", 2)
    gmsh.option.setNumber("Mesh.CharacteristicLengthFactor", 0.02)

    outer_circle = model.occ.addDisk(0, 0, 0, d / 2, d / 2)
    inner_circle = model.occ.addDisk(0, 0, 0, d / 2 - thickness, d / 2 - thickness)

    outer = model.occ.extrude([(2, outer_circle)], 0, 0, h, numElements=[int(1.5 * h)], recombine=True)
    inner = model.occ.extrude([(2, inner_circle)], 0, 0, h, numElements=[int(1.5 * h)], recombine=True)

    outer_volume = outer[1][1]
    inner_volume = inner[1][1]
    model.occ.cut([(3, outer_volume)], [(3, inner_volume)])
    
    model.occ.synchronize()

    fixed_sf = model.addPhysicalGroup(2, [7], tag=FIXED_TAG)
    tip_sf = model.addPhysicalGroup(2, [6,8], tag=BEAM_SURFACE_TAG)
    vol = model.addPhysicalGroup(3, [1], tag=VOLUME_TAG)

    model.setPhysicalName(2, fixed_sf, "FIXED SURFACE")
    model.setPhysicalName(2, tip_sf, "LOAD SURFACE")
    model.setPhysicalName(3, vol, "Mesh volume")

    model.mesh.generate(3)
    model.mesh.setOrder(ELEMENTS_ORDER)  # set the element order after generating the mesh
    model.mesh.optimize()
    gmsh.write("mesh.msh")
    return model

model = gmsh_tower(h=height, d=diameter)
domain, cell_markers, facet_markers = io.gmshio.model_to_mesh(model, MPI_COMM, rank=0)
#domain, cell_markers, facet_markers = io.gmshio.read_from_msh("mesh.msh", MPI_COMM, 0, gdim=3)

# -------------- #
# Function Space #
# -------------- #
dim = domain.geometry.dim
V = fem.functionspace(domain, ("Lagrange", VECTOR_SPACE_DEGREE, (dim,)))
bs = V.dofmap.index_map_bs
num_dofs_local = V.dofmap.index_map.size_local
print(f"Rank {domain.comm.rank}, Block size {bs} Num local dofs {num_dofs_local*bs}")

Results:

->$: mpirun -np 2 python mesh.py 
Rank 0, Block size 3 Num local dofs 0
Rank 1, Block size 3 Num local dofs 920040
->$: mpirun -np 3 python mesh.py 
Rank 2, Block size 3 Num local dofs 0
Rank 0, Block size 3 Num local dofs 0
Rank 1, Block size 3 Num local dofs 920040
->$: mpirun -np 4 python mesh.py 
Rank 0, Block size 3 Num local dofs 0
Rank 2, Block size 3 Num local dofs 0
Rank 1, Block size 3 Num local dofs 462030
Rank 3, Block size 3 Num local dofs 458010

Going over 4 ranks, the script freezes and never returns. This is my DOLFINx version:

In [2]: dolfinx.__version__
Out[2]: '0.8.0'

And here is my lscpu output, which shows that I have enough cores to handle more than 4 ranks. I also tested on another PC with the same results:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   36
  On-line CPU(s) list:    0-35
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) W-2195 CPU @ 2.30GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   18
    Socket(s):            1
    Stepping:             4
    CPU(s) scaling MHz:   28%
    CPU max MHz:          4300.0000
    CPU min MHz:          1000.0000
    BogoMIPS:             4599.93
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs b
                          ts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_
                          deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase
                           tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cq
                          m_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi md_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    576 KiB (18 instances)
  L1i:                    576 KiB (18 instances)
  L2:                     18 MiB (18 instances)
  L3:                     24.8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-35
Vulnerabilities:          
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Mitigation; Clear CPU buffers; SMT vulnerable

I can’t seem to reproduce your error with the v0.8.0 build (docker run -ti dolfinx/dolfinx:v0.8.0):

root@fc0b80c61a50:~# python3 tower.py 
Rank 0, Block size 3 Num local dofs 920040
root@fc0b80c61a50:~# mpirun -np 2 python3 tower.py 
Rank 1, Block size 3 Num local dofs 458004
Rank 0, Block size 3 Num local dofs 462036
root@fc0b80c61a50:~# mpirun -np 3 python3 tower.py 
Rank 0, Block size 3 Num local dofs 309348
Rank 2, Block size 3 Num local dofs 311403
Rank 1, Block size 3 Num local dofs 299289
root@fc0b80c61a50:~# mpirun -np 4 python3 tower.py 
Rank 0, Block size 3 Num local dofs 232572
Rank 1, Block size 3 Num local dofs 225375
Rank 2, Block size 3 Num local dofs 228564
Rank 3, Block size 3 Num local dofs 233529
root@fc0b80c61a50:~# python3 -c "import dolfinx; print(dolfinx.__version__)"
0.8.0

@nate thanks for checking my script. Do you have any idea what could be happening here?

Unfortunately, no. The zero DoFs on 2 out of 4 processes in the 4-process run is bizarre. Do you see the same if you use one of the Docker images?

If I were to hazard a guess, there is something wrong in your mpi4py installation, e.g. you upgraded it after building DOLFINx, or you built it against one MPI implementation and are actually running with another.
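A quick way to check for that kind of mismatch is to compare what mpi4py reports at runtime with what mpirun reports; a minimal sketch using mpi4py's standard introspection helpers:

from mpi4py import MPI
import mpi4py

# MPI library that mpi4py is linked against at runtime
print(MPI.Get_library_version())

# Build-time configuration recorded when mpi4py was compiled (compiler, include/library paths)
print(mpi4py.get_config())

If the library reported here differs from the one shown by mpirun --version, the launcher and the Python processes are using different MPI implementations, which can produce this kind of inconsistent behaviour.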


Hi again @nate. No, I'm not using a Docker image because I'm doing FSI, so I need the preCICE library, which is not in the image, and I don't have much time (or knowledge) to create a Docker image with all my requirements. Also, I'm running on an HPC system, so what I did was create the easyconfigs for EasyBuild (the HPC's package manager).

@francesco-ballarin: I have rebuilt the whole toolchain from scratch and get the same result, so I don't know if there is some specific requirement on the MPI build for DOLFINx. Maybe there is, because I use the same setup on three PCs: one with Fedora, one with Ubuntu, and the HPC with Rocky.

from mpi4py import MPI

mpi_impl = MPI.get_vendor()
mpi_version = MPI.Get_version()
print(f"MPI implementation: {mpi_impl[0]}")
print(f"MPI version: {mpi_version[0]}.{mpi_version[1]}")

MPI implementation: Open MPI
MPI version: 3.1

and the MPI version:

$ mpirun --version

mpirun (Open MPI) 4.1.5
$ mpicc --version

Using built-in specs.
COLLECT_GCC=/home/efirvida/.local/easybuild/software/GCCcore/12.3.0/bin/gcc
COLLECT_LTO_WRAPPER=/home/efirvida/.local/easybuild/software/GCCcore/12.3.0/libexec/gcc/x86_64-pc-linux-gnu/12.3.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++,fortran --without-cuda-driver --enable-offload-targets=nvptx-none --enable-lto --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-plugins --enable-gold --enable-ld=default --prefix=/home/efirvida/.local/easybuild/software/GCCcore/12.3.0 --with-local-prefix=/home/efirvida/.local/easybuild/software/GCCcore/12.3.0 --enable-bootstrap --with-isl=/home/efirvida/.local/easybuild/build/GCCcore/12.3.0/system-system/gcc-12.3.0/stage2_stuff --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.3.0 (GCC)

One thing you can try is to move the mpi4py import to the top of all imports, as there are cases where PETSc wrongly initializes the number of threads used with MPI.

Please also note that, the way your code is currently written, a mesh is generated and written on every process, which is not optimal (but shouldn't influence the result).
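A rough sketch of both suggestions (mpi4py imported before anything that pulls in PETSc, and the Gmsh model built and written only on rank 0; the model-building calls from the MWE above are assumed to go where indicated):

# mpi4py must come before dolfinx/petsc4py imports
from mpi4py import MPI

import gmsh
from dolfinx import io

comm = MPI.COMM_WORLD

gmsh.initialize()
if comm.rank == 0:
    # ... build the tower geometry here, as in the MWE above ...
    gmsh.model.mesh.generate(3)
    gmsh.write("mesh.msh")  # only rank 0 writes the file

# model_to_mesh reads the Gmsh model on rank 0 and distributes the mesh to all ranks
domain, cell_markers, facet_markers = io.gmshio.model_to_mesh(gmsh.model, comm, rank=0, gdim=3)
gmsh.finalize()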

I tried this, with the same results:

from mpi4py import MPI
import numpy as np
from dolfinx import fem, mesh

comm = MPI.COMM_WORLD
WIDTH, HEIGHT = 1, 1
NX, NY = 100, 100

domain = mesh.create_rectangle(
    comm,
    [np.array([-WIDTH / 2, 0]), np.array([WIDTH / 2, HEIGHT])],
    [NX, NY],
    cell_type=mesh.CellType.quadrilateral,
)

V = fem.functionspace(domain, ("Lagrange", 2, (2,)))
bs = V.dofmap.index_map_bs
num_dofs_local = V.dofmap.index_map.size_local
print(f"Rank {domain.comm.rank}, Block size {bs} Num local dofs {num_dofs_local*bs}")

As I cannot reproduce this behaviour with docker, could it be that there is an issue with the partitioner that you are using?

What mesh partitioners have you installed?
The possibilities are shown at: Mesh creation in serial and parallel — FEniCSx Documentation

Yes, you are right, the problem is with the ParMETIS partitioner.

from mpi4py import MPI
import numpy as np
from dolfinx import fem, mesh
import dolfinx

comm = MPI.COMM_WORLD
WIDTH, HEIGHT = 1, 1
NX, NY = 100, 100


try:
    from dolfinx.graph import partitioner_scotch

    has_scotch = True
except ImportError:
    has_scotch = False
try:
    from dolfinx.graph import partitioner_kahip

    has_kahip = True
except ImportError:
    has_kahip = False
try:
    from dolfinx.graph import partitioner_parmetis

    has_parmetis = True
except ImportError:
    has_parmetis = False
print(f"{has_scotch=}  {has_kahip=} {has_parmetis=}")

This says that I have:

has_scotch=True  has_kahip=False has_parmetis=True

Then, if I use:

partitioner = dolfinx.mesh.create_cell_partitioner(partitioner_parmetis())

domain = mesh.create_rectangle(
    comm,
    [np.array([-WIDTH / 2, 0]), np.array([WIDTH / 2, HEIGHT])],
    [NX, NY],
    cell_type=mesh.CellType.quadrilateral,
    partitioner=partitioner,
)

the partitioning is wrong, but with partitioner_scotch:

partitioner = dolfinx.mesh.create_cell_partitioner(partitioner_scotch())

domain = mesh.create_rectangle(
    comm,
    [np.array([-WIDTH / 2, 0]), np.array([WIDTH / 2, HEIGHT])],
    [NX, NY],
    cell_type=mesh.CellType.quadrilateral,
    partitioner=partitioner,
)

The mesh is split evenly across all processes, and I can also go over 4 ranks now. Thanks again. It seems that ParMETIS is the default option in my build.
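For completeness, the same workaround should carry over to the original Gmsh-based MWE by passing the SCOTCH partitioner through gmshio. This is only a sketch: it assumes model_to_mesh in v0.8.0 accepts a partitioner keyword, and that model is the Gmsh model returned by gmsh_tower above.

from mpi4py import MPI

from dolfinx import io, mesh
from dolfinx.graph import partitioner_scotch

# Explicitly select SCOTCH instead of this build's ParMETIS default
partitioner = mesh.create_cell_partitioner(partitioner_scotch())

domain, cell_markers, facet_markers = io.gmshio.model_to_mesh(
    model, MPI.COMM_WORLD, rank=0, gdim=3, partitioner=partitioner
)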