Parallel computation of quadrature rules

Hello,

I am trying to parallelize a snippet of code that computes quadrature rules for CutFEM in every cell. I understand that FEniCS automatically partitions the mesh when the script is run with mpirun -n nproc. My question is how to build a global list of quadrature points and weights from the local ones, i.e. how to restore the global cell indexing once the quadrature rules have been computed. The code is below; it does not yet take the partitioning into account.

Thanks a lot in advance.

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
  
if (rank == 0):
    start_rows, num_rank_cells = assign_cells_to_processors(size, num_cells)
else:
    start_rows, num_rank_cells = None, None
    
start_rows = comm.bcast(start_rows, root=0)
num_rank_cells = comm.bcast(num_rank_cells, root=0)

for idx in range(num_rank_cells[rank]):
    cell_idx = idx + start_rows[rank]
    dofs = dofmap.links(cell_idx)
    for j in range(num_loc_vertices):
        cell_coords[j] = x[dofs[j], 0:gdim]
    if gdim == 2:
        xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1])])
        xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1])])
    else:
        xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1]), min(cell_coords[:, 2])])
        xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1]), max(cell_coords[:, 2])])

    # call to C++ function for qr
    cppyy.gbl.run(
        xmin, xmax, int(num_rank_cells[rank]), domain, idx,
        int(cell_idx), degree, do_verbose, do_map, do_scale
    )

Please make a minimal reproducible example. The code above has several unspecified functions, variables and imports, and it is not clear from the question whether you use legacy dolfin or dolfinx.

Hello @dokken,

Thanks for your answer. You are right, sorry; my idea was to provide pseudocode, since the function above relies on definitions from other files. A minimal reproducible example is below. I am using dolfinx, not legacy dolfin.

import dolfinx
from mpi4py import MPI
import numpy as np
import time
import cppyy

def assign_cells_to_processors(psize, datasize):
    if (psize == 1):
        start_rows, num_rank_cells = [0], [datasize]
    else:
        start_rows, num_rank_cells = [], []
        N_loc = datasize // psize
        start_rows.append(0)
        num_rank_cells.append(N_loc)
        i0 = N_loc
        for _ in range(1, psize-1):
            start_rows.append(i0)
            num_rank_cells.append(N_loc)
            i0 += N_loc
        start_rows.append(i0)
        num_rank_cells.append(datasize-i0)

    return start_rows, num_rank_cells

# Main function
def generate_qr(mesh, NN, degree, domain, opts=[]):
    """degree specifies the degree of the underlying one-dimensional
    Gaussian quadrature scheme and must satisfy 1 <= qo && qo <= 10.
    """
    # include algoim and hppfile for level set
    # cppyy.include(hppfile)

    do_map = True
    do_scale = True
    do_verbose = False

    gdim = mesh.geometry.dim
    num_cells = np.prod(NN)

    num_loc_vertices = 2**gdim
    dofmap = mesh.geometry.dofmap
    x = mesh.geometry.x
    cell_coords = np.zeros((num_loc_vertices, gdim))

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()
    
    if (rank == 0):
        start_rows, num_rank_cells = assign_cells_to_processors(size, num_cells)
    else:
        start_rows, num_rank_cells = None, None
        
    start_rows = comm.bcast(start_rows, root=0)
    num_rank_cells = comm.bcast(num_rank_cells, root=0)

    for idx in range(num_rank_cells[rank]):
        cell_idx = idx + start_rows[rank]
        dofs = dofmap.links(cell_idx)
        for j in range(num_loc_vertices):
            cell_coords[j] = x[dofs[j], 0:gdim]
        if gdim == 2:
            xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1])])
            xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1])])
        else:
            xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1]), min(cell_coords[:, 2])])
            xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1]), max(cell_coords[:, 2])])

        # cppyy.gbl.run(
        #     xmin, xmax, int(num_rank_cells[rank]), domain, idx, int(cell_idx), degree, do_verbose, do_map, do_scale
        # )

    # TODO communication between processors (bcast)

# Domain
start_time = time.time()

# Domain and mesh
xmin = np.array([0, 0, 0])
xmax = np.array([1, 1, 1])
N = 10
phi = "schoen-gyroid"

gdim = len(xmin)

if gdim == 2:
    cell_type = dolfinx.mesh.CellType.quadrilateral
    mesh_generator = dolfinx.mesh.create_rectangle
else:
    cell_type = dolfinx.mesh.CellType.hexahedron
    mesh_generator = dolfinx.mesh.create_box

NN = np.array([N] * gdim, dtype=np.int32)
mesh = mesh_generator(MPI.COMM_WORLD, np.array([xmin, xmax]), NN, cell_type)

# Generate qr
degree = 1
generate_qr(mesh, NN, degree, phi, {})

I was mainly interested in understanding whether it is possible to parallelize the computation of CutFEM quadrature rules over multiple processes. Since the mesh cells are already divided among the processes, is this actually feasible?

If you can compute the quadrature for an individual cell, given its vertices, it should be easy to parallelize this.
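
An untested sketch of what I mean, assuming the same dolfinx version as in your snippet (where mesh.geometry.dofmap is an adjacency list with a links() method): each rank only loops over the cells it owns, so no manual assignment of cells to processes is needed.

import dolfinx
import numpy as np
from mpi4py import MPI

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.quadrilateral)
tdim = mesh.topology.dim
gdim = mesh.geometry.dim

# Each process loops only over its locally owned cells
num_owned_cells = mesh.topology.index_map(tdim).size_local
x = mesh.geometry.x
dofmap = mesh.geometry.dofmap

for cell in range(num_owned_cells):
    cell_coords = x[dofmap.links(cell), :gdim]
    xmin = cell_coords.min(axis=0)  # bounding box of the cell
    xmax = cell_coords.max(axis=0)
    # compute the quadrature rule for this cell here, e.g. with your cppyy call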

I am currently getting a strange error when calling dofmap.links() for a certain cell.

Besides this, if you look at the code, I am currently splitting the cells among the processes before starting the loop. I wonder whether this is redundant, since dolfinx already partitions the mesh automatically, so that I could enter the loop without this prior step.

Thank you.

I’ll have a look at your code later today.

Thanks a lot @dokken, I'll wait to hear from you. I might also need your help on another thread that I opened yesterday: Ordering of cells via mesh.topology.original_cell_index.

Your code doesn’t quite make sense to me, as it seems like you are trying to assign cells to processes after the partitioning has been done by dolfinx.

See my reply in your other post for context. It would greatly help if you could sketch out what you expect as input to your algorithm.
You could also have a look at:
https://jsdokken.com/dolfinx_docs/meshes.html
and
https://scientificcomputing.github.io/mpi-tutorial/notebooks/dolfinx_MPI_tutorial.html
for parallel specific tutorials.

Hi @dokken,

Thanks for all of your help.

I would like to divide the computation of the quadrature rules for CutFEM (done through the C++ wrapper cppyy) among the processes. Afterwards I could simply gather all the local information into a single stacked list of quadrature rules, but I am not sure how this would be implemented.

You are right that it is pointless to split cells among processes if dolfinx already does this automatically. However, it seems to me that the local numbering on each rank is quite unrelated to the global one. Which command gives me the map between the two?

Furthermore, I might need to bring everything onto a single rank later during execution, because the matrix assembly in my case (not done within dolfinx, which does not yet allow runtime quadrature rules) is not necessarily parallelizable.

Let me know what you think of the whole situation.

Thanks.

The ordering of cells in DOLFINx is determined by a graph partitioner (SCOTCH, KaHIP or ParMETIS), which uses the mesh connectivity to decide which process should own which cells.

The vertices are also renumbered, to ensure data-locality and efficient parallel computing.

For mapping input mesh nodes see for instance:

More about the mapping of data has also been written in: Gather solutions in parallel in FEniCSX - #2 by dokken
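
For cells specifically, mesh.topology.original_cell_index gives, for each local cell, its index in the input (pre-partitioning) mesh. A rough, untested sketch:

import dolfinx
from mpi4py import MPI

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.quadrilateral)
tdim = mesh.topology.dim

# Owned cells come first in the local numbering, followed by ghosts
num_owned_cells = mesh.topology.index_map(tdim).size_local

# original_cell_index[c] is the index of local cell c in the input mesh
original_index = mesh.topology.original_cell_index[:num_owned_cells]
print(mesh.comm.rank, original_index)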

Hello @dokken,

Thanks for your answer.

I was able to find the bug in the dofmap.links() call above, so the code now runs in parallel and gives decent speedups.

What I am missing now is a map between the local cells on each process and the global cells of the mesh (or perhaps of a serial mesh that lives only on the root process). I have been looking at Gather solutions in parallel in FEniCSX - #2 by dokken, but it is not clear to me how you construct this map.

The code is below.

import dolfinx
from mpi4py import MPI
import numpy as np
import cppyy

def get_num_entities(mesh, tdim):
    # Create all connectivities manually (there used to be a
    # create_connectivity_all function)
    for d0 in range(tdim):
        for d1 in range(tdim):
            mesh.topology.create_connectivity(d0, d1)
    num_owned_entities = mesh.topology.index_map(tdim).size_local
    num_ghost_entities = mesh.topology.index_map(tdim).num_ghosts
    num_entities = num_owned_entities + num_ghost_entities
    return num_entities


def get_num_cells(mesh):
    tdim = mesh.topology.dim
    return get_num_entities(mesh, tdim)


cppyy.cppdef(
    """
    void* double_array(const std::vector<double>& v) {
      double* pf = (double*)malloc(sizeof(double)*v.size());
      for (std::size_t i = 0; i < v.size(); ++i) pf[i] = v[i];
      return pf;
    }"""
)

cppyy.cppdef(
    """
    void* int_array(const std::vector<int>& v) {
      int* pf = (int*)malloc(sizeof(int)*v.size());
      for (std::size_t i = 0; i < v.size(); ++i) pf[i] = v[i];
      return pf;
    }"""
)


def create_double_array(vec):
    array = cppyy.gbl.double_array(vec)
    v = np.frombuffer(array, dtype=np.float64, count=len(vec))
    return v


def create_int_array(vec):
    array = cppyy.gbl.int_array(vec)
    v = np.frombuffer(array, dtype=np.int32, count=len(vec))
    return v


def create_list_of_arrays(v):
    w = [None] * len(v)
    for i in range(len(v)):
        w[i] = create_double_array(v[i])
    return w


# Main function
def generate_qr(mesh, NN, domain, degree):
    cppyy.add_include_path("/usr/include/algoim/algoim")
    if domain in ["square", "circle", "sphere"]:
        hppfile = f"utils/{domain}.hpp"
    else:  # it is a gyroid; the specific struct is specified later
        hppfile = "utils/gyroids.hpp"

    cppyy.include(hppfile)
    do_map = True
    do_scale = True
    do_verbose = False

    gdim = mesh.geometry.dim
    num_cells = get_num_cells(mesh)

    num_loc_vertices = 2**gdim
    dofmap = mesh.geometry.dofmap
    x = mesh.geometry.x
    cell_coords = np.zeros((num_loc_vertices, gdim))

    t = dolfinx.common.Timer()
    
    for idx in range(num_cells):
        dofs = dofmap.links(idx)
        for j in range(num_loc_vertices):
            cell_coords[j] = x[dofs[j], 0:gdim]
        if gdim == 2:
            xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1])])
            xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1])])
        else:
            xmin = np.array([min(cell_coords[:, 0]), min(cell_coords[:, 1]), min(cell_coords[:, 2])])
            xmax = np.array([max(cell_coords[:, 0]), max(cell_coords[:, 1]), max(cell_coords[:, 2])])
    
        cppyy.gbl.run(
            xmin, xmax, int(num_cells), domain, gdim, idx, int(idx), degree, do_verbose, do_map, do_scale
        )
    
    qr_pts_ = create_list_of_arrays(cppyy.gbl.get_qr_pts())
    qr_w_ = create_list_of_arrays(cppyy.gbl.get_qr_w())
    qr_pts_bdry_ = create_list_of_arrays(cppyy.gbl.get_qr_pts_bdry())
    qr_w_bdry_ = create_list_of_arrays(cppyy.gbl.get_qr_w_bdry())
    qr_n_ = create_list_of_arrays(cppyy.gbl.get_qr_n())
    
    cut_cells_ = create_int_array(cppyy.gbl.get_cut_cells())
    uncut_cells_ = create_int_array(cppyy.gbl.get_uncut_cells())
    outside_cells_ = create_int_array(cppyy.gbl.get_outside_cells())

    imap = mesh.topology.index_map(2)
    print(imap.local_range)
    # local_range = np.asarray(imap.local_range, dtype=np.int32) * dofmap.index_map_bs
    # ranges = mesh.comm.gather(local_range, root=0)

    # if mesh.comm.rank == 0:
    #     print(ranges)

    qr_pts = mesh.comm.gather(qr_pts_, root=0)
    qr_w = mesh.comm.gather(qr_w_, root=0)
    qr_pts_bdry = mesh.comm.gather(qr_pts_bdry_, root=0)
    qr_w_bdry = mesh.comm.gather(qr_w_bdry_, root=0)
    qr_n = mesh.comm.gather(qr_n_, root=0)
    cut_cells = mesh.comm.gather(cut_cells_, root=0)
    uncut_cells = mesh.comm.gather(uncut_cells_, root=0)
    outside_cells = mesh.comm.gather(outside_cells_, root=0)
    if (mesh.comm.rank == 0):
        qr_pts = [item for sublist in qr_pts for item in sublist]
        qr_w = [item for sublist in qr_w for item in sublist]
        qr_pts_bdry = [item for sublist in qr_pts_bdry for item in sublist]
        qr_w_bdry = [item for sublist in qr_w_bdry for item in sublist]
        qr_n = [item for sublist in qr_n for item in sublist]
        cut_cells = [item for sublist in cut_cells for item in sublist]
        uncut_cells = [item for sublist in uncut_cells for item in sublist]
        outside_cells = [item for sublist in outside_cells for item in sublist]

        print(f"Time for getting quadrature rule {t.elapsed()[0]}")

# Domain and mesh
xmin = np.array([0, 0])
xmax = np.array([1, 1])
N = 128
phi = "schoen-gyroid"

gdim = len(xmin)

if gdim == 2:
    cell_type = dolfinx.mesh.CellType.quadrilateral
    mesh_generator = dolfinx.mesh.create_rectangle
else:
    cell_type = dolfinx.mesh.CellType.hexahedron
    mesh_generator = dolfinx.mesh.create_box

NN = np.array([N] * gdim, dtype=np.int32)
mesh = mesh_generator(MPI.COMM_WORLD, np.array([xmin, xmax]), NN, cell_type, ghost_mode=dolfinx.cpp.mesh.GhostMode.none)
generate_qr(mesh, NN, phi, 1)

I’ve already shown you this map in: Ordering of cells via mesh.topology.original_cell_index - #2 by dokken

But how can I do this on a serial mesh that is ordered again, so that I can continue with the rest of the code in serial?

That is what the post referenced above does: Gather solutions in parallel in FEniCSX - #2 by dokken
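
In short, that pattern amounts to gathering the per-cell data together with the original cell indices on one rank. A rough, untested sketch (using dummy per-cell arrays in place of your quadrature data):

import dolfinx
import numpy as np
from mpi4py import MPI

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.quadrilateral)
tdim = mesh.topology.dim
num_owned_cells = mesh.topology.index_map(tdim).size_local
original_index = mesh.topology.original_cell_index[:num_owned_cells]

# One dummy entry per owned cell; replace with your qr points/weights per cell
local_data = [np.full(2, mesh.comm.rank, dtype=np.float64) for _ in range(num_owned_cells)]

gathered_data = mesh.comm.gather(local_data, root=0)
gathered_index = mesh.comm.gather(original_index, root=0)

if mesh.comm.rank == 0:
    num_global_cells = mesh.topology.index_map(tdim).size_global
    global_data = [None] * num_global_cells
    for data, indices in zip(gathered_data, gathered_index):
        for entry, glob in zip(data, indices):
            # Place each cell's data at its position in the input mesh ordering
            global_data[glob] = entry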

I don't really have the bandwidth to go through it and add more documentation.