Using fenics in cython to accelerate loops over mesh

Dear users,
I would like to report a small contribution to the community by showing an example of using cython to accelerate loops over mesh. This might be of interest for those who need to do such tasks without knowledge about c++ programming (someone like me), since looping over all elements of a mesh in python is slow.

Let me first introduce the context in which I needed to write a loop over mesh. Of course, the approach I took is by no mean optimal, and might be over complicated, and please suggest anything you think that is better.

Local quadratic interpolation

Let’s say we have a residual \rho(u_h,z-z_h), possibly nonlinear in u_h but linear in z-z_h, which is commonly used in goal-oriented error estimates and let’s say u_h and z_h are solutions of the Galerkin approximation of the variational problem involving u and z. We assume that u_h and z_h are piecewise linear continuous (using P1 finite element). The residual is not yet computable because there is z which is still an unknown quantity.
It is common to perform a local quadratic interpolation of z_h, denoted by \tilde{z}_h and compute and locate cell-wise the residual \rho(u_h,\tilde{z_h}-z_h). In fenics, once you have u_h,\tilde{z}_h and z_h, to locate cell wise \rho(u_h,\tilde{z_h}-z_h), you declare a test function t over the mesh in the space DG0 and use assemble_vector to the form \rho(u_h,t(\tilde{z_h}-z_h)). So the question is about computing \tilde{z}_h.
Now let’s define properly the local quadratic interpolation. Assume that we have a P1 function over a triangular mesh v_h, the local quadratic interpolation \tilde{v_h} is a DG2 function defined over the same mesh such that for all triangle T, the restriction of \tilde{v}_h over T interpolates v_h over 6 points : the 3 vertices of T and the 3 opposite vertices of the 3 surrounding triangles. We perform some necessary virtual extension if the triangle T has at least one exterior facet. The situation is explained nicely in the following picture from Hintermüller, 2014.

So in order to compute \tilde{t}_h we need to

for cell in mesh:
    collect_interpolation_points_and_values()
    build_linear_system() # a small 6 by 6 system to be solved
    solve_local_system() # find the 6 local dof of \tilde{t}_h

This loop is in general slow, because it is coded in python. By profiling my experiments, the local quad interpolation is the bottleneck of a solve-refine-solve iteration and it took almost 80% of the runtime for linear PDEs and mesh refinement. This loop can be avoided if we were able of assembling the following system:

\sum_T\sum_{x\in I(T)} \tilde{v}_h(x)w_h(x)=\sum_T\sum_{x\in I(T)} v_h(x)w_h(x),\forall w_h\in DG2

where I(T) is set of 6 local interpolation points of cell T, remark that not all the points of I(T) are inside of T, by the nature of local quadratic interpolation that we have defined. Moreover, by pulling back to the reference cell, I(T) are not all the same across all T because of the irregular shapes of the triangle. As a result, I didn’t known how to do this through the assemble_matrix functionality.

Cython solution of local quadratic interpolation

It is possible to call c++ objects and functions directly in cython, if you have the header files and provide necessary facility to link the object files. I had no idea how to do it for the very complex fenics project neither. If you include python objects (which is possible) in a cython program you will not enjoy any significant speedup so I implement some naive wrappers of the adjacency and mesh using the cdef class functionality

cdef class _AdjHelper():
    cdef int[:] _array
    cdef int[:] _offsets

    def __init__(self, np.ndarray[NP_INT,ndim=1] array, np.ndarray[NP_INT,ndim=1] offsets):
        self._array = np.array(array,dtype=np.int32) #this will take the array attribute of an AdjacencyList
        self._offsets = np.array(offsets,dtype=np.int32) #this will take the offsets attribute of an AdjacencyList

    cdef set[int] links(self, int i):
        cdef set[int] res = set[int]()
        cdef Py_ssize_t k_

        for k_ in range(self._offsets[i],self._offsets[i+1]):
            res.insert(self._array[k_])

        return res

and

cdef class _MeshHelper():
    cdef int dim
    cdef int ncells
    cdef NP_FLOAT[:,:] x
    cdef NP_FLOAT[:] h
    cdef _AdjHelper c2f
    cdef _AdjHelper f2c
    cdef _AdjHelper c2n
    cdef _AdjHelper f2n

    def __init__(self,
                int dim,
                int ncells,
                np.ndarray[NP_FLOAT,ndim=2] x, #nodes coordinates
                np.ndarray[NP_FLOAT] h, #triangle sizes
                _AdjHelper c2f, #cell to facet connectivity
                _AdjHelper f2c, #facet to cell
                _AdjHelper c2n, #cell to node
                _AdjHelper f2n) -> None: #facet to nodes
        
        self.dim = dim
        self.ncells = ncells
        self.x = np.array(x) 
        self.h = np.array(h)
        self.c2f = c2f
        self.f2c = f2c
        self.c2n = c2n
        self.f2n = f2n

This is possible due to the data-centric design of fenics, if I’m not wrong with the technical word. The NP_FLOAT[:] things are the memory_view implementations of cython and NP_FLOAT is sort of a macro for the np.float64, for those who are interested you can find details on cython website. The set[int] thing is the standard c++ set of integers as you can call c++ objects directly in cython if you have the access to header files. I use this because I needed to do set union and difference later.

So now I rewrite the loop described in the previous section entirely in cython by avoiding any python object. The many small 6 by 6 systems that are solved per triangle are performed by the low level linear algebra package lapack provided by scipy.

Finally I have achieved a significant speed up with this approach, now the runtime is at best of the same magnitude as solving the linear system of finite element.

Thank you for reading until the end of the post, I hope this could be situationally useful for you, and please feel free to give any comment and if you know how to call directly the c++ part of the fenicsx in cython please tell me ! Thank you so much.