Interpolation between non-matching meshes is extremely slow in dolfinx 0.7.1

Hi,
I just updated dolfinx from 0.6 to 0.7.1. After porting my code to the new version, I find that interpolation between non-matching meshes is much slower than before: about 10 times slower when running on 24 MPI processes, and about 70 times slower in serial. Here is my test code.

from mpi4py import MPI
from dolfinx import mesh, fem
import ufl
import time

msh_old = mesh.create_rectangle(comm=MPI.COMM_WORLD,
                            points=((0.0, 0.0), (4.0, 4.0)), n=(350, 350),
                            cell_type=mesh.CellType.triangle)

msh_new = mesh.create_rectangle(comm=MPI.COMM_WORLD,
                            points=((1.0, 1.0), (5.0, 5.0)), n=(400, 400),
                            cell_type=mesh.CellType.triangle)

P_old     = ufl.FiniteElement("CG", msh_old.ufl_cell(), 1)
P_new     = ufl.FiniteElement("CG", msh_new.ufl_cell(), 1)

W_old     = fem.FunctionSpace(msh_old, P_old)
W_new     = fem.FunctionSpace(msh_new, P_new)

f_old     = fem.Function(W_old)  
f_new     = fem.Function(W_new)  

f_old.interpolate(lambda x: x[0]**2+x[1]**2)

# Time only the non-matching interpolation step.
t_begin = time.time()
# for 0.7.1 use:
f_new.interpolate(
    f_old,
    nmm_interpolation_data=fem.create_nonmatching_meshes_interpolation_data(
        f_new.function_space.mesh._cpp_object,
        f_new.function_space.element,
        f_old.function_space.mesh._cpp_object))
# for 0.6 use:
# f_new.interpolate(f_old)
t_end = time.time()
print("time:  ", t_end - t_begin)

Since my code needs to do non-matching interpolation frequently, this slowdown is a real headache for me.
Is there any way to speed it up? Or can I use the previous interpolation function in 0.7.1? Thanks very much!

See pull request #2862 in FEniCS/dolfinx on GitHub ("Improve collision detection", by jorgensd), which explains the fix introduced in 0.7.2. It was combined with pull request #2858 ("Speed up non-matching interpolation data and add extrapolation parameter", also by jorgensd).
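
If you want your script to warn you when it runs on a release older than the one with the fix, something like the following could work. This is only a sketch: the helper names `parse_release` and `has_interpolation_fix` are made up for illustration and are not part of dolfinx, and the only fact taken from this thread is that the fix arrived in 0.7.2. Pre-release suffixes such as `.dev0` are ignored, so treat the result for development builds as a guess.

```python
import re

# Release named in the reply above as the one that introduced the fix.
FIXED_IN = (0, 7, 2)


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric part of a version string as a tuple of ints.

    Anything after the numeric release (for example ".dev0") is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_interpolation_fix(version: str) -> bool:
    """True if the version is at least the release that shipped the fix."""
    return parse_release(version) >= FIXED_IN
```

With dolfinx installed you would call it as `has_interpolation_fix(dolfinx.__version__)` and print a warning when it returns `False`.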