When calling the FEniCS program, there is no parallel or multithreaded execution

Hello everyone! I try to use scipy minimization package to solve a minimization problems, where I will use some fenics programs to define an object function and its gradient function. The code is like this:

def objective_and_gradient(ks_flat):
    k_s.x.array[:] = ks_flat
    u_now, data_error_norm = heat_solver(ks_flat)
    r_delta, ks_delta, k0_delta = D_adjoint(ks_flat)
    grad = ks_delta 
    return data_error_norm, grad

result = minimize(
    fun=objective_and_gradient,
    x0=ks_initial,
    method='L-BFGS-B',
    jac=True, 
    options={'maxiter': N_k, 'disp': True}
)

Here heat_solver and D_adjoint are two solvers using fenics to solve heat equations. If I run these two functions on the server, it will utilize all available CPUs, but when I use the aforementioned program during SciPy optimization, it only uses one thread.
A simplified code of heat_solver is

def heat_solver(ks):
    dt = (T - t) / num_steps  
    u_sol=np.zeros((num_steps, (k*nx+1)*(k*ny+1)))
    domain=create_rectangle(MPI.COMM_WORLD, [np.array([0, 0]), np.array([Lx, Ly])],
                               [nx, ny],  mesh.CellType.triangle)
    V = fem.functionspace(domain, ("Lagrange", k))
    tdim = domain.topology.dim
    fdim = tdim - 1
    domain.topology.create_connectivity(fdim, tdim)
    u_n = fem.Function(V)
    kappa = fem.Function(V)
    kappa.x.array[:] = ks
    u_n.x.array[:] = u_ini
    f = fem.Function(V)
    solver = PETSc.KSP().create(domain.comm)
    solver.setType(PETSc.KSP.Type.GMRES)
    solver.getPC().setType(PETSc.PC.Type.ILU)
    uh = fem.Function(V)
    error=0
    uh = fem.Function(V)
    integral_form=form((uh-u_n)*(uh-u_n)*dx)
    for n in range(num_steps):
        f.x.array[:] = f_source[n,:]
        u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
        F = u * v * ufl.dx +dt *kappa * ufl.dot(ufl.grad(u), ufl.grad(v)) * ufl.dx - (u_n + dt * f) * v * ufl.dx
        a = fem.form(ufl.lhs(F))
        L = fem.form(ufl.rhs(F))
        A = assemble_matrix(a)
        A.assemble()
        b = create_vector(L)
        solver.setOperators(A)
        with b.localForm() as loc_b:
            loc_b.set(0)
        assemble_vector(b, L)
        solver.solve(b, uh.vector)
        uh.x.scatter_forward()
        u_n.x.array[:] = uh.x.array
        u_sol[n,:] = uh.x.array
        uh.x.array[:] = u_data[n,:]
        error += dt*domain.comm.allreduce(assemble_scalar(integral_form), op=MPI.SUM)
    return u_sol, np.sqrt(error)

I hope that someone can tell me how to change my code such that it will still utilize all available CPUs when I use scipy minimization functions.

Seems to me like this is a scipy question. Determine first if scipy is able to run in a multithreaded way (asking on the appropriate forum), and once you have determined this please get in touch with me to have the thread reopened.