As you have not supplied any code, it is hard to pinpoint why this happens in your case.
I think one thing you are not taking into account is the setup cost of the sparse matrices.
See the following example, which illustrates run-times when you only assemble over a small subset of your mesh.
from dolfin import *

N = 400
frac = 0.02
mesh = UnitSquareMesh(N, N)
V = FunctionSpace(mesh, "CG", 1)

# Mark the cells with x[0] < frac, i.e. a small fraction of the mesh
class domain(SubDomain):
    def inside(self, x, on_boundary):
        return x[0] < frac

mf = MeshFunction("size_t", mesh, mesh.topology().dim(), 0)
domain().mark(mf, 1)
dx_c = Measure("dx", domain=mesh, subdomain_data=mf)

u = TrialFunction(V)
v = TestFunction(V)

# Form over the whole mesh; Form() precompiles it, so JIT compilation
# does not pollute the timings below
a = inner(grad(u), grad(v))*dx_c
a_form = Form(a)
A = assemble(a_form)

# The same form restricted to the marked subdomain
a_sub = inner(grad(u), grad(v))*dx_c(1)
a_sub_form = Form(a_sub)
A_sub = assemble(a_sub_form)

import timeit
runs = 100
# Re-assemble into the pre-allocated tensors so matrix creation is excluded
full = timeit.timeit(lambda: assemble(a_form, tensor=A), number=runs)
sub = timeit.timeit(lambda: assemble(a_sub_form, tensor=A_sub), number=runs)
print(sub/full, frac, sub, frac)
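As an aside, you can also measure the one-off setup cost directly: a plain assemble call allocates a fresh matrix (including building its sparsity pattern), while passing tensor= re-uses the existing allocation. A minimal sketch, re-using a_sub_form and A_sub from the script above:

import timeit
# First call allocates a new matrix and builds its sparsity pattern
setup = timeit.timeit(lambda: assemble(a_sub_form), number=1)
# Re-assembly into the already allocated tensor skips that setup
reuse = timeit.timeit(lambda: assemble(a_sub_form, tensor=A_sub), number=1)
print(setup, reuse)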
With frac = 0.02, the timing script above returns:
0.028362406769913293 0.02 0.20799705399986124 0.02
If you reduce the fraction even further, for instance to 0.001, you will observe that the speedup no longer matches the fraction of cells, as we get:
0.01412187111184388 0.001 0.1051290079994942 0.001
This is because there is a fixed cost associated with assembling a matrix. Assembling over only 10 cells is very little work, so the setup cost dominates the run-time.
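If you want to check how many cells actually enter the subdomain assembly, you can count the marked entries of the MeshFunction; a quick sanity check, assuming the script above was run with the smaller fraction:

import numpy as np
# Number of cells carrying the subdomain marker 1
num_marked = int(np.sum(mf.array() == 1))
print(num_marked, "of", mesh.num_cells(), "cells marked")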