Assembly of FEniCS submatrices in large PETSc matrix: preallocation

Hello,

I suspect the issue lies with PETSc rather than with FEniCS, but given the problem set-up I think this question is most relevant to FEniCS users.

Suppose I wish to assemble a large system matrix from different blocks, each assembled by FEniCS (e.g. for domain decomposition, hybridization, multimesh methods, or multiphysics coupling). I notice a rapid slowdown as I repeatedly copy the FEniCS matrix into the large matrix. Here is an MWE that illustrates my problem (in reality I don't actually have a purely block-diagonal system):

from dolfin import *
import numpy as np
from petsc4py import PETSc
import resource
import time

# Parameters
MANUAL_PREALLOCATION = True
width = 1
elements_x_per_block = 100
total_blocks = 15

# Generate the block matrix
mesh = UnitSquareMesh(elements_x_per_block,elements_x_per_block)
V = FunctionSpace(mesh,'CG',1)
phi = TrialFunction(V)
v = TestFunction(V)
A_block = as_backend_type( assemble( inner(grad(phi),grad(v))*dx ) )
A_size = A_block.size(0)

# "Preallocate" the large matrix
memory_start = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
M_size = A_size*total_blocks
M = PETSc.Mat().create()
M.setSizes(M_size)
M.setType('aij')
if MANUAL_PREALLOCATION:
    # Manually set the anticipated nz values per row
    M.setPreallocationNNZ( A_block.nnz()//A_size )
    # This option seems to defeat the point but mitigates an error
    M.setOption(PETSc.Mat.Option.NEW_NONZERO_ALLOCATION_ERR, False)
else:
    M.setUp()
M.assemble()
print("Memory used for preallocation:",resource.getrusage(resource.RUSAGE_SELF).ru_maxrss-memory_start)

# Place blocks
time_start = time.time()
for block_nr in range(total_blocks):
    offset = block_nr*A_size
    
    # CSR arrays from A_block matrix
    rows_ind,cols,vals = A_block.mat().getValuesCSR()
    
    # Creating the CSR arrays for A matrix
    cols_to = cols + offset
    rows_ind_to = np.ones(M_size+1,dtype=np.int32)*len(vals)
    rows_ind_to[0:offset] = 0
    rows_ind_to[offset:offset+len(rows_ind)-1] = rows_ind[0:-1]
    
    # Placing into A matrix
    M.setValuesCSR(rows_ind_to, cols_to, vals, addv=PETSc.InsertMode.ADD)
    M.assemble()
    
    # Output
    if (block_nr+1)%5 == 0:
        time_end = time.time()
        print("Time to assemble block %i till block %i:"%(block_nr-4,block_nr+1), time_end-time_start)
        time_start = time_end
    if block_nr == 0:
        print("Time to assemble the first block:", time.time()-time_start)
print("Memory used for preallocation plus assembly:",resource.getrusage(resource.RUSAGE_SELF).ru_maxrss-memory_start)

with output:

Memory used for preallocation: 4272
Time to assemble the first block: 0.8050854206085205
Time to assemble block 0 till block 5: 5.991081237792969
Time to assemble block 5 till block 10: 16.175475358963013
Time to assemble block 10 till block 15: 54.92933130264282
Memory used for preallocation plus assembly: 57696

I expect this is due to faulty preallocation of the large matrix. However, I get (roughly) the same result whether I set “MANUAL_PREALLOCATION” to True or False in the above code snippet. Am I doing this wrong?

Setting NEW_NONZERO_ALLOCATION_ERR to False is likely the cause of your expense: it negates any benefit you would otherwise observe from preallocation. I suggest you work through your code to ensure your sparsity pattern is well defined.

Also, you're calling M.assemble() many times during the loop. Shouldn't you only need to do this once at the end, after you've passed in all the entries (since you've already specified the insert mode PETSc.InsertMode.ADD)?
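For reference, a minimal sketch of that suggestion applied to the loop above (same variables as in the MWE; the CSR extraction is hoisted out of the loop since A_block never changes):

rows_ind, cols, vals = A_block.mat().getValuesCSR()  # CSR arrays of the block
for block_nr in range(total_blocks):
    offset = block_nr*A_size
    # CSR arrays for the full matrix, with the block shifted to its offset
    cols_to = cols + offset
    rows_ind_to = np.ones(M_size+1, dtype=np.int32)*len(vals)
    rows_ind_to[0:offset] = 0
    rows_ind_to[offset:offset+len(rows_ind)-1] = rows_ind[0:-1]
    M.setValuesCSR(rows_ind_to, cols_to, vals, addv=PETSc.InsertMode.ADD)
M.assemble()  # single final assembly instead of one per block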

Thanks for taking a look, Nate.
When I do not set the NEW_NONZERO_ALLOCATION_ERR option to False I always get an error ("petsc4py.PETSc.Error: error code 63"), even if I call setPreallocationNNZ with a ridiculously high value. I actually can't make any sense of the memory required for preallocation versus the value(s) that I give as nonzeros. It almost seems that PETSc is completely ignoring the setPreallocationNNZ call.
In either case, NEW_NONZERO_ALLOCATION_ERR=False shouldn't hurt (this much), right? It is just an indicator that something is going wrong in the preallocation.

I initially had M.assemble() outside the loop in my MWE, but that was much slower (roughly 10 times). I'm guessing the amount and inefficiency of the storage required to "remember where to put values later" outweighs the cost of copying into M multiple times.

It's not sufficient to preallocate only the number of nonzeros; one must also define the sparsity pattern.

It's a pretty involved process; see, for example, the MatCreateAIJ documentation or PETScMatrix::init.
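For the purely block-diagonal MWE above, the exact pattern can in fact be handed to PETSc via setPreallocationCSR. A minimal sketch (not from the thread, and specific to this block-diagonal layout; it reuses the variables of the MWE and should be called right after setType, before any values are inserted):

rows_ind, cols, vals = A_block.mat().getValuesCSR()
nnz_block = len(cols)  # nonzeros per block

# Global row pointer: each block repeats the block's row pointer, shifted by the
# nonzeros of the preceding blocks; the final entry closes the CSR structure.
rows_ind_M = np.concatenate(
    [rows_ind[:-1] + b*nnz_block for b in range(total_blocks)]
    + [np.array([total_blocks*nnz_block], dtype=rows_ind.dtype)]
)
# Global column indices: each block's columns shifted by its DOF offset.
cols_M = np.concatenate([cols + b*A_size for b in range(total_blocks)])

M.setPreallocationCSR((rows_ind_M, cols_M))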


Thanks for pointing at the PETScMatrix source code, that’s quite instructive.

While I am sure my MWE can be optimized considerably with more sophisticated preallocation techniques, I'd be happy just getting the bare minimum to work. A factor-of-10 slowdown after 10 iterations should be avoidable even with simple approaches.

The PETSc.Mat() instance has the following methods related to preallocation:

setPreallocationCSR
setPreallocationDense
setPreallocationNNZ

I focused on the last of these, as the CSR preallocation will likely get too complex for my use case. setPreallocationNNZ can take four different types of input:

a single integer: the 'average' number of nonzeros per row
an array: the number of nonzeros per individual row
a two-tuple of integers: the average nonzeros in the diagonal part and in the off-diagonal part
a two-tuple of arrays: the per-row nonzeros in the diagonal part and in the off-diagonal part

For this MWE the exact sparsity pattern is known. I tried the following, to no avail:

nz_per_row_avg = A_block.nnz()//A_size
M.setPreallocationNNZ( nz_per_row_avg  )

and

nz_per_row = np.hstack( [np.asarray( [ len( A_block.getrow(row)[0] ) for row in range(A_size) ] ,np.int32)]*total_blocks )
M.setPreallocationNNZ( nz_per_row  )

and

diag_vals = 4
off_diag_vals = max(nz_per_row)
M.setPreallocationNNZ( (diag_vals,off_diag_vals) )

and

nz_per_row_diag = np.minimum(nz_per_row,diag_vals)
nz_per_row_offdiag = nz_per_row-nz_per_row_diag
M.setPreallocationNNZ( (nz_per_row_diag,nz_per_row_offdiag) )

I think I figured it out. You were right about the calls to M.assemble() being the issue. The problem was already with the first call to M.assemble(), just below the preallocation (before even entering the loop). Apparently, that assembly fixes the matrix's per-row nonzero structure, which of course doesn't correspond at all to the structure I fill in later. I had a misconception about what that assembly call actually does.
I'll do some more tests on my desktop at home tonight to get consistent results, and then post them here in case they are of use to anyone.
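For concreteness, a sketch of the corrected setup (same variables as before; the key change is that the empty matrix is preallocated but no longer assembled, so the preallocated structure survives until the blocks are inserted):

M = PETSc.Mat().create()
M.setSizes(M_size)
M.setType('aij')
M.setPreallocationNNZ(nz_per_row_avg + 3)  # slight per-row overestimate, see case (3) below
# no M.assemble() here; assemble once, after the insertion loop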

Edit: here are the results. For all computations, I've removed the call to M.assemble() from the loop and added a printout of M.getInfo() before and after the loop.

(1) With the default preallocation:

M.setUp()
M.assemble()

the results (with total_blocks=5) are:

{'block_size': 1.0, 'nz_allocated': 255025.0, 'nz_used': 0.0, 'nz_unneeded': 255025.0, 'memory': 4083628.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 3420
Time to assemble the first block: 0.39573144912719727
Time to assemble block 0 till block 5: 14.801470756530762
{'block_size': 1.0, 'nz_allocated': 1020100.0, 'nz_used': 353005.0, 'nz_unneeded': 667095.0, 'memory': 4083628.0, 'assemblies': 2.0, 'mallocs': 51005.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 32048

(2) With the default preallocation but without the pre-assembly:

M.setUp()

the results (total_blocks=5) are:

{'block_size': 1.0, 'nz_allocated': 255025.0, 'nz_used': 0.0, 'nz_unneeded': 255025.0, 'memory': 3879608.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 984
Time to assemble the first block: 1.3053169250488281
Time to assemble block 0 till block 5: 28.810642957687378
{'block_size': 1.0, 'nz_allocated': 990100.0, 'nz_used': 353005.0, 'nz_unneeded': 637095.0, 'memory': 4083628.0, 'assemblies': 1.0, 'mallocs': 49005.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 38604

(So with default preallocation, intermediate assembly calls are beneficial.)

(3) With manual preallocation and:

M.setPreallocationNNZ( nz_per_row_avg + 3 )

the results (total_blocks=25) are:

{'block_size': 1.0, 'nz_allocated': 2295225.0, 'nz_used': 0.0, 'nz_unneeded': 2295225.0, 'memory': 31626328.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 6504
Time to assemble the first block: 0.0023212432861328125
Time to assemble block 0 till block 5: 0.013171672821044922
Time to assemble block 5 till block 10: 0.011617422103881836
Time to assemble block 10 till block 15: 0.011052846908569336
Time to assemble block 15 till block 20: 0.015601873397827148
Time to assemble block 20 till block 25: 0.011806488037109375
{'block_size': 1.0, 'nz_allocated': 2295225.0, 'nz_used': 1765025.0, 'nz_unneeded': 530200.0, 'memory': 32646428.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 35300

(Note the +3 in the preallocation for safety. Without it, assembly is extremely slow, even much slower than before. The downside is that the nnz is now significantly overestimated.)

(4) With manual preallocation and the exact per-row nnz:

M.setPreallocationNNZ( nz_per_row )

The results are:

{'block_size': 1.0, 'nz_allocated': 1765025.0, 'nz_used': 0.0, 'nz_unneeded': 1765025.0, 'memory': 25263928.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 6264
Time to assemble the first block: 0.0023627281188964844
Time to assemble block 0 till block 5: 0.015277862548828125
Time to assemble block 5 till block 10: 0.015308618545532227
Time to assemble block 10 till block 15: 0.010660409927368164
Time to assemble block 15 till block 20: 0.010852336883544922
Time to assemble block 20 till block 25: 0.01159214973449707
{'block_size': 1.0, 'nz_allocated': 1765025.0, 'nz_used': 1765025.0, 'nz_unneeded': 0.0, 'memory': 26284028.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 29160

So despite 530200 fewer unneeded preallocated nonzeros, there is no serious speed improvement.

(5) With naive splitting into diagonal and off-diagonal values:

diag_vals = max(nz_per_row)
off_diag_vals = max(nz_per_row)
M.setPreallocationNNZ( (diag_vals,off_diag_vals) )

the results are:

{'block_size': 1.0, 'nz_allocated': 1785175.0, 'nz_used': 0.0, 'nz_unneeded': 1785175.0, 'memory': 25505728.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 5648
Time to assemble the first block: 0.002238750457763672
Time to assemble block 0 till block 5: 0.014938831329345703
Time to assemble block 5 till block 10: 0.014675378799438477
Time to assemble block 10 till block 15: 0.011755943298339844
Time to assemble block 15 till block 20: 0.015692710876464844
Time to assemble block 20 till block 25: 0.01573467254638672
{'block_size': 1.0, 'nz_allocated': 1785175.0, 'nz_used': 1765025.0, 'nz_unneeded': 20150.0, 'memory': 26525828.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 29772

Oddly, this leaves roughly 26x fewer unneeded nonzeros ("nz_unneeded") than case (3); presumably that is because only the diagonal-part value of the tuple is used for this serial AIJ matrix, so 7 slots per row are allocated instead of the 9 of case (3).

For my own project, my conclusion is that I'll use a conservative estimate (i.e. a slight overestimate) of the number of nonzeros per individual row. That's probably a good balance between speed and reliability.
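As a final sketch of that strategy for the MWE above (the safety margin of 2 is just an illustrative choice, not a value from the thread): take the exact per-row nonzero counts of the block from its CSR row pointer, tile them over all blocks, and pad each row slightly.

rows_ind_block = A_block.mat().getValuesCSR()[0]
nz_per_row_block = np.diff(rows_ind_block)  # exact nonzeros per row of one block
safety_margin = 2  # illustrative padding, not from the thread
nz_conservative = np.tile(nz_per_row_block, total_blocks).astype(np.int32) + safety_margin
M.setPreallocationNNZ(nz_conservative)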