I think I figured it out. You were right about the calls to M.assemble() being the issue. The problem already occurred at the first call to M.assemble(), just below where I'm doing the preallocation (before even going into the loop). Apparently, that assembly fixes the nnz structure per row, which of course doesn't correspond at all to the structure I am using later on. I had a misconception about what that assembly call actually does.
I'll do some more tests on my desktop at home tonight to get consistent results, and then post them here in case they are of use to anyone.
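A toy sketch of the mechanism described above, in plain Python. It is not PETSc's data structure: the class ToyRowStorage and its compact() method are hypothetical, and the sketch assumes that a final assembly discards preallocated slots that are still unused, so that later insertions have to grow the row again.

```python
class ToyRowStorage:
    """Toy stand-in for one sparse-matrix row with preallocated slots.

    Hypothetical bookkeeping only; this is not how PETSc stores a row.
    """

    def __init__(self, capacity):
        self.capacity = capacity  # slots reserved up front
        self.cols = set()         # column indices actually inserted
        self.mallocs = 0          # how often the row had to grow

    def insert(self, col):
        if col in self.cols:
            return                # overwriting an existing entry needs no new slot
        if len(self.cols) == self.capacity:
            self.capacity += 1    # no free slot left: grow by one
            self.mallocs += 1
        self.cols.add(col)

    def compact(self):
        """Assumed effect of a final assembly: drop all unused slots."""
        self.capacity = len(self.cols)


def fill(row, n_cols):
    """Insert columns 0..n_cols-1 and return the number of reallocations."""
    for col in range(n_cols):
        row.insert(col)
    return row.mallocs


# Preallocated for 35 entries and filled directly.
direct = ToyRowStorage(capacity=35)
mallocs_direct = fill(direct, 35)

# Same preallocation, but compacted while still empty (premature assembly).
premature = ToyRowStorage(capacity=35)
premature.compact()
mallocs_premature = fill(premature, 35)

print(mallocs_direct, mallocs_premature)  # 0 35
```

In this toy model the premature compaction throws the preallocation away, so every later insertion pays for a reallocation.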
Edit: Here are the results. For all computations, I've removed the call to M.assemble() from the loop and added a printout of M.getInfo() before and after the loop.
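The before/after printouts can be boiled down to a few numbers. Below is a minimal sketch, assuming the two dicts have the keys that appear in the M.getInfo() printouts of this post; the helper name summarize_info is hypothetical, and the sample values are copied from case (1).

```python
def summarize_info(before, after):
    """Compare two dicts shaped like the M.getInfo() printouts in this post."""
    allocated = after["nz_allocated"]
    return {
        # reallocations that happened between the two printouts
        "mallocs_in_loop": after["mallocs"] - before["mallocs"],
        "nz_used": after["nz_used"],
        # share of the allocated nonzeros that ended up unused
        "unused_fraction": after["nz_unneeded"] / allocated if allocated else 0.0,
    }


# Values copied from case (1); only the keys needed here are kept.
before = {"nz_allocated": 255025.0, "nz_used": 0.0,
          "nz_unneeded": 255025.0, "mallocs": 0.0}
after = {"nz_allocated": 1020100.0, "nz_used": 353005.0,
         "nz_unneeded": 667095.0, "mallocs": 51005.0}

summary = summarize_info(before, after)
print(summary)
```

A nonzero mallocs_in_loop is the signal that the preallocation did not match the insertion pattern.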
(1) With the default preallocation:
M.setUp()
M.assemble()
the results (with total_blocks=5) are:
{'block_size': 1.0, 'nz_allocated': 255025.0, 'nz_used': 0.0, 'nz_unneeded': 255025.0, 'memory': 4083628.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 3420
Time to assemble the first block: 0.39573144912719727
Time to assemble block 0 till block 5: 14.801470756530762
{'block_size': 1.0, 'nz_allocated': 1020100.0, 'nz_used': 353005.0, 'nz_unneeded': 667095.0, 'memory': 4083628.0, 'assemblies': 2.0, 'mallocs': 51005.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 32048
(2) With the default preallocation but without the pre-assembly:
M.setUp()
the results (with total_blocks=5) are:
{'block_size': 1.0, 'nz_allocated': 255025.0, 'nz_used': 0.0, 'nz_unneeded': 255025.0, 'memory': 3879608.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 984
Time to assemble the first block: 1.3053169250488281
Time to assemble block 0 till block 5: 28.810642957687378
{'block_size': 1.0, 'nz_allocated': 990100.0, 'nz_used': 353005.0, 'nz_unneeded': 637095.0, 'memory': 4083628.0, 'assemblies': 1.0, 'mallocs': 49005.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 38604
(So for the default preallocation, the pre-assembly call is beneficial: roughly 14.8 versus 28.8 for assembling blocks 0 to 5.)
(3) With manual assembly and:
M.setPreallocationNNZ( nz_per_row_avg + 3 )
the results (with total_blocks=25) are:
{'block_size': 1.0, 'nz_allocated': 2295225.0, 'nz_used': 0.0, 'nz_unneeded': 2295225.0, 'memory': 31626328.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 6504
Time to assemble the first block: 0.0023212432861328125
Time to assemble block 0 till block 5: 0.013171672821044922
Time to assemble block 5 till block 10: 0.011617422103881836
Time to assemble block 10 till block 15: 0.011052846908569336
Time to assemble block 15 till block 20: 0.015601873397827148
Time to assemble block 20 till block 25: 0.011806488037109375
{'block_size': 1.0, 'nz_allocated': 2295225.0, 'nz_used': 1765025.0, 'nz_unneeded': 530200.0, 'memory': 32646428.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 35300
(Note the +3 in the preallocation for safety. Without it, assembly is extremely slow, even much slower than before. As it stands, though, it significantly overestimates the nnz: 530200 of the 2295225 allocated nonzeros end up unused.)
(4) With manual assembly and perfect per-row nnz:
M.setPreallocationNNZ( nz_per_row )
The results are:
{'block_size': 1.0, 'nz_allocated': 1765025.0, 'nz_used': 0.0, 'nz_unneeded': 1765025.0, 'memory': 25263928.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 6264
Time to assemble the first block: 0.0023627281188964844
Time to assemble block 0 till block 5: 0.015277862548828125
Time to assemble block 5 till block 10: 0.015308618545532227
Time to assemble block 10 till block 15: 0.010660409927368164
Time to assemble block 15 till block 20: 0.010852336883544922
Time to assemble block 20 till block 25: 0.01159214973449707
{'block_size': 1.0, 'nz_allocated': 1765025.0, 'nz_used': 1765025.0, 'nz_unneeded': 0.0, 'memory': 26284028.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 29160
So despite the 530200 fewer unneeded preallocated nonzeros compared with case (3), there is no significant speed improvement.
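For reference, a per-row count like the nz_per_row used in case (4) can be obtained by counting the distinct columns in each row of the sparsity pattern. The following is a minimal sketch in plain Python; the helper count_nnz_per_row, its margin parameter, and the small tridiagonal pattern are hypothetical and are not taken from the actual project code.

```python
def count_nnz_per_row(entries, n_rows, n_cols, margin=0):
    """Count distinct column indices per row from (row, col) pairs.

    margin adds a safety allowance to every row; the result is capped
    at n_cols because a row cannot hold more entries than columns.
    """
    cols_in_row = [set() for _ in range(n_rows)]
    for row, col in entries:
        cols_in_row[row].add(col)  # duplicates are counted once
    return [min(len(cols) + margin, n_cols) for cols in cols_in_row]


# Hypothetical 4x4 tridiagonal pattern; (1, 1) appears twice on purpose.
entries = [(0, 0), (0, 1),
           (1, 0), (1, 1), (1, 2),
           (2, 1), (2, 2), (2, 3),
           (3, 2), (3, 3),
           (1, 1)]

exact = count_nnz_per_row(entries, n_rows=4, n_cols=4)
padded = count_nnz_per_row(entries, n_rows=4, n_cols=4, margin=1)
print(exact)   # [2, 3, 3, 2]
print(padded)  # [3, 4, 4, 3]
```

A list like exact corresponds to the perfect per-row preallocation of case (4), while padded trades a little memory for robustness against an undercounted row.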
(5) With naive splitting into diagonal and off-diagonal values:
diag_vals = max(nz_per_row)
off_diag_vals = max(nz_per_row)
M.setPreallocationNNZ( (diag_vals,off_diag_vals) )
the results are:
{'block_size': 1.0, 'nz_allocated': 1785175.0, 'nz_used': 0.0, 'nz_unneeded': 1785175.0, 'memory': 25505728.0, 'assemblies': 0.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation: 5648
Time to assemble the first block: 0.002238750457763672
Time to assemble block 0 till block 5: 0.014938831329345703
Time to assemble block 5 till block 10: 0.014675378799438477
Time to assemble block 10 till block 15: 0.011755943298339844
Time to assemble block 15 till block 20: 0.015692710876464844
Time to assemble block 20 till block 25: 0.01573467254638672
{'block_size': 1.0, 'nz_allocated': 1785175.0, 'nz_used': 1765025.0, 'nz_unneeded': 20150.0, 'memory': 26525828.0, 'assemblies': 1.0, 'mallocs': 0.0, 'fill_ratio_given': 0.0, 'fill_ratio_needed': 0.0, 'factor_mallocs': 0.0}
Memory used for preallocation plus assembly: 29772
Which for some odd reason has roughly 26x fewer 'nz_unneeded' than case (3): 20150 versus 530200.
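For context on the tuple form: in PETSc's parallel AIJ layout, the first value counts nonzeros in the diagonal block (columns inside the process's own range) and the second counts nonzeros in the off-diagonal block (all other columns). A minimal sketch of a less naive split in plain Python follows; the helper split_diag_offdiag is hypothetical, and it assumes a square matrix whose column ownership range equals the row ownership range [row_start, row_end).

```python
def split_diag_offdiag(entries, row_start, row_end):
    """Per-row nonzero counts inside and outside the owned column range.

    Assumes a square matrix where the owned columns are the same
    half-open range [row_start, row_end) as the owned rows.
    """
    n_local = row_end - row_start
    diag = [set() for _ in range(n_local)]
    off = [set() for _ in range(n_local)]
    for row, col in entries:
        if not (row_start <= row < row_end):
            continue  # row belongs to another process
        target = diag if row_start <= col < row_end else off
        target[row - row_start].add(col)
    return [len(s) for s in diag], [len(s) for s in off]


# Hypothetical 4x4 tridiagonal pattern.
entries = [(0, 0), (0, 1),
           (1, 0), (1, 1), (1, 2),
           (2, 1), (2, 2), (2, 3),
           (3, 2), (3, 3)]

# A process owning rows 0 and 1 out of 4.
d_part, o_part = split_diag_offdiag(entries, 0, 2)
print(d_part, o_part)  # [2, 2] [0, 1]

# A single process owning all rows: nothing falls off the diagonal block.
d_all, o_all = split_diag_offdiag(entries, 0, 4)
print(d_all, o_all)    # [2, 3, 3, 2] [0, 0, 0, 0]
```

On a single process every column falls inside the diagonal block, so the off-diagonal counts computed this way are all zero.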
For my own project, my conclusion is that I'll use a conservative estimate (i.e., a slight overestimate) of the nnz for each individual row. That's probably a good balance between speed and reliability.