Parallelize the row-zeroing operation

see:

and the more efficient version, which leverages a priori information about where the non-zeros are, as shown in: Modify matrix diagonal -- dolfinx version for A.ident_zeros() - #3 by dokken
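For illustration only, here is a minimal sketch of why zeroing rows parallelizes cleanly, assuming the matrix is stored in CSR form as plain Python lists. The function name `zero_rows_csr` and its `diag` and `max_workers` parameters are hypothetical and are not taken from dolfinx or the linked post. Each row owns a disjoint slice of the `data` array, so rows can be processed independently without locking.

```python
from concurrent.futures import ThreadPoolExecutor


def zero_rows_csr(indptr, indices, data, rows, diag=0.0, max_workers=4):
    """Zero the given rows of a CSR matrix in place (hypothetical helper).

    If a row stores an entry on the diagonal, that entry is set to `diag`;
    every other stored entry in the row is set to 0.0. The sparsity pattern
    (indptr, indices) is left unchanged.
    """

    def zero_one(row):
        # The stored entries of `row` occupy data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            data[k] = diag if indices[k] == row else 0.0

    # Distinct rows touch disjoint slices of `data`, so no locking is needed.
    # set(rows) drops duplicates so each row is handled by exactly one task.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(zero_one, set(rows)))


# 3x3 example: [[1, 2, 0], [0, 3, 4], [5, 0, 6]]
indptr = [0, 2, 4, 6]
indices = [0, 1, 1, 2, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
zero_rows_csr(indptr, indices, data, rows=[1], diag=1.0)
print(data)  # row 1 becomes [0, 1, 0]
```

Python threads will not give a real speedup here because of the GIL; the sketch only shows that the per-row work is independent, which is the property a compiled or threaded implementation could exploit. Knowing in advance where the non-zeros sit in each row (for example, the position of the diagonal entry) would let an implementation skip the `indices[k] == row` scan entirely.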