Is it possible to assemble in parallel without partitioning?

Hello, I am curious whether it is possible to parallelize assembly/solving (using the PETSc backend) in FEniCS without partitioning the mesh. The tl;dr is that I am trying to parallelize a mixed-domain system, but there is a bug in the assembly code when the mesh is partitioned - I have tried to tackle it, but I'm realizing it is a bit beyond my depth.

The workaround I was thinking of is this: since the problem is already compartmentalized by having mixed domains (i.e., it has a block Jacobian and residual vector), and each block is constructed one at a time, it seems feasible to send the assembly of different blocks to different processors.
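To make the block structure I mean concrete, here is a toy stand-in for the bookkeeping (plain floats in place of assembled block matrices; the keys and numbers are invented for illustration):

```python
from collections import defaultdict

# Toy stand-in: J_ij is the sum of the contributions J_ijk over all
# integration domains k. Floats stand in for assembled block matrices.
Jijk = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (0, 1, 0): 4.0, (1, 1, 0): 5.0}

Jij = defaultdict(float)
for (i, j, k), contrib in Jijk.items():
    Jij[(i, j)] += contrib

print(dict(Jij))  # {(0, 0): 3.0, (0, 1): 4.0, (1, 1): 5.0}
```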

This is kind of what I am thinking of (partly pseudocode), but I'm not sure how to actually implement it:

import dolfin as d
from petsc4py import PETSc

# I believe I need to specify the MPI communicator here to ensure it doesn't partition?
mesh = d.Mesh('my_mesh.h5')
V    = d.FunctionSpace(mesh, ...)
u    = d.Function(V)                     # If the mesh is partitioned then this will also be partitioned
v    = d.TestFunction(V)
F    = u*v*d.dx + ...                    # my monolithic residual form
J    = d.derivative(F, u)                # Jacobian

Fblocks = get_blocks(F)                  # (pseudocode) residual blocks separated by domain (j)
Jblocks = get_blocks(d.derivative(F, u)) # (pseudocode) Jacobian blocks separated by F domain (i),
                                         # u domain (j), and domain of integration (k);
                                         # Jij is the sum of Jijk over all k

# =============== Serial version of mixed assembly + SNES code ===============
pF = PETSc.Vec()  # petsc4py vector for F
pu = PETSc.Vec()  # petsc4py vector for u
pJ = PETSc.Mat()  # petsc4py nest matrix for J (e.g. from Mat().createNest(...))

class SNESProblem:
    def __init__(self, pF, pu, pJ):
        self.pF, self.pu, self.pJ = pF, pu, pJ

    def assemble_F(self):
        for Fj in Fblocks:
            d.assemble_mixed(Fj, tensor=self.pF)

    def assemble_J(self):
        for Jijk in Jblocks:
            d.assemble_mixed(Jijk, tensor=self.pJ)


# =============== Parallel version of mixed assembly + SNES code??? ===============
rank = d.MPI.comm_world.rank

class SNESProblem:
    def __init__(self, pF, pu, pJ):
        self.pF, self.pu, self.pJ = pF, pu, pJ

    def assemble_F(self):
        # Each rank assembles only the residual block it owns
        d.assemble_mixed(Fblocks[rank], tensor=self.pF)

    def assemble_J(self):
        # Each rank assembles only the Jacobian blocks it owns
        for Jij in Jblocks[rank]:
            d.assemble_mixed(Jij, tensor=self.pJ)
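The `Fblocks[rank]` / `Jblocks[rank]` indexing above presumes the blocks were already handed out to ranks somewhere; a minimal round-robin sketch of that hand-out (the block keys and rank count here are invented for illustration):

```python
# Hypothetical hand-out of blocks to MPI ranks, round-robin.
def assign_blocks(block_keys, n_ranks):
    """Map each block key (i, j) to the rank that will assemble it."""
    return {key: idx % n_ranks for idx, key in enumerate(block_keys)}

blocks = [(0, 0), (0, 1), (1, 0), (1, 1)]
owners = assign_blocks(blocks, n_ranks=2)
print(owners)  # {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
```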


Does something like this make sense, or would the cost of communication outweigh any benefits? I have a lot to learn regarding MPI, but I've read that you can use one-sided communication to create an effective shared memory - could each of the other CPUs use this to access the assembly instructions? Or is it possible to generate all the FFC files on root and then tell each processor to use a specific one? It seems like parallelizing over block matrices could easily lead to imbalanced loads and inefficient memory sharing, but I'm guessing something like this would still beat serial on large problems.
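On the load-imbalance worry: even before touching MPI, the hand-out itself could use a greedy longest-processing-time heuristic - give each block, largest estimated cost first, to the currently least-loaded rank. A sketch with invented per-block costs (the keys and cost numbers are made up; in practice the cost estimate might be the number of cells integrated over):

```python
import heapq

def balance_blocks(costs, n_ranks):
    """Greedy LPT: hand each block (largest estimated cost first)
    to the currently least-loaded rank. costs: {block_key: cost}."""
    loads = [(0.0, r) for r in range(n_ranks)]
    heapq.heapify(loads)
    assignment = {}
    for key, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(loads)
        assignment[key] = r
        heapq.heappush(loads, (load + cost, r))
    return assignment

# Invented cost estimates for four Jacobian blocks
costs = {'J00': 4.0, 'J01': 3.0, 'J10': 2.0, 'J11': 1.0}
print(balance_blocks(costs, n_ranks=2))  # {'J00': 0, 'J01': 1, 'J10': 1, 'J11': 0}
```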

I am curious whether mesh-partitioning-free parallel assembly like this is common, or whether it is just too inefficient. Once assembly is complete, it seems like SNES should be able to solve in parallel. If anyone has experience with this or could point me to an example, I'd really appreciate it - thank you!

BTW: I wanted to keep this example to a minimum, so I replaced some calls with pseudocode, but I'm happy to elaborate. I have working code for solving mixed-dimensional nonlinear problems using SNES, but there are a lot of details that might be unnecessary.