Read_meshtags causing invalid rank issues

Hi, I have a 3D mesh with mesh tags that I've converted from gmsh. I import it as follows:

from mpi4py import MPI
from dolfinx.io import XDMFFile

print('loading mesh')
with XDMFFile(MPI.COMM_WORLD, './mesh/AoA0_v2naca0012_AR2.xdmf', 'r') as xdmf:
    mesh = xdmf.read_mesh(name='Grid')
    MPI.COMM_WORLD.Barrier()
    print('loaded mesh', flush=True)
    ct = xdmf.read_meshtags(mesh, name='Grid')
    print('loaded mesh tags', flush=True)
MPI.COMM_WORLD.Barrier()
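
For anyone wanting to reproduce the conversion step: a meshio-based conversion along the following lines produces this kind of tagged XDMF file (a sketch with placeholder file names, not necessarily identical to the script I used):

# Sketch of a meshio-based gmsh -> XDMF conversion (placeholder file names).
# Assumes a tetrahedral mesh with physical groups written by gmsh.
import meshio

msh = meshio.read('mesh.msh')

# Extract the tetrahedra and their 'gmsh:physical' tags into a new mesh
cells = msh.get_cells_type('tetra')
cell_data = msh.get_cell_data('gmsh:physical', 'tetra')
tet_mesh = meshio.Mesh(points=msh.points,
                       cells={'tetra': cells},
                       cell_data={'name_to_read': [cell_data]})
meshio.write('mesh_tets.xdmf', tet_mesh)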

I launch it with mpirun -n 1 python3 test.py. For some reason the import fails on some meshes, even though the only change between them was the value I gave to the cell sizing in gmsh.model.geo.add_point(x, y, z, sizing) for a number of the defined points.
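
To be concrete about the sizing change, each geometry point takes a target element size as its last argument, along these lines (a minimal sketch with made-up coordinates, written against the official gmsh Python API, where the call is addPoint):

# Sketch of the kind of sizing change between v1 and v2 (made-up coordinates).
import gmsh

gmsh.initialize()
gmsh.model.add('naca')

sizing = 0.05  # the only value that differs between the two meshes
p1 = gmsh.model.geo.addPoint(0.0, 0.0, 0.0, sizing)
p2 = gmsh.model.geo.addPoint(1.0, 0.0, 0.0, sizing)
# ... remaining points, curves, surfaces, volumes and physical groups ...

gmsh.model.geo.synchronize()
gmsh.model.mesh.generate(3)
gmsh.write('mesh.msh')
gmsh.finalize()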

Here I have provided the two meshes on Google Drive.

The one labelled v1 runs to completion, albeit with [WARNING] yaksa: 1 leaked handle pool objects, but the one labelled v2 gives the following output:

loading mesh
loaded mesh
Invalid rank, error stack:
internal_Issend(118): MPI_Issend(buf=0x7ffdfdd813b3, count=1, MPI_BYTE, 1, 1, comm=0x84000001, request=0x55fb920560c4) failed
internal_Issend(78).: Invalid rank has value 1 but must be nonnegative and less than 1
Abort(943321862) on node 0 (rank 0 in comm 464): application called MPI_Abort(comm=0x84000001, 943321862) - process 0

If I don't put flush=True on the print calls, it doesn't even print 'loaded mesh'.

This occurs with a conda install of the dev version from early this year, which uses the conda-installed OpenMPI alongside the Intel MPI (impi) installed on the cluster. I also tried a later spack install that uses the OpenMPI installed on the cluster, and this is the output:

MPI_ERR_RANK: invalid rank
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 5 DUP FROM 3
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I also sometimes see the following when trying a denser mesh with multiple processes spread over multiple nodes:

Invalid rank, error stack:
internal_Issend(118): MPI_Issend(buf=0x7ffca70f1883, count=1, MPI_BYTE, 264, 1, comm=0xc4000001, request=0x5574f8faaad4) failed
internal_Issend(78).: Invalid rank has value 264 but must be nonnegative and less than 264
Abort(741995270) on node 188 (rank 188 in comm 416): application called MPI_Abort(comm=0xC4000001, 741995270) - process 188
Invalid rank, error stack:
internal_Issend(118): MPI_Issend(buf=0x7ffd923518c3, count=1, MPI_BYTE, 264, 1, comm=0xc4000001, request=0x563fa4f59944) failed
internal_Issend(78).: Invalid rank has value 264 but must be nonnegative and less than 264
Abort(406450950) on node 235 (rank 235 in comm 416): application called MPI_Abort(comm=0xC4000001, 406450950) - process 235
Invalid rank, error stack:
internal_Issend(118): MPI_Issend(buf=0x7ffff129ebf3, count=1, MPI_BYTE, 264, 1, comm=0xc4000001, request=0x55722c38b770) failed
internal_Issend(78).: Invalid rank has value 264 but must be nonnegative and less than 264
Abort(876212998) on node 242 (rank 242 in comm 416): application called MPI_Abort(comm=0xC4000001, 876212998) - process 242
Abort(413215375) on node 242 (rank 242 in comm 448): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(84).......................: MPI_Barrier(comm=0x84000007) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(224).................: 
MPIR_Bcast_impl(444).......................: 
MPIR_Bcast_allcomm_auto(370)...............: 
MPIR_Bcast_intra_binomial(105).............: 
MPIC_Recv(187).............................: 
MPIC_Wait(64)..............................: 
MPIR_Wait_state(886).......................: 
MPID_Progress_wait(335)....................: 
MPIDI_progress_test(158)...................: 
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 282872 RUNNING AT n688
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:14@n679] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:14@n679] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:14@n679] main (proxy/pmip.c:127): demux engine error waiting for event

Finally, I tried it with a spack install of dolfinx 0.6.0 on a different cluster, with a spack-installed OpenMPI, and it reports a segmentation violation. On one occasion, though, a mesh that hadn't been working got past the segmentation violation and then segfaulted on the read_meshtags call for the facets.
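
For reference, that facet-tag read is along these lines (a sketch: the facet file name is a placeholder, the create_connectivity call is the usual prerequisite for reading facet tags, and `mesh` is the mesh read in the snippet at the top of the post):

# Sketch of the facet-tag read (placeholder facet file name).
from mpi4py import MPI
from dolfinx.io import XDMFFile

with XDMFFile(MPI.COMM_WORLD, './mesh/facets.xdmf', 'r') as xdmf:
    # Facet-to-cell connectivity must exist before reading facet meshtags
    mesh.topology.create_connectivity(mesh.topology.dim - 1, mesh.topology.dim)
    ft = xdmf.read_meshtags(mesh, name='Grid')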

Any insight on this would be much appreciated!

Bumping this to see if anyone has any insights. I've also found a mesh that works with 1 process but fails with 264 processes. Maybe it's to do with the extra degrees of freedom from the ghost values? I don't think it's a memory issue, though, since I have 3.5 TB of memory shared across the nodes and the reported memory utilisation is often very low.
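
In case it helps narrow things down, one test of the ghost idea would be to read the mesh with ghosting disabled and see whether the failure goes away (a sketch; I haven't confirmed that this changes the behaviour):

# Sketch: read the mesh without shared-facet ghosting to test the ghost idea.
from mpi4py import MPI
from dolfinx.io import XDMFFile
from dolfinx.mesh import GhostMode

with XDMFFile(MPI.COMM_WORLD, './mesh/AoA0_v2naca0012_AR2.xdmf', 'r') as xdmf:
    # Default is GhostMode.shared_facet; GhostMode.none gives no ghost cells
    mesh = xdmf.read_mesh(name='Grid', ghost_mode=GhostMode.none)
    ct = xdmf.read_meshtags(mesh, name='Grid')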