Weird HDF5 issue in FEniCS container

Hi everyone,
I’m using a FEniCS container, built from quay.io/fenicsproject/stable:latest and run with SingularityCE 3.9.8, and I have a weird issue with XDMFFile.

Consider the following script:

# test.py
import fenics
import sys

# Unit square mesh and a P1 Lagrange function space on it.
mesh = fenics.UnitSquareMesh(250, 250)

V = fenics.FunctionSpace(mesh, "CG", 1)

# A constant function, just to have something to write.
function = fenics.interpolate(fenics.Constant(1.), V)

xdmf_file = fenics.XDMFFile("saved_sim/test.xdmf")

# Write the same function 100 times as a time series.
for i in range(100):
    fenics.MPI.comm_world.Barrier()
    xdmf_file.write(function, i)
    if fenics.MPI.comm_world.Get_rank() == 0:
        print(f"Step {i} done")
        sys.stdout.flush()

If I execute this script in parallel with mpirun, from time to time I get this:

HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) MPI-process 5:
  #000: ../../../src/H5Dio.c line 268 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: ../../../src/H5Dio.c line 344 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: ../../../src/H5Dio.c line 788 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: ../../../src/H5Dmpio.c line 529 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #004: ../../../src/H5Dmpio.c line 1400 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: ../../../src/H5Dmpio.c line 1444 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #006: ../../../src/H5Dmpio.c line 297 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #007: ../../../src/H5Fio.c line 196 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #008: ../../../src/H5Faccum.c line 827 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #009: ../../../src/H5FDint.c line 285 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #010: ../../../src/H5FDmpio.c line 1789 in H5FD_mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #011: ../../../src/H5FDmpio.c line 1789 in H5FD_mpio_write(): Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(71): Other I/O error Input/output error
    major: Internal error (too specific to document in detail)
    minor: MPI Error String

Executing test.py multiple times, I estimated that the write call fails about 1% of the time.

What makes this even weirder is that I have used FEniCS many times and never had this issue before.

Does anybody have a clue on how to fix this?

Thank you in advance.

For what it’s worth, your test code runs cleanly for me (on Debian unstable, not in a container).

I wonder if it might be a library version mismatch. Were any of dolfin, hdf5, or mpi (openmpi or mpich?) recently updated in your container? If so, perhaps rebuilding will help. My hdf5 is 1.10.7, dolfin is the latest head from the Bitbucket repo (Debian build), and openmpi is 4.1.3.
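
If it helps, something like the small script below should show which versions are actually loaded inside the container. Treat it as a sketch: it assumes h5py and mpi4py are installed in the image (I believe they ship with the fenicsproject containers), and h5py might in principle be linked against a different hdf5 than dolfin.

# check_versions.py -- rough sketch; assumes h5py and mpi4py are present in the container
import dolfin
import h5py
from mpi4py import MPI

print("dolfin :", dolfin.__version__)
print("h5py   :", h5py.version.version)
print("hdf5   :", h5py.version.hdf5_version)      # hdf5 that h5py links against
print("MPI lib:", MPI.Get_library_version().strip())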


Dear all,
Going deeper into the issue and talking with my system administrator, I found out that the problem was not related to FEniCS, its Docker container, or Singularity.

Thank you very much @dparsons for spending some of your time trying to help me. In the end, it was just an issue related to my specific system.

For future reference: the issue is simply that the directory I was working in is mounted remotely (over NFS, as the ADIOI_NFS_WRITECONTIG line in the error suggests), so the connection is not always fast enough for the collective MPI write to complete. Indeed, I could not reproduce the error on any other computer, and working in a local folder solved the issue.
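
In case it is useful to anyone, this is roughly the workaround I ended up with: write to a node-local directory and copy the results back afterwards. The paths below are placeholders, and it assumes all MPI ranks run on the same machine, so they all see the same local /tmp.

# workaround sketch: write locally, copy to the remote mount afterwards
# (paths are placeholders; assumes all ranks share the same local /tmp)
import os
import shutil

import fenics

mesh = fenics.UnitSquareMesh(250, 250)
V = fenics.FunctionSpace(mesh, "CG", 1)
function = fenics.interpolate(fenics.Constant(1.), V)

local_dir = "/tmp/saved_sim"    # node-local scratch, not the remote mount
remote_dir = "saved_sim"        # the remotely mounted working directory
os.makedirs(local_dir, exist_ok=True)

xdmf_file = fenics.XDMFFile(os.path.join(local_dir, "test.xdmf"))
for i in range(100):
    xdmf_file.write(function, i)
xdmf_file.close()

# Only rank 0 copies the results (test.xdmf + test.h5) back to the remote mount.
fenics.MPI.comm_world.Barrier()
if fenics.MPI.comm_world.Get_rank() == 0:
    os.makedirs(remote_dir, exist_ok=True)
    for name in ("test.xdmf", "test.h5"):
        shutil.copy(os.path.join(local_dir, name), remote_dir)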
