Version-triggered (FEniCSx 0.8.0) PETSc error

Hi, everyone! I updated my FEniCSx installation on Ubuntu 24.04 LTS from v0.7.0 to v0.8.0. When I run the following parallel test case:

import numpy as np
from scipy.spatial.distance import cdist
from multiprocessing import Pool, Manager
import dolfinx
from mpi4py import MPI

def calculate_distance_batch(args):
    """
    Compute Euclidean distances from a batch of points to all points.
    Args:
        args (tuple): Tuple of batch points, all points, and a shared list to store results.
    Returns:
        None
    """
    try:
        batch_points, all_points, shared_list = args
        distances = cdist(batch_points, all_points, metric='euclidean')
        shared_list.append(distances.tolist())
    except Exception as e:
        print(f"Error in calculate_distance_batch: {e}")

def calculate_distances(radius, points, batch_size, num_processes) -> tuple:
    """
    Compute pairwise Euclidean distances between all points using multiprocessing.
    Args:
        radius (float): filter radius
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    Returns:
        (ndarray, ndarray), shape1=(elem_num, elem_num), shape2=(elem_num, )
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Batch calculation of Euclidean distances
        pool.map(calculate_distance_batch,
                [(points[i * batch_size:min((i + 1) * batch_size, num_points)], points, shared_list) for i in
                 range(num_batches)])
        # Merge result matrix
        distance_matrix = np.concatenate(shared_list)
    distance_matrix = np.maximum(radius - distance_matrix, 0)
    distance_sum = distance_matrix.sum(1)
    return distance_matrix, distance_sum
# the test case
mesh = dolfinx.mesh.create_unit_cube(MPI.COMM_WORLD, 2, 2, 2, cell_type=dolfinx.mesh.CellType.hexahedron)
mid_points = dolfinx.mesh.compute_midpoints(mesh=mesh, dim=3, entities=np.arange(8, dtype=np.int32))
H, H_sum = calculate_distances(radius=0.75, points=mid_points, batch_size=int(mid_points.shape[0] / 2),
                               num_processes=2)

it outputs the following error:

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.

However, the same case runs well on FEniCSx 0.7.0 without the above errors.

This may be difficult to debug, but here are the first things to try (after running apt-get update):

  • full dist-upgrade (not just upgrade) to ensure all required libraries are up-to-date
    apt-get dist-upgrade
  • clean out the JIT cache
    rm -rf ~/.cache/fenics/

If that doesn’t help, try completely purging all fenics packages and reinstalling afresh

dpkg -P fenicsx python3-dolfinx python3-dolfinx-real libdolfinx-dev libdolfinx-real-dev libdolfinx-real0.7 libdolfinx-real0.8 python3-ffcx python3-ufl python3-basix libbasix-dev libbasix0.7 libbasix0.8

Check carefully; this might not be an exhaustive list, so purge other packages as needed. Also make sure there are no accidental local installations (~/.local/lib/python3.*/site-packages/*dolfinx* etc.).
Then

apt-get install fenicsx

Check that it pulls in the new v0.8 that you want.
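
A quick way to confirm which version actually got installed is to print it from Python:

import dolfinx

# Should report 0.8.x after the upgrade
print(dolfinx.__version__)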

For what it’s worth, I can reproduce the PETSc error from your code sample anyway, which suggests the bug is probably in the code, not the packages. Some of the DOLFINx API changed in 0.8. If you can activate gdb (“Try option -start_in_debugger”), you might get a backtrace that gives clues about where the problem is.
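
If editing the launch command is awkward, one possible way to pass that option is the PETSC_OPTIONS environment variable. A minimal sketch, assuming PETSc has not been initialized yet at that point and reads PETSC_OPTIONS when it initializes:

import os

# Assumption: PETSc is initialized later (e.g. as a side effect of importing
# dolfinx below) and picks up PETSC_OPTIONS at that point.
os.environ["PETSC_OPTIONS"] = "-start_in_debugger gdb"

import dolfinx
from mpi4py import MPI

# ... rest of the failing script ...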

This seems to be a PETSc issue: if you call PETSc._finalize() prior to calling your function, the code runs with no issue:

import numpy as np

from mpi4py import MPI
from petsc4py import PETSc

import dolfinx
from scipy.spatial.distance import cdist
from multiprocessing import Pool, Manager


def calculate_distance_batch(args):
    """
    Compute Euclidean distances from a batch of points to all points.
    Args:
        args (tuple): Tuple of batch points, all points, and a shared list to store results.
    Returns:
        None
    """
    try:
        batch_points, all_points, shared_list = args
        distances = cdist(batch_points, all_points, metric="euclidean")
        shared_list.append(distances.tolist())
    except Exception as e:
        print(f"Error in calculate_distance_batch: {e}")


def calculate_distances(radius, points, batch_size, num_processes) -> tuple:
    """
    Compute pairwise Euclidean distances between all points using multiprocessing.
    Args:
        radius (float): filter radius
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    Returns:
        (ndarray, ndarray), shape1=(elem_num, elem_num), shape2=(elem_num, )
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Batch calculation of Euclidean distances
        pool.map(
            calculate_distance_batch,
            [
                (
                    points[i * batch_size : min((i + 1) * batch_size, num_points)],
                    points,
                    shared_list,
                )
                for i in range(num_batches)
            ],
        )
        # Merge result matrix
        distance_matrix = np.concatenate(shared_list)
    distance_matrix = np.maximum(radius - distance_matrix, 0)
    distance_sum = distance_matrix.sum(1)
    return distance_matrix, distance_sum


# the test case
mesh = dolfinx.mesh.create_unit_cube(
    MPI.COMM_WORLD, 2, 2, 2, cell_type=dolfinx.mesh.CellType.hexahedron
)
mid_points = dolfinx.mesh.compute_midpoints(
    mesh=mesh, dim=3, entities=np.arange(8, dtype=np.int32)
)
PETSc._finalize()

H, H_sum = calculate_distances(
    radius=0.75,
    points=mid_points,
    batch_size=int(mid_points.shape[0] / 2),
    num_processes=2,
)

This can be reproduced with the following (without DOLFINx):

import numpy as np

from mpi4py import MPI
import sys
import petsc4py

petsc4py.init(sys.argv, comm=MPI.COMM_WORLD)

from multiprocessing import Pool, Manager


def calculate_distance_batch(args):
    pass


def calculate_distances(points, batch_size, num_processes):
    """
    Stripped-down version of calculate_distances: it only drives the
    multiprocessing pool and does not compute or return anything.
    Args:
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Dispatch the batches to the pool (the worker is a no-op in this reproducer)
        pool.map(
            calculate_distance_batch,
            [
                (
                    points[i * batch_size : min((i + 1) * batch_size, num_points)],
                    points,
                    shared_list,
                )
                for i in range(num_batches)
            ],
        )


mid_points = np.array(
    [
        [0.25, 0.25, 0.25],
        [0.75, 0.25, 0.25],
        [0.25, 0.75, 0.25],
        [0.25, 0.25, 0.75],
        [0.75, 0.75, 0.25],
        [0.75, 0.25, 0.75],
        [0.25, 0.75, 0.75],
        [0.75, 0.75, 0.75],
    ]
)
# Uncommenting the next line removes the error:
# petsc4py.PETSc._finalize()
a = calculate_distances(
    points=mid_points,
    batch_size=int(mid_points.shape[0] / 2),
    num_processes=2,
)

Adding petsc4py.PETSc._finalize() before calling calculate_distances makes the error go away.

I am using FEniCSx 0.9.0, and the same issue occurs randomly, so whether the program runs at all is entirely down to luck.

I don’t know the reason, but adding the following code at the beginning of the program resolves the issue. It seems to make Python ignore SIGPIPE signals:

import signal

signal.signal(signal.SIGPIPE, signal.SIG_IGN)
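
For reference, a minimal sketch of where this goes in the original test case, following the “beginning of the program” placement described above:

import signal

# Ignore SIGPIPE (broken pipe) at the very start of the program, as described above
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

import numpy as np
import dolfinx
from mpi4py import MPI

# ... the rest of the original test case follows unchanged ...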