Version-triggered (FEniCSx 0.8.0) PETSc error

Hi, everyone! I updated my FEniCSx installation on Ubuntu 24.04 LTS from v0.7.0 to v0.8.0. When I run the following parallel test case:

import numpy as np
from scipy.spatial.distance import cdist
from multiprocessing import Pool, Manager
import dolfinx
from mpi4py import MPI

def calculate_distance_batch(args):
    """
    Compute Euclidean distances from a batch of points to all points.
    Args:
        args (tuple): Tuple of batch points, all points, and a shared list to store results.
    Returns:
        None
    """
    try:
        batch_points, all_points, shared_list = args
        distances = cdist(batch_points, all_points, metric='euclidean')
        shared_list.append(distances.tolist())
    except Exception as e:
        print(f"Error in calculate_distance_batch: {e}")

def calculate_distances(radius, points, batch_size, num_processes) -> tuple:
    """
    Compute pairwise Euclidean distances between all points using multiprocessing.
    Args:
        radius (float): filter radius
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    Returns:
        (ndarray, ndarray), shape1=(elem_num, elem_num), shape2=(elem_num, )
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Batch calculation of Euclidean distances
        pool.map(calculate_distance_batch,
                [(points[i * batch_size:min((i + 1) * batch_size, num_points)], points, shared_list) for i in
                 range(num_batches)])
        # Merge result matrix
        distance_matrix = np.concatenate(shared_list)
    distance_matrix = np.maximum(radius - distance_matrix, 0)
    distance_sum = distance_matrix.sum(1)
    return distance_matrix, distance_sum
# the test case
mesh = dolfinx.mesh.create_unit_cube(MPI.COMM_WORLD, 2, 2, 2, cell_type=dolfinx.mesh.CellType.hexahedron)
mid_points = dolfinx.mesh.compute_midpoints(mesh=mesh, dim=3, entities=np.arange(8, dtype=np.int32))
H, H_sum = calculate_distances(radius=0.75, points=mid_points, batch_size=int(mid_points.shape[0] / 2),
                               num_processes=2)

it outputs the following error:

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.

However, the same case runs well on FEniCSx 0.7.0 without the above errors.

This may be difficult to debug, but here are the first things to try (after running apt-get update):

  • full dist-upgrade (not just upgrade) to ensure all required libraries are up-to-date
    apt-get dist-upgrade
  • clean out the JIT cache
    rm -rf ~/.cache/fenics/

If that doesn’t help, try completely purging all fenics packages and reinstalling afresh

dpkg -P fenicsx python3-dolfinx python3-dolfinx-real libdolfinx-dev libdolfinx-real-dev libdolfinx-real0.7 libdolfinx-real0.8 python3-ffcx python3-ufl python3-basix libbasix-dev libbasix0.7 libbasix0.8

Check carefully; this might not be an exhaustive list, so purge other packages as needed. Also make sure there are no accidental local installations (~/.local/lib/python3.*/site-packages/*dolfinx* etc.).
Then

apt-get install fenicsx

Check that it pulls in the new v0.8 that you want.
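
A quick way to confirm which version actually got installed is to print it from Python:

import dolfinx

# Should report 0.8.x after the upgrade
print(dolfinx.__version__)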

For what it’s worth, I can reproduce the PETSc error from your code sample anyway, which suggests the bug is probably in the code, not the packages. Some of the DOLFINx API changed in 0.8. If you can activate gdb (“Try option -start_in_debugger”), you might get a backtrace that gives clues about where the problem is.
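
If editing the launch command is awkward, one possible way to pass that option is the PETSC_OPTIONS environment variable. A minimal sketch, assuming PETSc has not been initialized yet at that point and reads PETSC_OPTIONS when it initializes:

import os

# Assumption: PETSc is initialized later (e.g. as a side effect of importing
# dolfinx below) and picks up PETSC_OPTIONS at that point.
os.environ["PETSC_OPTIONS"] = "-start_in_debugger gdb"

import dolfinx
from mpi4py import MPI

# ... rest of the failing script ...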

This seems to be a PETSc issue: if you call PETSc._finalize() prior to calling your function, the code runs with no issue:

import numpy as np

from mpi4py import MPI
from petsc4py import PETSc

import dolfinx
from scipy.spatial.distance import cdist
from multiprocessing import Pool, Manager


def calculate_distance_batch(args):
    """
    Compute Euclidean distances from a batch of points to all points.
    Args:
        args (tuple): Tuple of batch points, all points, and a shared list to store results.
    Returns:
        None
    """
    try:
        batch_points, all_points, shared_list = args
        distances = cdist(batch_points, all_points, metric="euclidean")
        shared_list.append(distances.tolist())
    except Exception as e:
        print(f"Error in calculate_distance_batch: {e}")


def calculate_distances(radius, points, batch_size, num_processes) -> tuple:
    """
    Compute pairwise Euclidean distances between all points using multiprocessing.
    Args:
        radius (float): filter radius
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    Returns:
        (ndarray, ndarray), shape1=(elem_num, elem_num), shape2=(elem_num, )
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Batch calculation of Euclidean distances
        pool.map(
            calculate_distance_batch,
            [
                (
                    points[i * batch_size : min((i + 1) * batch_size, num_points)],
                    points,
                    shared_list,
                )
                for i in range(num_batches)
            ],
        )
        # Merge result matrix
        distance_matrix = np.concatenate(shared_list)
    distance_matrix = np.maximum(radius - distance_matrix, 0)
    distance_sum = distance_matrix.sum(1)
    return distance_matrix, distance_sum


# the test case
mesh = dolfinx.mesh.create_unit_cube(
    MPI.COMM_WORLD, 2, 2, 2, cell_type=dolfinx.mesh.CellType.hexahedron
)
mid_points = dolfinx.mesh.compute_midpoints(
    mesh=mesh, dim=3, entities=np.arange(8, dtype=np.int32)
)
PETSc._finalize()

H, H_sum = calculate_distances(
    radius=0.75,
    points=mid_points,
    batch_size=int(mid_points.shape[0] / 2),
    num_processes=2,
)

This can be reproduced with the following (without DOLFINx):

import numpy as np

from mpi4py import MPI
import sys
import petsc4py

petsc4py.init(sys.argv, comm=MPI.COMM_WORLD)

from multiprocessing import Pool, Manager


def calculate_distance_batch(args):
    pass


def calculate_distances(points, batch_size, num_processes):
    """
    Stripped-down version of calculate_distances: it only drives the
    multiprocessing pool and does not compute or return anything.
    Args:
        points (Sequence): coordinates of points
        batch_size (int): number of points per batch
        num_processes (int): number of processes
    """
    num_points = len(points)
    num_batches = int(np.ceil(num_points / batch_size))
    # Create a process pool and a manager for the shared result list
    with Pool(processes=num_processes) as pool, Manager() as manager:
        shared_list = manager.list()
        # Dispatch the batches to the pool (the worker is a no-op in this reproducer)
        pool.map(
            calculate_distance_batch,
            [
                (
                    points[i * batch_size : min((i + 1) * batch_size, num_points)],
                    points,
                    shared_list,
                )
                for i in range(num_batches)
            ],
        )


mid_points = np.array(
    [
        [0.25, 0.25, 0.25],
        [0.75, 0.25, 0.25],
        [0.25, 0.75, 0.25],
        [0.25, 0.25, 0.75],
        [0.75, 0.75, 0.25],
        [0.75, 0.25, 0.75],
        [0.25, 0.75, 0.75],
        [0.75, 0.75, 0.75],
    ]
)
# Uncommenting the next line removes the error:
# petsc4py.PETSc._finalize()
a = calculate_distances(
    points=mid_points,
    batch_size=int(mid_points.shape[0] / 2),
    num_processes=2,
)

Adding petsc4py.PETSc._finalize() before calling calculate_distances makes the error go away.

I am using FEniCSx 0.9.0, and the same issue occurs randomly, so whether the program runs at all is entirely down to luck.

I don’t know the reason, but adding the following code at the beginning of the program resolves the issue. It seems to make Python ignore SIGPIPE signals:

import signal

signal.signal(signal.SIGPIPE, signal.SIG_IGN)
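
For reference, a minimal sketch of where this goes in the original test case, following the “beginning of the program” placement described above:

import signal

# Ignore SIGPIPE (broken pipe) at the very start of the program, as described above
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

import numpy as np
import dolfinx
from mpi4py import MPI

# ... the rest of the original test case follows unchanged ...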