PETSC ERROR: Caught signal number 4 Illegal instruction: Likely due to memory corruption

Hi,

I am using fenicsx in Python, with Anaconda as the package manager. After getting my code to work I have been running it on a university cluster, with which I have had success until now. To avoid discrepancies I use exactly the same code and install instructions (repositories etc.) on the cluster as on my home computer. I have no errors when running the code on my computer, and until now I have had no errors running it on the cluster either.

However, I am now getting this error, fairly sporadically, when trying to run my code on the cluster:

[0]PETSC ERROR: Caught signal number 4 Illegal instruction: Likely due to memory corruption
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I know that this is a vague query (replicating the error on the cluster is somewhat clunky), and I apologise for that, but I was wondering whether there are some obvious possibilities as to how/why this error is appearing, and how I could get my code to run more reliably on the cluster?

Thanks,

R.

For what it’s worth, I tried replicating the error systematically but cannot do so.

I am submitting jobs via a batch protocol, and some of them fail with this error. I also have the ability to run a job interactively, which I have done for a case where the simulation failed almost immediately as a batch job. In that case the job does not fail or produce this error, and the interactive session simply runs until it hits its wall time.

Does anybody have any insight into how a job scheduler might be leading to this kind of problem? The cluster uses PBS.

Again, apologies for the vagueness of this query.

Without a minimal reproducible example, it is hard for others to answer your question.

From my side:

As you are using the same code but the error occurs on a different machine, have you checked that the versions of DOLFIN or DOLFINx you are using in the different places are the same? To check this, you can do the following on each of your machines:

import dolfinx
print(dolfinx.__version__)

Note that it is recommended to use conda or Docker to install FEniCS or FEniCSx.

This is a typical PETSc error, especially when you are working with self-cooked matrices (PETSc.Mat, e.g. via assemble_matrix) and vectors (PETSc.Vec, e.g. via assemble_vector). Try to narrow your search down to any PETSc-related part of the code and make a minimal reproducible example.

Your error is independent of the batch scheduler (PBS/Slurm, etc.). You can reproduce this entirely on your Linux computer.

Yeah, I know, it’s difficult. Sorry, but thanks for your effort here.

To be clear, I have installed both with conda, and both have the same version number, fenicsx 0.7.3.

Further, I have no such interactions with PETSc anywhere in my code; it is only fenicsx calls.

I have been playing about further and I am starting to suspect that it is going wrong when I have a certain number of instances running at the same time.

To evidence this, I tried breaking my batch job (of, say, 100 runs) into small chunks (say, groups of 5). I would progressively run each chunk with no problem, and then all of a sudden it would reach a point where this error was thrown.

If I then go into the interactive session (with the previous chunks still running), any set of parameters causes the issue.

I’ve never come across something like this before. Does Python, or something else used by fenicsx/petsc, plausibly have some kind of “maximum number of copies of the same script running at once” issue/limitation?

I am not sure whether this is related to a cache issue; see if the following post helps:

I used FEniCSx and once submitted 100 jobs simultaneously on an HPC system. The result was that each job ran much more slowly than a single submission would, and if I gradually killed jobs, the remaining ones gradually sped up. This is because the HPC system uses a shared hard drive (the HOME folder), and the default JIT cache is stored in the home folder, so it is shared by all running jobs. If the cache is redirected to a unique location for each job, it has no negative influence on running speed. But I am not sure whether this also alleviates the suspected memory issue.
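
To make that concrete, below is a minimal sketch of how a per-job cache location could be passed to the form compilation. The directory naming from PBS_ARRAY_INDEX is only an illustration, and it assumes that a "cache_dir" entry in jit_options is forwarded to the FFCx JIT cache in your DOLFINx version:

import os
from pathlib import Path

# Hypothetical per-job cache directory, derived from the PBS array index that
# the submission script passes to each job (falls back to "0" if unset).
job_id = os.environ.get("PBS_ARRAY_INDEX", "0")
cache_dir = Path.home() / ".cache" / f"fenics-job-{job_id}"
cache_dir.mkdir(parents=True, exist_ok=True)

# Assumption: "cache_dir" is forwarded by DOLFINx to the FFCx JIT compiler.
# It would be merged into the jit_options already used in the script, e.g.
#   F_compiled = dolfinx.fem.form(F, jit_options={"cache_dir": str(cache_dir)})
jit_options = {"cache_dir": str(cache_dir)}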

I also know these kinds of PETSc errors; they do not occur at the same point every time, but somewhat randomly and system-dependently. In my experience it has almost always been a case of wrongly initialized PETSc objects, or of PETSc objects that are not destroyed correctly and leak.
I guess it would help to see the code to get any further insight.

I guess I’m just confused as to how that would make it depend on how many instances are already running? This seems very strange to me.

I will try to produce a minimal working example to see what’s going on. I emphasise that what I am doing is really vanilla fenicsx: it uses the nonlinear solver, but I make no manual calls to, and define no options for, anything to do with PETSc.

But in the meantime, some more info:

Running the interactive job (whilst the batches are running) with 16x as much memory as the batch processes use (which was already comfortably enough), and with a tiny mesh, still leads to the error. So I don’t think there is a memory leak locally, although I am not sure exactly what I mean by “local” here, because this dependence on other running processes is novel/weird to me. What kind of architecture would do this? Why would it be written this way?

Also, it is failing on the very first call to .solve, so there doesn’t appear to be much opportunity to build up a leak to the point of failure.

As even more of a smoking gun for the “multiple processes at once” error:

When I stop all the batch jobs, then it runs for all parameters just fine.

At this point I’m reasonably convinced it has something to do with how many independent instances of the script are being run simultaneously.

I tried the json file suggestion above, but can’t see any evidence that the cache location that I am providing is being used (the folder is empty).

Without having an example, there is no way of giving you further guidance.

Ok I have a reasonably minimal example, below.

Key facts:

  1. I can replicate the error only when I include the non-linear advection terms (their lines are noted in comments).
  2. The original behaviour still persists: I can run the code just fine, with or without the non-linear terms, for some given number of runs, but running them in a big batch causes the error. Specifically, I currently have two batches of around 200 runs each running; adding a further batch of over ~60 now causes the error.
  3. It uses the nonlinear_assembly code that is part of dolfinx_mpc.

Code:

from mpi4py import MPI
from dolfinx.fem import Function, functionspace
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver
from dolfinx.mesh import (create_interval, locate_entities_boundary, meshtags)
from basix.ufl import (element, mixed_element)
import numpy as np
import math
from ufl import (TestFunctions, dx, inner, split, derivative, ln)
import dolfinx  # imported explicitly: dolfinx.fem.Constant and dolfinx.cpp.mesh.GhostMode are used below
import dolfinx_mpc
from nonlinear_assembly import *
from pathlib import Path
import sys
import time
import os

comm = MPI.COMM_WORLD

########################################
########## ARGUMENTS + PATH ############
########################################

id_str = sys.argv[1]
dirStr = "data/" +id_str + "/"
os.makedirs(os.path.dirname(dirStr), exist_ok = True)

########################################
############## OPEN FILES ##############
########################################

if comm.rank == 0:  
    outfileStr = dirStr + "out.dat"
    outfile = open(outfileStr, "w")

########################################
############### CONSTANTS ##############
########################################

comm = MPI.COMM_WORLD

mesh_size = 1000
dt = 0.01
T = 10000
loop_count = 0
output_step =  int(1 / dt) # number of timesteps per output interval

# domain
x_left = 0.0
x_right = 100.0

########################################
########## INITIAL CONDITIONS ##########
########################################

def f1_init(x):
    values = np.zeros((1, x.shape[1]))
    for i,v in enumerate(x[0]):
        values[0][i] = 0.5 + 0.1 * np.sin(x[0][i] * 2 * np.pi / 100)
    return values

def f2_init(x):
    values = np.zeros((1, x.shape[1]))
    for i,v in enumerate(x[0]):
        values[0][i] = 0.1 * np.sin(x[0][i] * 2 * np.pi / 100)
    return values

########################################
########## BOUNDARY FUNCTIONS ##########
########################################

def PeriodicBoundary(x):
    return np.isclose(x[0], x_right)

def periodic_relation(x):
    out_x = np.zeros(x.shape)
    out_x[0] = x_right - x[0]
    return out_x

########################################
######## FUNCTION SPACE OBJECTS ########
########################################

# Create mesh and define function space

msh = create_interval(comm, points=(x_left, x_right), nx=mesh_size, ghost_mode=dolfinx.cpp.mesh.GhostMode.shared_facet)
P1 = element("Lagrange", msh.basix_cell(),1)
V = functionspace(msh, mixed_element([P1, P1]))
V_0, _ = V.sub(0).collapse()
num_points_local = V_0.dofmap.index_map.size_local * V_0.dofmap.index_map_bs
bcs = [] # no other boundary conditions

facets = locate_entities_boundary(msh, msh.topology.dim - 1, PeriodicBoundary)
arg_sort = np.argsort(facets)
mt = meshtags(msh, msh.topology.dim - 1, facets[arg_sort], np.full(len(facets), 2, dtype = np.int32))
mpc = dolfinx_mpc.MultiPointConstraint(V)
mpc.create_periodic_constraint_topological(V.sub(0), mt, 2, periodic_relation, bcs, 1) # constraint on f1
mpc.create_periodic_constraint_topological(V.sub(1), mt, 2, periodic_relation, bcs, 1) # constraint on psi
mpc.finalize()

u = Function(V)  # current solution
u_prev = Function(V)  # solution from previous converged step

f1, f2 = split(u)
f1_prev, f2_prev = split(u_prev)

xvals = sorted(msh.geometry.x[:,0])
dx_mesh = xvals[1] - xvals[0]

# all non-autonomous variables must be declared as constant on the mesh to avoid recompilation
t = dolfinx.fem.Constant(msh, 0.0)
del_t = dolfinx.fem.Constant(msh, dt)

########################################
########## INITIAL CONDITIONS ##########
########################################

u.sub(0).interpolate(f1_init)
u.sub(1).interpolate(f2_init)

u.x.scatter_forward()    

########################################
########### WEAK FORMULATION ###########
########################################

theta = 0.5

qr, qa = TestFunctions(V)

f1_mid = (1.0 - theta) * f1_prev + theta * f1
f2_mid = (1.0 - theta) * f2_prev + theta * f2
nonlin_f2_mid = (1.0 - theta) * f1_prev * (1 - f1_prev) + theta * f1 * (1 - f1)
nonlin_f1_mid = (1.0 - theta) * f2_prev * (1 - f1_prev) + theta * f2 * (1 - f1)

F = \
( \
# time derivative
inner (f1 - f1_prev, qr) / dt \
+ inner (f2 - f2_prev, qa) / dt \
# diffusion terms
+ inner ((f1_mid).dx(0), (qr).dx(0)) \
+ inner ((f2_mid).dx(0), (qa).dx(0)) \
# advection terms
+ 10.0 * inner ((nonlin_f1_mid).dx(0), qr) \
+ 10.0 * inner ((nonlin_f2_mid).dx(0), qa) \
#
) * dx

J = derivative (F, u)

jit_options = {"cffi_extra_compile_args": ["-Ofast", "-march=native"], "cffi_libraries": ["m"]}

F_compiled = dolfinx.fem.form(F, jit_options=jit_options)
J_compiled = dolfinx.fem.form(J, jit_options=jit_options)

problem = NonlinearMPCProblem(F_compiled, u, mpc, bcs = bcs, J = J_compiled, jit_options = jit_options)
solver = NewtonSolverMPC(msh.comm, problem, mpc)

solver.rtol = 1e-7 

u_prev.x.array[:] = u.x.array

########################################
########## INTEGRATE/TIME LOOP #########
########################################

while (t.value < T):
    
    t.value += del_t.value
    loop_count += 1
    r = solver.solve(u)

    if comm.rank == 0:
        if (loop_count % output_step == 0):   
            print('on time ' + str(t.value), file = outfile, flush = True)
        
    u_prev.x.array[:] = u.x.array

########################################
################ TIDY UP ###############
########################################

if comm.rank == 0:    
    outfile.close()

What does your submission script look like?

You also use DOLFINx-MPC, which I’m the maintainer of. How did you install it on top of DOLFINx, and what is in your nonlinear_assembly file, as you haven’t provided it?

Sure,

My submission script looks like this:

#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe
#PBS -J 1-200

cd ${PBS_O_WORKDIR}
module load openmpi/4.1.4
source /home/xxxx/miniconda3/bin/activate
conda activate fenicsx-env
mpirun -np 1 python3 ./minimal.py $PBS_ARRAY_INDEX

I installed dolfinx_mpc through conda, in the same manner as I installed dolfinx. I believe I installed both at the same time with the commands

conda install -c conda-forge fenics-dolfinx mpich
conda install -c conda-forge dolfinx_mpc

And the contents of the nonlinear_assembly file are verbatim from the file I was linked to in another thread. Its contents read:

# Copyright (C) 2022 Nathan Sime
#
# This file is part of DOLFINX_MPC
#
# SPDX-License-Identifier:    MIT
from __future__ import annotations

from mpi4py import MPI
from petsc4py import PETSc

import dolfinx
import dolfinx.fem.petsc
import dolfinx.la as _la
import dolfinx.nls.petsc
import numpy as np

import dolfinx_mpc


class NonlinearMPCProblem(dolfinx.fem.petsc.NonlinearProblem):

    def __init__(self, F, u, mpc, bcs=[], J=None, form_compiler_options={},
                 jit_options={}):
        self.mpc = mpc
        super().__init__(F, u, bcs=bcs, J=J,
                         form_compiler_options=form_compiler_options,
                         jit_options=jit_options)

    def F(self, x: PETSc.Vec, F: PETSc.Vec):  # type: ignore
        with F.localForm() as F_local:
            F_local.set(0.0)
        dolfinx_mpc.assemble_vector(self._L, self.mpc, b=F)

        # Apply boundary condition
        dolfinx_mpc.apply_lifting(F, [self._a], bcs=[self.bcs],
                                  constraint=self.mpc, x0=[x], scale=dolfinx.default_scalar_type(-1.0))
        F.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)  # type: ignore
        dolfinx.fem.petsc.set_bc(F, self.bcs, x, -1.0)

    def J(self, x: PETSc.Vec, A: PETSc.Mat):  # type: ignore
        A.zeroEntries()
        dolfinx_mpc.assemble_matrix(self._a, self.mpc, bcs=self.bcs, A=A)
        A.assemble()


class NewtonSolverMPC(dolfinx.cpp.nls.petsc.NewtonSolver):
    def __init__(self, comm: MPI.Intracomm, problem: NonlinearMPCProblem,
                 mpc: dolfinx_mpc.MultiPointConstraint):
        """A Newton solver for non-linear MPC problems."""
        super().__init__(comm)
        self.mpc = mpc
        self.u_mpc = dolfinx.fem.Function(mpc.function_space)

        # Create matrix and vector to be used for assembly of the non-linear
        # MPC problem
        self._A = dolfinx_mpc.cpp.mpc.create_matrix(
            problem.a._cpp_object, mpc._cpp_object)
        self._b = _la.create_petsc_vector(
            mpc.function_space.dofmap.index_map,
            mpc.function_space.dofmap.index_map_bs)

        self.setF(problem.F, self._b)
        self.setJ(problem.J, self._A)
        self.set_form(problem.form)
        self.set_update(self.update)

    def update(self, solver: dolfinx.nls.petsc.NewtonSolver,
               dx: PETSc.Vec, x: PETSc.Vec):  # type: ignore
        # We need to use a vector created on the MPC's space to update ghosts
        self.u_mpc.vector.array = x.array_r
        self.u_mpc.vector.axpy(-1.0, dx)
        self.u_mpc.vector.ghostUpdate(addv=PETSc.InsertMode.INSERT,  # type: ignore
                                      mode=PETSc.ScatterMode.FORWARD)  # type: ignore
        self.mpc.homogenize(self.u_mpc)
        self.mpc.backsubstitution(self.u_mpc)
        x.array = self.u_mpc.vector.array_r
        x.ghostUpdate(addv=PETSc.InsertMode.INSERT,  # type: ignore
                      mode=PETSc.ScatterMode.FORWARD)  # type: ignore

    def solve(self, u: dolfinx.fem.Function):
        """Solve non-linear problem into function u. Returns the number
        of iterations and if the solver converged."""
        n, converged = super().solve(u.vector)
        u.x.scatter_forward()
        return n, converged

    @property
    def A(self) -> PETSc.Mat:  # type: ignore
        """Jacobian matrix"""
        return self._A

    @property
    def b(self) -> PETSc.Vec:  # type: ignore
        """Residual vector"""
        return self._b

You are stating that you installed dolfinx with mpich, but you are using openmpi in your submission script (module load openmpi/4.1.4).
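
As a quick sanity check, the following small snippet (standard mpi4py calls, nothing specific to your setup) prints which MPI implementation the conda environment’s mpi4py is actually linked against:

from mpi4py import MPI

# Should report e.g. "Open MPI v4.1.x ..." or "MPICH Version: ...", which
# ought to match the MPI module loaded in the submission script.
print(MPI.Get_library_version())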

There might also be some destroy statements missing from the Newton solver. PETSc post-3.16 has made garbage collection quite painful.
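
To illustrate (just a sketch, not a tested fix), the PETSc matrix and vector that NewtonSolverMPC creates could be freed explicitly once time stepping has finished, rather than relying on Python’s garbage collector:

# Sketch only: `solver` is the NewtonSolverMPC instance from the example above.
solver.A.destroy()  # the PETSc.Mat created via dolfinx_mpc.cpp.mpc.create_matrix
solver.b.destroy()  # the PETSc.Vec created via dolfinx.la.create_petsc_vector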

I think I was mistaken re mpich; I was grabbing that from an old readme file which I must have copied from somewhere. The output of conda list reveals

mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.1.5           py311h4267d7f_1    conda-forge
openmpi                   4.1.6              hc5af2df_101    conda-forge

not mpich.

Understood re the Newton solver; it makes some sense given that the error arises on inclusion of the non-linear terms. It is still a slightly mysterious presentation, though.

Any possible work arounds here? Or something that I can attempt?

I’ll see if I can reproduce it (it will be early next week).

Hi, just wondering if you were able to replicate the error?