I am trying to build a Singularity container with FEniCS installed, specifically for deployment on an HPC cluster here. Although Docker images are already available, I want to configure the container with several other programs that I need in addition to FEniCS.
The complete recipe file (equivalent of a Dockerfile) is here
I am running the same commands as I would on an Ubuntu:18.04 system (I have already installed and tested FEniCS using these and it is working fine):
but when I try to import DOLFIN I get the following error:
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
[golubh1.campuscluster.illinois.edu:45463] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 582
[golubh1.campuscluster.illinois.edu:45463] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Any pointers to what could be potentially going wrong and possible fixes?
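The message above asks to "either unset the parameter, or check that the path is correct", and the agent it looks for is `ssh : rsh`, so one thing worth trying is to blank out that parameter before DOLFIN initialises MPI. Open MPI reads MCA parameters from environment variables named `OMPI_MCA_<parameter>`. The helper name `disable_rsh_agent` below is made up for illustration, and whether an empty value is accepted by the Open MPI version inside the container is an assumption. Installing an SSH client in the recipe (for example `openssh-client` on Ubuntu) is the other obvious option.

```python
import os


def disable_rsh_agent(env=None):
    """Blank out Open MPI's plm_rsh_agent in the given environment mapping.

    Hypothetical helper: it only sets an environment variable, it does not
    verify that the installed Open MPI honours an empty value.
    """
    env = os.environ if env is None else env
    # Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
    env["OMPI_MCA_plm_rsh_agent"] = ""
    return env


disable_rsh_agent()

# The error appears at import time, so the variable has to be set first
# and the import has to come afterwards:
# from dolfin import *
```

The same variable can also be exported in the `%environment` section of the recipe so that every process in the container sees it.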
Thanks for pointing it out. Singularity Hub has been archived. You can follow the build recipe from there and modify it to pull DOLFIN instead of DOLFINx. Something along these lines should work…
I used the command (singularity pull --name fenics.simg docker://quay.io/fenicsproject/dev:latest) to install FEniCS on an HPC cluster. But when I tried to run a test, I got the following error message:
ERROR: could not import mpi4py!
Traceback (most recent call last):
  File "heat_class.py", line 2, in <module>
    from fenics import *
  File "/usr/local/lib/python3.6/dist-packages/fenics/__init__.py", line 7, in <module>
    from dolfin import *
  File "/usr/local/lib/python3.6/dist-packages/dolfin/__init__.py", line 144, in <module>
    from .fem.assembling import (assemble, assemble_system, assemble_multimesh, assemble_mixed,
  File "/usr/local/lib/python3.6/dist-packages/dolfin/fem/assembling.py", line 34, in <module>
    from dolfin.fem.form import Form
  File "/usr/local/lib/python3.6/dist-packages/dolfin/fem/form.py", line 12, in <module>
    from dolfin.jit.jit import dolfin_pc, ffc_jit
  File "/usr/local/lib/python3.6/dist-packages/dolfin/jit/jit.py", line 121, in <module>
    def compile_class(cpp_data, mpi_comm=MPI.comm_world):
RuntimeError: Error when importing mpi4py
Do you have any suggestion on how to fix this?
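The traceback only reports "Error when importing mpi4py" and hides the underlying exception. A small check like the one below, run with the container's Python, may show the real cause (for example a missing shared library). The function name `try_import` is hypothetical, and this is only a diagnostic sketch: if MPI itself aborts during initialisation, the process can still exit before the exception is printed.

```python
import importlib
import traceback


def try_import(name):
    """Try to import a module and return (ok, traceback_text)."""
    try:
        importlib.import_module(name)
    except Exception:
        # Keep the full traceback so the original cause stays visible.
        return False, traceback.format_exc()
    return True, ""


ok, details = try_import("mpi4py.MPI")
print("mpi4py.MPI imported fine" if ok else details)
```

Running it as `singularity exec fenics.simg python3 check.py` (with `check.py` being whatever name the snippet is saved under) keeps the test inside the image.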
Thanks for the quick response! I tried the following command:
singularity pull --name fenics.simg docker://ghcr.io/scientificcomputing/fenics
I got the following error:
FATAL: While making image from oci registry: error fetching image to cache: failed to get checksum for docker://ghcr.io/scientificcomputing/fenics: Error reading manifest latest in ghcr.io/scientificcomputing/fenics: manifest unknown.
Then I tried the second one, singularity pull --name fenics.simg docker://numericalpdes/base_images:fenics, and the error message says:
FATAL: While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: Error initializing source oci:/home/dli292/.singularity/cache/blob:02e639b5e8e21bafdbdf6684b83ed8f924f2e32d1445eb1e77105d0771a4649b: Error writing blob: write /home/dli292/.singularity/cache/blob/oci-put-blob469916469: disk quota exceeded.
FATAL: While making image from oci registry: error fetching image to cache: failed to get checksum for docker://ghcr.io/scientificcomputing/fenics: Error reading manifest latest in ghcr.io/scientificcomputing/fenics: manifest unknown.
Maybe docker:// is expecting to pull the image from Docker Hub rather than the GitHub Container Registry? If that is the case, have a look at the Singularity documentation to see how to change the container registry.
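Another reading of the error is possible: it says "Error reading manifest latest in ghcr.io/scientificcomputing/fenics", which suggests the registry host was taken from the URI and that the tag `latest` was looked up because no tag was given. The sketch below only illustrates that defaulting rule for image references. The function `split_image_ref` is made up for this post, it does not handle digest references (`@sha256:...`), and it does not tell you which tags the image actually publishes.

```python
def split_image_ref(uri):
    """Split a docker:// URI into (repository, tag).

    A reference without an explicit tag is treated as 'latest'.
    Digest references (name@sha256:...) are not handled here.
    """
    prefix = "docker://"
    ref = uri[len(prefix):] if uri.startswith(prefix) else uri
    # Only a colon in the last path component separates a tag; a colon
    # earlier in the string would belong to a registry port.
    last_component = ref.rsplit("/", 1)[-1]
    if ":" in last_component:
        repo, tag = ref.rsplit(":", 1)
        return repo, tag
    return ref, "latest"


for uri in (
    "docker://ghcr.io/scientificcomputing/fenics",
    "docker://numericalpdes/base_images:fenics",
):
    print(uri, "->", split_image_ref(uri))
```

If the image has no `latest` tag, appending an existing tag to the URI (as in the `numericalpdes/base_images:fenics` command) would be the thing to try.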
FATAL: While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: Error initializing source oci:/home/dli292/.singularity/cache/blob:02e639b5e8e21bafdbdf6684b83ed8f924f2e32d1445eb1e77105d0771a4649b: Error writing blob: write /home/dli292/.singularity/cache/blob/oci-put-blob469916469: disk quota exceeded.
That is clearly unrelated. You have run out of disk space in your home directory. Either clean up old attempts in the Singularity cache, or ask your admin for more space.
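To see how much of the quota the cache is using, something like the following can help. It is a rough sketch: the default location `~/.singularity/cache` is taken from the path in the error message, `SINGULARITY_CACHEDIR` is the environment variable Singularity uses to relocate the cache, and `dir_size_bytes` is a hypothetical helper name.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


cache = os.environ.get(
    "SINGULARITY_CACHEDIR", os.path.expanduser("~/.singularity/cache")
)
if os.path.isdir(cache):
    print(f"{cache}: {dir_size_bytes(cache) / 1e9:.2f} GB")
else:
    print(f"{cache} does not exist")
```

`singularity cache clean` removes cached blobs, and exporting `SINGULARITY_CACHEDIR` to a scratch file system before pulling keeps large layers out of the home quota, assuming the cluster provides such a file system.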
Thank you for the response. I tried what Dokken suggested and that fixed the first issue, but I still got the original mpi4py error when I ran a test. Then I tried the second image, numericalpdes/base_images:fenics, which got rid of the mpi4py issue. When I run a test now, the issue looks like:
I am afraid I will not be able to help much further, because I don’t know how Singularity works and I have never used it. Still, looking at the picture, it seems to me that MPI is getting loaded from the existing environment /mnt/gpfs3_amd/... rather than from the Docker image.
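One way to check that suspicion from inside the container is to print where `mpi4py` is found and which search-path variables still contain host directories. This is only a sketch: the prefix `/mnt/` is a guess based on the path quoted above, and `module_origin` and `host_paths` are hypothetical helper names.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def host_paths(env, prefix):
    """List entries of common search-path variables that start with prefix."""
    hits = {}
    for var in ("PATH", "LD_LIBRARY_PATH", "PYTHONPATH"):
        entries = [
            p for p in env.get(var, "").split(os.pathsep) if p.startswith(prefix)
        ]
        if entries:
            hits[var] = entries
    return hits


print("mpi4py found at:", module_origin("mpi4py"))
# "/mnt/" is an assumed prefix for host file systems on this cluster.
print("host entries:", host_paths(os.environ, "/mnt/"))
```

If host directories do show up, running with `singularity exec --cleanenv ...` starts the container without the host environment, which may be enough to make the image's own MPI get picked up.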