Dolfinx petsc error for large mesh

Thank you, I will revert to using mpirun/mpiexec and see if that helps with the problem, while I clarify the reasons for the advisory at https://docs.pace.gatech.edu/phoenix_cluster/slurm_conversion_phnx/.

The webpage says:

  • srun is the standard SLURM command to start an MPI program. It automatically uses the allocated job resources: nodelist, tasks, logical cores per task. Do not use mpirun.

This means that srun is equivalent to using mpirun, but you need to set the number of processes you want to use on each node (the number of tasks); see for instance https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287338/How-to+Submit+a+MPI+Job

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
This will result in an MPI job with 84 processes, with 28 processes on each of 3 nodes. In this case, we’ve chosen 28 tasks per node with the knowledge that each node has 28 CPU cores. Although it’s not necessary to use all cores on a node, this is often efficient, since more communication between processes happens on the same node.
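For my own reference, a minimal job script along those lines would look roughly like the sketch below; the environment-setup line is a placeholder for however FEniCSx is activated on the cluster, and the node/task counts are just the ones from the quoted example.

#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28

# Placeholder: load or activate the FEniCSx installation here (module load / spack env activate ...)

# srun inherits the node and task layout from the #SBATCH directives above
srun python3 conductivity.py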

Thank you for pointing me to the resource. Will update if I get resolution.

Any updates, @leshinka? Were you able to solve the issue? Thanks

Sorry for the belated reply. I was on break. I made changes to my scripts according to the suggestions by @dokken, but that did not solve the issue. I found alternative ways to do the analysis that I wanted without relying on the large mesh, so I have stalled further work on the problem until I have more bandwidth/inspiration. I built my mesh in gmsh before converting to xdmf format. Is it possible that the mesh construction process could have some bearing on the issue we face?


I wouldn't think so. I would suggest first checking how many cells, dofs, etc. there are per process. If this number is too large (e.g. if you run everything on a single process), you cannot create the matrix with a standard PETSc installation.
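For example, something like the following rough sketch would print the per-rank counts (assuming domain is your mesh and V is your function space):

tdim = domain.topology.dim
num_cells_local = domain.topology.index_map(tdim).size_local
num_dofs_local = V.dofmap.index_map.size_local * V.dofmap.index_map_bs
# Each rank reports the cells and dofs it owns
print(f"rank {domain.comm.rank}: {num_cells_local} owned cells, {num_dofs_local} owned dofs")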


Indeed, I have been running everything on one process (specifying one node), since that simplified the downstream post-processing of the results (everything is assembled on one process, and one report is logged). Is there a way to constrain all the post-processing to a single process? For example, will nesting the post-processing in the following (where domain is the mesh object)

if domain.comm.rank == 0:
    insulated_area = fem.assemble_scalar(fem.form(1 * ds(markers.insulated)))

give accurate results if the original problem is run on several processes?

A trivial MPI allreduce or reduce does the trick, as done in
https://jsdokken.com/dolfinx-tutorial/chapter2/heat_code.html?highlight=allreduce#verifying-the-numerical-solution
i.e.

insulated_area = fem.form(1 * ds(markers.insulated))
# Each rank integrates over its locally owned facets
local_area = fem.assemble_scalar(insulated_area)
# Sum the local contributions; every rank receives the global value
global_area = domain.comm.allreduce(local_area, op=MPI.SUM)
if domain.comm.rank == 0:
    print(global_area)

or

insulated_area = fem.form(1 * ds(markers.insulated))
local_area = fem.assemble_scalar(insulated_area)
# Sum the local contributions; only rank 0 receives the global value
global_area = domain.comm.reduce(local_area, op=MPI.SUM, root=0)
if domain.comm.rank == 0:
    print(global_area)

The latter (reduce) only creates the summed value on rank 0; the other ranks receive None.


Consider a tutorial on the use of MPI, e.g.:

https://mpi4py.readthedocs.io/en/stable/tutorial.html


You can also consider Introduction to MPI — MPI tutorial, which is tailored towards dolfinx.

Thank you @dokken and @nate !

I think my issue might be related to memory limitations that don't raise an OOM error. I have run the same problem for a range of mesh sizes (4 problems, to be exact), and the RAM usage per node is just shy of 384 GB for the second-largest mesh I am working with, which has 46 million tetrahedral elements. I used 20 nodes, each with 384 GB, for each problem, and I only get the vector::reserve error for the problem with 57 million tetrahedral elements. The accompanying facets number less than a million triangle elements, so they don't make much of a difference. I don't have access to nodes with more RAM, so I cannot falsify the theory.

<Xdmf Version="3.0">
  <Domain>
    <Grid Name="Grid">
      <Geometry GeometryType="XYZ">
        <DataItem DataType="Float" Dimensions="9261442 3" Format="HDF" Precision="8">tetr.h5:/data0</DataItem>
      </Geometry>
      <Topology TopologyType="Tetrahedron" NumberOfElements="56817384" NodesPerElement="4">
        <DataItem DataType="Int" Dimensions="56817384 4" Format="HDF" Precision="8">tetr.h5:/data1</DataItem>
      </Topology>
      <Attribute Name="name_to_read" AttributeType="Scalar" Center="Cell">
        <DataItem DataType="Int" Dimensions="56817384" Format="HDF" Precision="8">tetr.h5:/data2</DataItem>
      </Attribute>
    </Grid>
  </Domain>
</Xdmf>
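For what it's worth, next time I run the job I could log the per-rank peak memory with a sketch like the one below (assuming a Linux cluster, where ru_maxrss is reported in kB):

import resource

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Peak resident set size of this rank, converted from kB to GiB
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
# Report the most heavily loaded rank
max_gib = comm.allreduce(peak_gib, op=MPI.MAX)
if comm.rank == 0:
    print(f"Peak RSS on the most loaded rank: {max_gib:.1f} GiB")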

@fluid what’s the context for your problem?

Assuming the mesh and matrices fit into memory, which linear solution method are you employing?

It is unclear to me whether you are still running things in serial, or whether you now run in parallel.
Are you using MPI.COMM_WORLD when declaring/loading your mesh?

Yes, I am using MPI.COMM_WORLD when declaring/loading the mesh. I have grabbed relevant parts showing mesh loading:

# Imports needed by this snippet
from mpi4py import MPI
from dolfinx import cpp, io, mesh

# logger, tetr_mesh_path, tria_mesh_path and markers are defined elsewhere in the script
comm = MPI.COMM_WORLD
logger.debug("Loading tetrahedra (dim = 3) mesh..")
with io.XDMFFile(comm, tetr_mesh_path, "r") as infile3:
    domain = infile3.read_mesh(cpp.mesh.GhostMode.none, 'Grid')
    ct = infile3.read_meshtags(domain, name="Grid")
domain.topology.create_connectivity(domain.topology.dim, domain.topology.dim - 1)
with io.XDMFFile(comm, tria_mesh_path, "r") as infile2:
    ft = infile2.read_meshtags(domain, name="Grid")
left_boundary = ft.find(markers.left_cc)
right_boundary = ft.find(markers.right_cc)
meshtags = mesh.meshtags(domain, 2, ft.indices, ft.values)

I am using fem.petsc.LinearProblem with the following Krylov solver options:

options = {
    "ksp_type": "gmres",
    "pc_type": "hypre",
    "ksp_rtol": 1.0e-12
}
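For context, these options are passed to the solver like this, where a, L, left_bc and right_bc are the forms and Dirichlet conditions defined earlier in my script:

model = fem.petsc.LinearProblem(a, L, bcs=[left_bc, right_bc], petsc_options=options)
uh = model.solve()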

Do you have a backtrace showing where in the code the error occurs? Is it in assembly, in the solve, or somewhere else?

The error occurs during model assembly, when PETSc tries to create the matrix. Beyond this traceback, I don't know how to obtain any additional trace:

Traceback (most recent call last):
  File "/storage/coda1/p-tf74/0/shared/leshinka/ssb/conductivity.py", line 130, in <module>
    model = fem.petsc.LinearProblem(a, L, bcs=[left_bc, right_bc], petsc_options=options)
  File "/storage/coda1/p-tf74/0/shared/leshinka/spack/var/spack/environments/fenicsx-env/.spack-env/view/lib/python3.10/site-packages/dolfinx/fem/petsc.py", line 567, in __init__
    self._A = create_matrix(self._a)
  File "/storage/coda1/p-tf74/0/shared/leshinka/spack/var/spack/environments/fenicsx-env/.spack-env/view/lib/python3.10/site-packages/dolfinx/fem/petsc.py", line 131, in create_matrix
    return _cpp.fem.petsc.create_matrix(a)
ValueError: vector::reserve

Is this what you are referring to?

This indicates you cannot fit your matrix into memory. How many degrees of freedom are in your global problem? Or can you say roughly how many degrees of freedom per process you're trying to solve for?

In the global problem, there are on the order of 28 million DOFs. @nate, per your comment, and given that a problem with slightly fewer DOFs nearly exhausted the maximum memory per node, it may simply mean that I cannot run this problem with the memory allocated: Dolfinx petsc error for large mesh - #32 by leshinka. I can rerun the problem to get the DOF count per process, but that would take a bit more time.

28e6 / N_proc will give you a sufficient estimate of the number of DoFs per process, where N_proc is the number of processes.
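For example, if you ran on 20 nodes with (say) 24 ranks per node, that is 480 processes and roughly 28e6 / 480 ≈ 58,000 DoFs per process; the ranks-per-node figure here is only an illustration, so substitute whatever your job actually requests.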

Another factor to consider is the density of the matrix. This is indicated, for example, by the polynomial degree and element type of the FE space you’re employing. Naturally, the more dense the matrix, the more memory required.
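Once you have a case that does assemble (e.g. one of your smaller meshes), PETSc itself can report the storage it uses. A rough petsc4py sketch, where model is the LinearProblem from your traceback:

from mpi4py import MPI

A = model.A  # the assembled PETSc matrix held by the LinearProblem
info = A.getInfo()  # dict with nonzero counts and memory figures (see petsc4py's Mat.getInfo)
print(f"rank {MPI.COMM_WORLD.rank}: nz_allocated={info['nz_allocated']:.3e}, memory={info['memory']:.3e} bytes")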