My code works in serial, but not in parallel, due to issues with cache files

Over the last few months, I have been trying to install FEniCS 2018.1.0 on my personal computer and on a cluster. After solving several problems during the installation process, I managed to install FEniCS on my computer and verified that it works properly with some of my codes. However, the same code does not work on the cluster, even though the installation there is supposedly complete.

I realized the program does not work properly when I tried to run my simulations on the cluster using more than one processor. During the simulation, the program writes some temporary files under ".cache/". When only one processor is used, it is the only one reading and writing those files, and everything works correctly. When the number of processors is increased, each of them tries to create its own temporary files, but finds the files already created by the other processors. It then tries to back up the existing files, remove them, and create its own temporary files. This behavior leads to two different outcomes:

  • In some simulations, the processors manage to coordinate and the simulation appears to run. However, the program is terribly slow, because it is copying and moving files all the time. Moreover, I do not trust the results, although I cannot even check them because the program is so slow.

  • In other simulations, while one processor is copying those temporary files, another processor tries to access one of them, does not find the expected file, and the program prints the error shown below (what follows is an extract of the full output):

Moving new file over differing existing file:
src: /tmp/tmp983p99_d/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Moving new file over differing existing file:
src: /tmp/tmp3fmi3u_z/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Backup file exists, overwriting.
Moving new file over differing existing file:
src: /tmp/tmpd9tfdgrg/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Backup file exists, overwriting.
Moving new file over differing existing file:
src: /tmp/tmp_7qcgtkr/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Backup file exists, overwriting.
Moving new file over differing existing file:
src: /tmp/tmp3fl3r0x6/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Backup file exists, overwriting.
Moving new file over differing existing file:
src: /tmp/tmpj5jkkqik/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
dst: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz
backup: /home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old
Backup file exists, overwriting.
Traceback (most recent call last):
File "/apps/cent7/anaconda/5.3.1-py37/lib/python3.7/shutil.py", line 557, in move
os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz' -> '/home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old.priv.265268745244341306884196712515058776583'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Elasticidad2D-Fractura.py", line 165, in <module>
V = FunctionSpace(mesh, U)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dolfin/function/functionspace.py", line 31, in __init__
self._init_from_ufl(*args, **kwargs)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dolfin/function/functionspace.py", line 43, in _init_from_ufl
mpi_comm=mesh.mpi_comm())
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dolfin/jit/jit.py", line 47, in mpi_jit
return local_jit(*args, **kwargs)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dolfin/jit/jit.py", line 97, in ffc_jit
return ffc.jit(ufl_form, parameters=p)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/ffc/jitcompiler.py", line 217, in jit
module = jit_build(ufl_object, module_name, parameters)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/ffc/jitcompiler.py", line 133, in jit_build
generate=jit_generate)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/jit.py", line 165, in jit
header, source, dependencies = generate(jitable, name, signature, params["generator"])
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/ffc/jitcompiler.py", line 76, in jit_generate
dep_module_name = jit(dep, parameters, indirect=True)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/ffc/jitcompiler.py", line 217, in jit
module = jit_build(ufl_object, module_name, parameters)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/ffc/jitcompiler.py", line 133, in jit_build
generate=jit_generate)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/jit.py", line 178, in jit
params)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/build.py", line 181, in build_shared_library
lockfree_move_file(temp_src_filename, src_filename)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/system.py", line 248, in lockfree_move_file
return _lockfree_move_file(src, dst, False)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/system.py", line 275, in _lockfree_move_file
_lockfree_move_file(dst, backup, True)
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/system.py", line 287, in _lockfree_move_file
move_file(src, priv(ui))
File "/home/acastele/.conda/envs/cent7/5.3.1-py37/FEniCS.env/lib/python3.7/site-packages/dijitso/system.py", line 234, in move_file
shutil.move(srcfilename, dstfilename)
File "/apps/cent7/anaconda/5.3.1-py37/lib/python3.7/shutil.py", line 571, in move
copy_function(src, real_dst)
File "/apps/cent7/anaconda/5.3.1-py37/lib/python3.7/shutil.py", line 257, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/apps/cent7/anaconda/5.3.1-py37/lib/python3.7/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz'
Traceback (most recent call last):
File "/apps/cent7/anaconda/5.3.1-py37/lib/python3.7/shutil.py", line 557, in move
os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz' -> '/home/acastele/.cache/dijitso/src/ffc_element_ee3c68ce6482b04838050db8ba0e96b7572c5935.cpp.gz.old.priv.221521007679093248207297234337480157215'

Which of the two behaviors we get depends on the number of processors: the more processors we use, the more likely the error becomes.
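A small check that might be relevant here (my assumption: if every process reports itself as rank 0 of a size-1 communicator, then each of them runs the JIT compilation independently and they all fight over the same cache files) would be to print the MPI rank and size at the top of the script:

# Minimal MPI sanity check. dolfin 2018.1 exposes the world communicator
# as MPI.comm_world; if every process prints "rank 0 of 1", the processes
# are not sharing a communicator, and each one will compile and write to
# ~/.cache/dijitso on its own.
from dolfin import MPI

comm = MPI.comm_world
print("rank %d of %d" % (MPI.rank(comm), MPI.size(comm)))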

Below is one of the codes that works on my personal computer, but not in parallel on the cluster.

from dolfin import *

import numpy as np

# Define numerical simulation specific parameters

# Define geometrical parameters
Lx = 1.000e+00
Ly = 1.000e+00

# Define elasticity theory parameters
lame_lambda = 1.2115e+05
lame_mu = 8.0770e+04

# Define fracture model parameters
Gc = 2.700e+00

# Define movement parameters
umax = 2.500e-02

# Define temporal parameters
time_ini_t = 0.000e+00
time_fin_t = 1.000e+00
time_dt = 1.000e-04
num_steps = 10000

# Create mesh

# Define empty mesh
mesh = Mesh()

# Define mesh editor
editor = MeshEditor()

# Open mesh editor
editor.open(mesh, "quadrilateral", 2, 2)

# Set numerical simulation discretization options
Nx = 500
Ny = 500
p = 2  # polynomial degree (the name p is reused later for a Function)

# Define number of vertices
editor.init_vertices((Nx + 1) * (Ny + 1))

# Define number of cells
editor.init_cells(Nx * Ny)

# Define list of vertices for mesh
for j in range(Ny + 1):
    for i in range(Nx + 1):
        vertex_index = ((Nx + 1) * j) + i
        vertex_x = i * (Lx / Nx)
        vertex_y = j * (Ly / Ny)
        editor.add_vertex(vertex_index, [vertex_x, vertex_y])

# Define list of cells for mesh
for j in range(Ny):
    for i in range(Nx):
        cell_index = (Nx * j) + i
        cell_vertex_1 = ((Nx + 1) * j) + i
        cell_vertex_2 = ((Nx + 1) * j) + (i + 1)
        cell_vertex_3 = ((Nx + 1) * (j + 1)) + i
        cell_vertex_4 = ((Nx + 1) * (j + 1)) + (i + 1)
        editor.add_cell(cell_index, [cell_vertex_1, cell_vertex_2, cell_vertex_3, cell_vertex_4])

# Close mesh editor
editor.close()

# Define phase field parameters
l = 2 * mesh.hmin()

# Classes for interfacing with the Newton solver
class Displacements_Equation(NonlinearProblem):

    def __init__(self, L, a, bc):
        NonlinearProblem.__init__(self)
        self.L = L
        self.a = a
        self.bc = bc

    def F(self, b, x):
        assemble(self.L, tensor=b)
        self.bc[0].apply(b, x)
        self.bc[1].apply(b, x)

    def J(self, A, x):
        assemble(self.a, tensor=A)
        self.bc[0].apply(A)
        self.bc[1].apply(A)


class PhaseField_Equation(NonlinearProblem):

    def __init__(self, L, a, bc):
        NonlinearProblem.__init__(self)
        self.L = L
        self.a = a
        self.bc = bc

    def F(self, b, x):
        assemble(self.L, tensor=b)
        self.bc.apply(b, x)

    def J(self, A, x):
        assemble(self.a, tensor=A)
        self.bc.apply(A)

# Define function spaces
U = VectorElement("Lagrange", mesh.ufl_cell(), p)
M = FiniteElement("Lagrange", mesh.ufl_cell(), p)
P = FiniteElement("Lagrange", mesh.ufl_cell(), 1)

V = FunctionSpace(mesh, U)
N = FunctionSpace(mesh, M)
Q = FunctionSpace(mesh, P)

# Define trial and test functions
du = TrialFunction(V)
dm = TrialFunction(N)

v = TestFunction(V)
n = TestFunction(N)

# Define functions for solutions
u = Function(V)
m = Function(N)
p = Function(Q)

unew = Function(V)
mold = Function(N)

H = Function(Q)
Hold = Function(Q)

# Define boundary conditions

# Define values for boundary conditions
Value_Fixed = Constant((0.0, 0.0))
Value_Movement = Expression(("umax*t", "0.0"), degree=1, umax=umax, t=0.0)

Value_PhaseField = Constant(1.0)

# Define boundaries for boundary conditions
class Boundaries_Bottom(SubDomain):

    def inside(self, x, on_boundary):
        return abs(x[1]) < 1.0e-10 and on_boundary


class Boundaries_Top(SubDomain):

    def inside(self, x, on_boundary):
        return abs(x[1] - Ly) < 1.0e-10 and on_boundary


class Boundaries_PhaseField(SubDomain):

    def inside(self, x, on_boundary):
        return x[0] <= Lx / 2. and abs(x[1] - Ly / 2.) < 2.5e-03

# Define boundary conditions
BC_Fixed_Bottom = DirichletBC(V, Value_Fixed, Boundaries_Bottom())
BC_Movement_Top = DirichletBC(V, Value_Movement, Boundaries_Top())

BC_Displacements = [BC_Fixed_Bottom, BC_Movement_Top]

BC_PhaseField = DirichletBC(N, Value_PhaseField, Boundaries_PhaseField())

# Define expressions used in variational forms
def Psi0(u):
    return ((lame_lambda / 2.) * ((grad(u)[0, 0] + grad(u)[1, 1]) ** 2.)) \
         + (lame_mu * (((grad(u)[0, 0]) ** 2.) + ((grad(u)[1, 1]) ** 2.) + (grad(u)[0, 1] * grad(u)[1, 0]))) \
         + ((lame_mu / 2.) * (((grad(u)[0, 1]) ** 2.) + ((grad(u)[1, 0]) ** 2.)))

# Define variational problem for time step
Equation_Displacements = ((1. - mold) ** 2) \
    * inner(lame_lambda * div(u) * Identity(2)
            + 2. * lame_mu * (1. / 2.) * (grad(u) + grad(u).T),
            (1. / 2.) * (grad(v) + grad(v).T)) * dx

Equation_PhaseField_1 = (l ** 2) * inner(grad(m), grad(n)) * dx
Equation_PhaseField_2 = (1. + ((2. * l) / Gc) * H) * inner(m, n) * dx
Equation_PhaseField_3 = (((2. * l) / Gc) * H) * n * dx

Equation_PhaseField = Equation_PhaseField_1 + Equation_PhaseField_2 - Equation_PhaseField_3

# Compute directional derivative (Jacobian)
Jacobian_Displacements = derivative(Equation_Displacements, u, du)
Jacobian_PhaseField = derivative(Equation_PhaseField, m, dm)

# Create nonlinear problems
Problem_Displacements = Displacements_Equation(Equation_Displacements, Jacobian_Displacements, BC_Displacements)
Problem_PhaseField = PhaseField_Equation(Equation_PhaseField, Jacobian_PhaseField, BC_PhaseField)

# Create Newton solvers
Displacements_solver = NewtonSolver()

Displacements_solver.parameters["absolute_tolerance"] = 1.000e-50
Displacements_solver.parameters["convergence_criterion"] = "residual"
Displacements_solver.parameters["error_on_nonconvergence"] = True
Displacements_solver.parameters["linear_solver"] = "cg"
Displacements_solver.parameters["maximum_iterations"] = 100
Displacements_solver.parameters["preconditioner"] = "hypre_euclid"
Displacements_solver.parameters["relative_tolerance"] = 1.000e-05
Displacements_solver.parameters["report"] = True

Displacements_solver.parameters["krylov_solver"]["absolute_tolerance"] = 1.000e-50
Displacements_solver.parameters["krylov_solver"]["error_on_nonconvergence"] = True
Displacements_solver.parameters["krylov_solver"]["maximum_iterations"] = 50000
Displacements_solver.parameters["krylov_solver"]["monitor_convergence"] = False
Displacements_solver.parameters["krylov_solver"]["relative_tolerance"] = 1.000e-05
Displacements_solver.parameters["krylov_solver"]["report"] = True

Displacements_solver.parameters["lu_solver"]["report"] = True
Displacements_solver.parameters["lu_solver"]["symmetric"] = True
Displacements_solver.parameters["lu_solver"]["verbose"] = True

info(Displacements_solver.parameters, True)


PhaseField_solver = NewtonSolver()

PhaseField_solver.parameters["absolute_tolerance"] = 1.000e-50
PhaseField_solver.parameters["convergence_criterion"] = "residual"
PhaseField_solver.parameters["error_on_nonconvergence"] = True
PhaseField_solver.parameters["linear_solver"] = "cg"
PhaseField_solver.parameters["maximum_iterations"] = 100
PhaseField_solver.parameters["preconditioner"] = "hypre_euclid"
PhaseField_solver.parameters["relative_tolerance"] = 1.000e-05
PhaseField_solver.parameters["report"] = True

PhaseField_solver.parameters["krylov_solver"]["absolute_tolerance"] = 1.000e-50
PhaseField_solver.parameters["krylov_solver"]["error_on_nonconvergence"] = True
PhaseField_solver.parameters["krylov_solver"]["maximum_iterations"] = 50000
PhaseField_solver.parameters["krylov_solver"]["monitor_convergence"] = False
PhaseField_solver.parameters["krylov_solver"]["relative_tolerance"] = 1.000e-05
PhaseField_solver.parameters["krylov_solver"]["report"] = True

PhaseField_solver.parameters["lu_solver"]["report"] = True
PhaseField_solver.parameters["lu_solver"]["symmetric"] = True
PhaseField_solver.parameters["lu_solver"]["verbose"] = True

info(PhaseField_solver.parameters, True)

# Set form compiler options
parameters["form_compiler"]["cpp_optimize"] = True
parameters["form_compiler"]["optimize"] = True

info(parameters, True)

# Define time-stepping
for i in range(num_steps):

    # Update current time
    time_t = time_ini_t + i * time_dt

    # Update current time for boundary conditions
    Value_Movement.t = time_t

    # Solve variational problem for time step (step 1)
    print(" \n Solving Displacements Equation \n ")

    Displacements_solver.solve(Problem_Displacements, u.vector())

    unew.assign(u)

    # Update maximum strain energy
    Hn = project(Psi0(unew), Q)
    zz = np.maximum(Hn.vector().get_local(), Hold.vector().get_local())
    p.vector().set_local(zz)
    assign(H, p)

    # Solve variational problem for time step (step 2)
    print(" \n Solving PhaseField Equation \n ")

    PhaseField_solver.solve(Problem_PhaseField, m.vector())

    mold.assign(m)

    # Update historical maximum strain energy
    Hn = project(Psi0(unew), Q)
    zz = np.maximum(Hn.vector().get_local(), Hold.vector().get_local())
    p.vector().set_local(zz)
    assign(Hold, p)

    # Save solution to file in VTK format
    if (i % 400) == 0:
        print(" \n\n Simulation Time: ", time_t, " \n\n ")

        u.rename("Displacements", "Displacements")
        m.rename("PhaseField", "PhaseField")

        vtkfile_Displacements = File("/scratch/brown/acastele/simulations/Elasticity2D-FractureModel-Displacements-" + str(i) + ".pvd")
        vtkfile_Displacements << (u, time_t)

        vtkfile_PhaseField = File("/scratch/brown/acastele/simulations/Elasticity2D-FractureModel-PhaseField-" + str(i) + ".pvd")
        vtkfile_PhaseField << (m, time_t)

Dear @acastele,

I am not able to reproduce your error. However, running your problem in parallel on my own computer gave me some indications.

As far as I can tell, the problem is your mesh generation with MeshEditor. I don't know whether MeshEditor has parallel support, and even if it does, I would not recommend creating the mesh as part of the simulation.

A mesh similar to the one in your problem can be created as follows:

Nx = 500
Ny = 500
mesh = UnitSquareMesh.create(Nx, Ny, CellType.Type.quadrilateral)

With this change the code runs on my own computer (I did not run the full simulation, but I get past the point where your code fails).

What happens if you reduce the mesh size and run it on your local computer in parallel?

I used docker:
docker run -ti --rm -v $(pwd):/home/fenics/shared/ -w /home/fenics/shared quay.io/fenicsproject/stable:2018.1.0.r3

On another note, I would not recommend saving your solution as a "pvd" file when running in parallel; I recommend the HDF5 or XDMF format instead.
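For instance, a minimal sketch with XDMFFile (the file names here are just placeholders; u, m, mesh, and time_t refer to the variables in your script):

# Create the output files once, using the mesh's MPI communicator,
# so all processes write to the same file in parallel
xdmf_u = XDMFFile(mesh.mpi_comm(), "displacements.xdmf")
xdmf_m = XDMFFile(mesh.mpi_comm(), "phasefield.xdmf")

# Inside the time loop, append the current time step
xdmf_u.write(u, time_t)
xdmf_m.write(m, time_t)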
Could you indicate if this helps?


I actually have the same issue.

I have a solver function for the Nernst-Planck equation that runs flawlessly in serial, but I want to fit the diffusion coefficient for several different datasets. So I want to run a least-squares optimization in parallel, one per dataset. I implemented it using Python's multiprocessing:

import multiprocessing as mp

if __name__ == '__main__':
    dataFiles = files_with_extension(dataPath, '.h5')
    pool = mp.Pool(processes=40)
    pool.map(vfb_fit, dataFiles)

vfb_fit is just a function that runs the least-squares optimization; a minimal version does something like this:

import h5py
import numpy as np
from scipy import interpolate, optimize

def vfb_fit(dataFile):
    # Load the measured data for this dataset
    hf = h5py.File(dataFile, 'r')
    x = np.array(hf['x'])
    y = np.array(hf['y'])

    def model(x, b):
        D = b[0]
        Cm = b[1]
        # simulate_pnp runs the FEniCS Nernst-Planck solver
        xsim, ysim = simulate_pnp(D, Cm, args)
        f = interpolate.interp1d(xsim, ysim)
        return f(x)

    def fobj(b):
        yinterp = model(x, b)
        return yinterp - y

    res = optimize.least_squares(fobj, [1E-1, 2], **kwargs)

    save_result(res)

I tried the code with just one data file, and it works. But dijitso seems to mess with the cache files…
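A workaround I am considering (an untested assumption on my part: dijitso is supposed to honour the DIJITSO_CACHE_DIR environment variable, so setting it before dolfin is imported should redirect the cache) is to give every worker its own cache directory, so the workers no longer race on ~/.cache/dijitso:

import os
import multiprocessing as mp

def vfb_fit_isolated(dataFile):
    # Point this worker's JIT cache at a private directory keyed by PID,
    # *before* dolfin/dijitso are imported. vfb_fit_isolated and
    # fitting_module are hypothetical names for this sketch.
    os.environ['DIJITSO_CACHE_DIR'] = os.path.expanduser(
        '~/.cache/dijitso-%d' % os.getpid())
    from fitting_module import vfb_fit
    return vfb_fit(dataFile)

if __name__ == '__main__':
    dataFiles = files_with_extension(dataPath, '.h5')
    pool = mp.Pool(processes=40)
    pool.map(vfb_fit_isolated, dataFiles)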