I understand that parallelization in Ubuntu should work out of the box.
I am using FEniCS on a workstation with 2 Intel Xeon CPUs (20 cores each) and on a laptop with 1 Intel i7 CPU (8 cores). Both are running Ubuntu 18.04 and FEniCS has been installed from the ppa:fenics-packages/fenics repository.
I must be missing something very basic: all my codes run in parallel on the workstation (using all 40 cores), but not on the laptop, where only one core is used.
As a working example you can take any demo, for instance the Cahn-Hilliard demo.
I wonder if it could be a kernel-related problem. On the workstation I am running a generic Ubuntu kernel, while the laptop has an OEM kernel.
Any hints?
What does `nproc` report on the laptop?
How are you launching your job on the laptop?
`nproc` reports 8 on the laptop and 40 on the workstation.
I am launching the job from the command line (on both the laptop and the workstation) with `python3 my_program.py`.
No, multiprocessor jobs don't run that way. You've only launched it as a single-processor job. Launch a multiprocessor job with

`mpirun -n <N> python3 my_program.py`

(`mpirun` or `mpiexec`, same thing.)
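If you want to convince yourself that all N copies really start, here is a quick sanity check (a minimal sketch, assuming mpi4py is available, which the FEniCS packages pull in):

```bash
# Each MPI rank prints its own rank; with -n 4 you should see 0, 1, 2, 3.
mpirun -n 4 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.rank)"
```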
I know that I could use mpirun, but I am not sure that I want to run one instance of the same program per core. Besides duplicating plots, etc., this would not be efficient, as is discussed, for instance, in this post.
I actually want to exploit the multithreading capability of the linear algebra solvers, which works fine on the workstation, and I don't understand why the same doesn't work on the laptop.
To be more explicit: a sequential run of a code on the workstation uses all 40 cores when solving a linear system. As far as I understand, this should be the default behavior.
OK, quite a different scenario then. You want multithreading, not MPI multiprocessing. Dolfin (and PETSc) is set up to use MPI, so it's normal that a single-process launch only runs on one processor. The weird thing here is the opposite situation: why is your workstation running multithreaded?
Multithreading would mean code compiled with OpenMP support, or with pthreads. libpetsc is compiled against pthreads, so that may be the multithreading your workstation is giving you.
Common advice is to set `OMP_NUM_THREADS=1` when running MPI jobs, so that multithreading does not interfere with MPI operations. (It is possible in principle to have both in play at the same time, but it tends to complicate the code to the point that it becomes unmaintainable.) But that's done to make sure there's only one OpenMP thread (not pthread) per process.
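For example (a sketch; with Open MPI a variable set on the command line is exported to all ranks, other MPI implementations may need an explicit export flag):

```bash
# 8 MPI processes, each restricted to a single OpenMP/BLAS thread,
# so the ranks don't oversubscribe the cores.
OMP_NUM_THREADS=1 mpirun -n 8 python3 my_program.py
```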
In your case, look into what controls PETSc's use of pthreads. I would have expected your workstation to work the same way as the laptop if they're both running the same Ubuntu. If they're behaving differently, then perhaps there's some environment variable set differently somewhere, say in /etc/profile.d. Or, since you've got different kernels, perhaps libpthread inspects kernel capabilities. I don't know if it's possible to control communication between libpthread and the kernel; I can't see anything in the pthreads docs about it.
There’s some discussion of PETSc and pthreads at
https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html
https://www.mcs.anl.gov/petsc/meetings/2016/slides/nothreads.pdf
I’ve no complete answer but I hope that helps a little.
Thinking about it more, I'm increasingly certain your BLAS is behind it. PETSc builds against pthreads, but there are lower-level libraries to consider as well: MUMPS, ScaLAPACK, BLAS.
There are several implementations of BLAS: OpenBLAS, BLIS, ATLAS, and others. OpenBLAS is alternatively built against pthreads or OpenMP (or without threading). There are also 64-bit builds (64-bit pointers), but the rest of the stack is not using them yet.
If you haven't specified which BLAS to install on your system, then you probably have the reference implementation, libblas-dev, which performs poorly. It looks like a thread-optimised implementation has been installed on your workstation. You'll want to choose one that works well for your system. OpenBLAS is probably a good generic choice. ATLAS only performs well when compiled with the specific flags that properly optimise it for your CPU. BLIS is new; maybe it's OK. Intel's MKL is also possible.
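To see what is installed and selected on each machine, you can query the alternatives system first (a sketch; the long alternative name is explained just below):

```bash
# Lists all registered BLAS providers and marks the one currently in use.
update-alternatives --display libblas.so.3-x86_64-linux-gnu
```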
The various BLAS alternatives can be installed at the same time, and linking to them is dynamic (at runtime) via libblas.so.3. You can choose your preferred alternative with

`sudo update-alternatives --config libblas.so-x86_64-linux-gnu`

(and `sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu`). The Debian BLAS developer hasn't chosen to use simple names like "blas" for the alternative, alas.
tl;dr: `sudo apt-get install libopenblas-pthread-dev` on your laptop might help.
Your guess is correct. I did some experiments yesterday and wanted to wait for more data before reporting.
I checked the dependencies of PETSc:
`ldd /usr/lib/petsc/lib/libpetsc_real.so.3.7.7`
It was apparent that the main difference was the presence on the workstation of two BLAS implementations, libblas.so.3 and libopenblas.so.0, while on the laptop only the first one was available. I then installed OpenBLAS on the laptop as well:
`sudo apt-get install libopenblas-base`
This was enough to reproduce on the laptop the same behavior as observed on the workstation.
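To double-check which BLAS PETSc picks up at run time, the same ldd test can be filtered (a small sketch reusing the library path from above):

```bash
# After installing libopenblas-base, libopenblas.so.0 shows up here too.
ldd /usr/lib/petsc/lib/libpetsc_real.so.3.7.7 | grep -i blas
```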
I understand that this is not the preferred option according to the developers of PETSc, but I think that for a user who is mainly interested in speeding up sequential codes, it is a very easy and efficient solution.
As already mentioned by dparsons, if you want to use mpirun and avoid multithreading, then you set `OMP_NUM_THREADS=1`, or also `OPENBLAS_NUM_THREADS=1`.
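In summary, the two modes look like this (a sketch, reusing the command lines from earlier in the thread):

```bash
# Sequential run: let the threaded BLAS use all cores for the linear algebra.
python3 my_program.py

# MPI run: disable BLAS threading so the ranks don't compete for cores.
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 mpirun -n 8 python3 my_program.py
```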
Just as a postscript, BLIS might be one to look into more closely. Their benchmarks suggest it outperforms OpenBLAS in many situations, and it often keeps up with MKL, though MKL performs better overall. Try `apt-get install libblis3-pthread` (or libblis3-openmp, or libblis3 if you don't care which threading).
https://github.com/flame/blis/blob/master/docs/Performance.md
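If the Debian BLIS packages register themselves as BLAS alternatives the same way OpenBLAS does (I believe they do, through BLIS's BLAS compatibility layer, but check on your system), selecting BLIS would be the same dance as above:

```bash
# Install the pthread build of BLIS, then pick it as the libblas.so.3 provider.
sudo apt-get install libblis3-pthread
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
```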