You probably want an optimized BLAS implementation such as OpenBLAS, built with threading support (pthreads or OpenMP).
cf. the thread "SLEPC solver running in Parallel", reply #2 by dparsons.
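In case it helps, here is a minimal sketch of how the thread count for a threaded BLAS is usually requested from Python. It assumes an OpenBLAS build that honors `OPENBLAS_NUM_THREADS`, or an OpenMP build that honors `OMP_NUM_THREADS`. The helper name `set_blas_threads` is made up for illustration and is not part of any library.

```python
import os


def set_blas_threads(n_threads: int) -> dict:
    """Request n_threads from a threaded BLAS via environment variables.

    Hypothetical helper for illustration only. The variables are typically
    read when the BLAS library is loaded, so call this before importing
    anything that links against BLAS (NumPy, SciPy, PETSc/SLEPc bindings).
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    settings = {
        "OPENBLAS_NUM_THREADS": str(n_threads),  # read by OpenBLAS
        "OMP_NUM_THREADS": str(n_threads),       # read by OpenMP runtimes
    }
    os.environ.update(settings)
    return settings


# Example: ask for 4 BLAS threads before any numerical library is imported.
set_blas_threads(4)
```

Whether this has any effect depends on which BLAS your installation is actually linked against; with a reference (unoptimized, single-threaded) BLAS these variables are simply ignored.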