The Breadth of Autotuning in Dense Linear Algebra on Multicore Systems with Accelerators
Jakub Kurzak

Numerical libraries such as PLASMA and MAGMA target shared-memory systems with multiple sockets of multicore processors and multiple GPU accelerators. Delivering good performance on such systems makes autotuning essential, which requires: tuning the CPU kernels for the highest serial and parallel performance, tuning the GPU kernels for the highest performance, and then finding the best load balance between the CPU cores and the GPU devices. This presentation highlights some of the approaches used in the PLASMA and MAGMA libraries to tackle these problems.
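To make the tuning steps concrete, the sketch below shows two generic ingredients in Python: an exhaustive timing search over candidate kernel parameters (such as tile sizes), and a simple throughput-proportional split of work between CPU and GPU. This is a minimal illustration of the general technique, not the actual PLASMA/MAGMA tuning machinery; the function names and the toy tiled-sum kernel are hypothetical.

```python
import time

def timed(fn):
    # Wall-clock one invocation of a zero-argument callable.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def autotune(make_kernel, candidates, runs=3):
    # Exhaustive search: time each candidate parameter value and keep
    # the fastest. Real autotuners typically prune the search space,
    # but the principle is the same.
    best_param, best_time = None, float("inf")
    for param in candidates:
        kernel = make_kernel(param)
        elapsed = min(timed(kernel) for _ in range(runs))
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param, best_time

def gpu_fraction(cpu_gflops, gpu_gflops):
    # Static load balance: assign the GPU a share of the work
    # proportional to its measured throughput (a simple model;
    # dynamic schedulers refine this at run time).
    return gpu_gflops / (cpu_gflops + gpu_gflops)

# Toy kernel parameterized by a tile size, standing in for a
# tunable BLAS-style routine.
def tiled_sum(data, tile):
    total = 0
    for i in range(0, len(data), tile):
        total += sum(data[i:i + tile])
    return total

data = list(range(1_000_000))
best_tile, _ = autotune(lambda nb: (lambda: tiled_sum(data, nb)),
                        [256, 1024, 4096, 16384])
split = gpu_fraction(cpu_gflops=100.0, gpu_gflops=300.0)
```

Here `best_tile` is whichever candidate timed fastest on this machine, and `split` (0.75 in this example) would be the fraction of the matrix handed to the GPU under the throughput-proportional model.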