Automation in Computational Biology

Prof. Paolo Bientinesi, Ph.D. RWTH Aachen, AICES

In the past 30 years the development of linear algebra libraries has been tremendously successful, resulting in a variety of reliable and efficient computational kernels. Unfortunately these kernels--meant to become the building blocks for scientific and engineering applications--are not designed to exploit knowledge relative to the specific target application. If opportunely used, this extra knowledge may lead to domain-specific algorithms that attain higher performance than any traditional library.

As a case study, we look at a common operation in computational biology, the computation of mixed-effects models; in particular, we consider the use of mixed models in the context of genome analysis. At the core of this operation lays a generalized least square problem (GLS); GLS may be directly solved with Matlab, or may be reduced to a form accepted by LAPACK. Either way, none of these solutions can exploit the special structure of GLS within genome analysis. Specifically, as part of this application it has to be solved not one, but a large two-dimensional parametric sequence of GLS'. Since the performance of an algorithm is directly affected by the choice of parameters, a family of algorithms is needed.

In this talk we show how automation comes to help. We introduce a symbolic system, written in Mathematica, that takes as input a matrix equation and automatically returns a family of algorithms to solve the equation. The system has knowledge of matrix properties, matrix factorizations, and rules of linear algebra; it decomposes the input equation into a sequence of building blocks and maps them onto available high-performance kernels. Automation is achieved through extensive use of pattern matching and rewrite rules. When applied to GLS in the context of genome analysis, it generates algorithms that outperform LAPACK by a factor of six.