New Data Structures are Necessary and Sufficient
for Dense Linear Algebra Factorization Algorithms
Fred Gustavson, Adjunct Professor at Umea University, Umea, Sweden & Emeritus at IBM Research Center, Yorktown Heights, NY, USA, fg2935@hotmail.com
Jerzy Wasniewski, Inst. of Math. Modeling, Danish Technical University, Lyngby, Copenhagen, Denmark, jw@imm.dtu.dk
Overview of the Tutorial:
This tutorial will consist of four parts on the importance of New Data Structures (NDS) for Cell, as well as for multi-core / many-core processors in general. It will cover some pioneering work of Jack Dongarra's team at the Innovative Computing Laboratory (ICL) in Knoxville. We will use their results, plus new results of our own and of others, to demonstrate what changes will probably occur for NDS on these new processors, and why. It will be clear that these NDS are also good for traditional cache-based processors.
Titles and Abstracts
Part 1 has two pieces:
Piece 1 is on "Blocking and New Generalized Data Structures Lead to Very High Performance Dense Linear Algebra Algorithms on Cell".
Here is a summary:
Cache blocking is still a key, or central, programming paradigm for multi-core / many-core processors. The Algorithms and Architecture Approach says to use NDS and kernels for DLA (Dense Linear Algebra) factorization algorithms. We show that all DLAFA (Dense Linear Algebra Factorization Algorithms) are matrix multiply algorithms; a proof of this is given in three key slides. However, DGEMM (General Matrix Matrix Multiplication) is flawed, as its API uses the standard CM (Column Major) / RM (Row Major) array formats of the Fortran and C programming languages. NDS overcome the flaw: DGEMM implementations use NDS! Papers of Jack Dongarra's team state that the use of NDS is the key factor for getting performance on Cell for the Linpack Benchmark.
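To make this concrete, here is a minimal C sketch (our illustration, not the kernels discussed in the tutorial) of the operation every DLAFA inner loop reduces to when NDS are used: a DGEMM update C -= A*B on square blocks that are each stored contiguously, so the kernel only sees unit-stride data. The block size NB is a hypothetical tuning parameter.

    #include <stddef.h>

    /* Hypothetical block size; real kernels tune NB to the cache or local store. */
    #define NB 64

    /* C -= A * B, where A, B, C are NB-by-NB blocks, each stored
     * contiguously in column-major order (the Square Block format).
     * DLAFA inner loops reduce to calls of this form. */
    static void sb_dgemm_nn(const double *A, const double *B, double *C)
    {
        for (size_t j = 0; j < NB; ++j)
            for (size_t k = 0; k < NB; ++k) {
                const double bkj = B[k + j * NB];
                for (size_t i = 0; i < NB; ++i)
                    C[i + j * NB] -= A[i + k * NB] * bkj;
            }
    }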
Piece 2 is on "One, two, three & Higher Dimensions".
Summary:
Matrices are two-dimensional objects. Fortran and C store matrices in one-dimensional arrays. The main theorem of Dimension Theory proves that representing a 2-D matrix as a 1-D object cannot preserve the closeness of sub-matrix elements. NDS represent a matrix as a collection of partitioned submatrices, each laid out contiguously in one dimension, and hence are a very good approximation of a 2-D matrix. This allows cache blocking to be done in the best possible way.
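As an illustration (hypothetical routine name, and the simplifying assumption that the block size nb divides n), the following C sketch rearranges a column-major matrix into Square Block (SB) format, so that each nb-by-nb submatrix occupies one contiguous segment of the 1-D array:

    #include <stddef.h>

    /* Copy an n-by-n column-major matrix A (leading dimension lda) into
     * Square Block format: nb-by-nb blocks, stored block-column by
     * block-column, each block contiguous in column-major order.
     * n is assumed to be a multiple of nb for simplicity. */
    void cm_to_sb(size_t n, size_t nb, const double *A, size_t lda, double *Asb)
    {
        size_t nblk = n / nb;                 /* blocks per dimension */
        for (size_t jb = 0; jb < nblk; ++jb)
            for (size_t ib = 0; ib < nblk; ++ib) {
                /* Start of block (ib, jb) in the SB array. */
                double *blk = Asb + (jb * nblk + ib) * nb * nb;
                for (size_t j = 0; j < nb; ++j)
                    for (size_t i = 0; i < nb; ++i)
                        blk[i + j * nb] = A[(ib * nb + i) + (jb * nb + j) * lda];
            }
    }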
Part 2 has three pieces:
Piece 1 is on "Block In-place Transpose of a Rectangular Matrix".
Summary:
In-place transposition was an active research area from the late 1950s to the mid-1970s. Gustavson began to study this problem anew as it relates to his current research area, New Data Structures (NDS) for Dense Linear Algebra (DLA). The new results give fast algorithms for in-place transposition in particular, as well as for in-place data transformations in general. These new fast algorithms incorporate cache blocking and NDS as their major high-performance components.
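For orientation only (this is the textbook blocked approach, not the new algorithms of this piece), a cache-blocked in-place transpose of a square column-major matrix looks roughly as follows; it swaps symmetric nb-by-nb tiles so each tile pair is processed while resident in cache:

    #include <stddef.h>

    /* In-place transpose of an n-by-n column-major matrix A (lda >= n),
     * processed in nb-by-nb tiles.  A sketch only. */
    void transpose_inplace_blocked(size_t n, size_t nb, double *A, size_t lda)
    {
        for (size_t jb = 0; jb < n; jb += nb)
            for (size_t ib = jb; ib < n; ib += nb)
                for (size_t j = jb; j < jb + nb && j < n; ++j) {
                    /* On diagonal tiles, start strictly below the diagonal. */
                    size_t istart = (ib == jb) ? j + 1 : ib;
                    for (size_t i = istart; i < ib + nb && i < n; ++i) {
                        double tmp = A[i + j * lda];
                        A[i + j * lda] = A[j + i * lda];
                        A[j + i * lda] = tmp;
                    }
                }
    }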
Piece 2 is an extension of Piece 1.
Summary:
In 2009 Gustavson did new joint work with Lars Karlsson and Bo Kagstrom at Umea University. We are finishing a paper for TOMS on the subject of Piece 1, as well as extending the work of an old, neglected research paper by Pall and Seiden on finding the coset structure of an Abelian group, with applications to matrix transposition. This new work will demonstrate how Piece 2 of Part 1 is connected to 1-D layouts and why NDS are necessary.
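To indicate the kind of 1-D structure involved (a sketch of the classical cycle-following method, not the new joint work), note that for an m-by-n column-major matrix the element at linear index k must move to index k*n mod (m*n - 1); the transpose is the product of the cycles of this permutation, and the coset structure governs those cycles. A clarity-first C version that follows the cycles with an auxiliary visited map:

    #include <stdlib.h>

    /* Classical cycle-following in-place transpose of an m-by-n
     * column-major matrix held in a 1-D array a of length m*n.
     * Indices 0 and m*n-1 are fixed points; every other element at
     * index k moves to (k*n) mod (m*n-1).  This sketch spends m*n
     * bytes on a visited map rather than exploiting the cycle (coset)
     * structure analytically. */
    void transpose_cycles(size_t m, size_t n, double *a)
    {
        size_t len = m * n;
        if (len < 2) return;
        unsigned char *visited = calloc(len, 1);
        if (!visited) return;                      /* allocation failed */

        for (size_t start = 1; start + 1 < len; ++start) {
            if (visited[start]) continue;
            size_t k = start;
            double carry = a[k];
            do {
                size_t dest = (k * n) % (len - 1); /* destination in A^T */
                double tmp = a[dest];
                a[dest] = carry;
                carry = tmp;
                visited[dest] = 1;
                k = dest;
            } while (k != start);
        }
        free(visited);
    }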
Part 3 is on "Three Versions of a High Performance Minimal Storage Cholesky Algorithm that Use New Data Structures" and will be given by Jerzy Wasniewski.
Summary:
We describe new data formats for storing triangular, symmetric, and Hermitian matrices. The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to store triangular, symmetric, and Hermitian matrices waste nearly half the storage space but provide high performance via the use of level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance, as there are no level 3 packed BLAS. We combine the good features of packed and full storage, using the new formats to obtain high performance via level 3 BLAS. Also, these new formats require exactly the same minimal storage as LAPACK packed format. These new formats even outperform the LAPACK full format on some computer platforms. One of these data structures, called Rectangular Full Packed format (RFP), has become part of the LAPACK library, and a suite of Cholesky algorithms using it is also present. The new LAPACK routines are denoted by PF, TF, and HF, standing for Positive definite Full, Triangular Full, and Hermitian positive definite Full. An example of one of the new LAPACK routines for Cholesky factorization is _PFTRF, where _ stands for S, D, C, or Z. _PFTRF has the same functionality as the LAPACK routines _POTRF (full format) and _PPTRF (packed format). Full format and packed format are the standard data formats of dense linear algebra, and they are the major data formats adopted and used in the LAPACK library.
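As a hedged usage sketch, assuming a LAPACK/LAPACKE installation that exposes the RFP routines (LAPACKE_dtrttf wrapping DTRTTF, and LAPACKE_dpftrf wrapping DPFTRF), driving a Cholesky factorization in RFP format might look like this:

    #include <lapacke.h>

    /* Sketch: Cholesky-factor an n-by-n SPD matrix using the RFP format.
     * A_full is the matrix in full (column-major) format; A_rfp must hold
     * n*(n+1)/2 doubles, the same minimal storage as packed format. */
    int cholesky_rfp(lapack_int n, const double *A_full, double *A_rfp)
    {
        /* Pack the lower triangle of full-format A into RFP format. */
        lapack_int info = LAPACKE_dtrttf(LAPACK_COL_MAJOR, 'N', 'L',
                                         n, A_full, n, A_rfp);
        if (info != 0) return (int)info;

        /* Factor in place: on exit A_rfp holds the RFP-format factor L. */
        info = LAPACKE_dpftrf(LAPACK_COL_MAJOR, 'N', 'L', n, A_rfp);
        return (int)info;
    }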
Part 4 is on "Cell Architecture, Matrix Multiply and Cholesky Factorization".
It has three pieces:
Piece 1 of Part 4 is on "A Basic Introduction of the Cell Architecture".
Summary:
The Cell architecture grew from a challenge posed by Sony and Toshiba to provide power-efficient and cost-effective high-performance processing for a wide range of applications, including the most demanding consumer appliance: game consoles. Cell - also known as the Cell Broadband Engine Architecture (CBEA) - is an innovative solution whose design was based on the analysis of a broad range of workloads in areas such as cryptography, graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations and scientific workloads. As an example of innovation that ensures the clients' success, a team from IBM Research joined colleagues from IBM Systems Technology Group, Sony and Toshiba, to lead the development of a novel architecture that represents a breakthrough in performance for consumer applications. IBM Research participated throughout the development, implementation and software enablement of the architecture, ensuring the timely and efficient application of novel ideas and technology into a product that solves real challenges.
Cell is a heterogeneous chip multiprocessor that consists of an IBM 64-bit Power Architecture core, augmented with eight specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture called Synergistic Processor Unit (SPU), which is for data-intensive processing, like that found in cryptography, media and scientific applications. The system is integrated by a coherent on-chip bus.
Piece 2 of Part 4 is on "Matrix Multiply On Cell".
Summary:
Piece 2 of Part 4 is very much related to Part 2. What will emerge is that one cannot transpose a CM (Column Major) rectangular matrix A in place in a fast way. For NDS there are fast ways for a Square Block (SB) matrix A. In fact, a fast way for CM A is to map CM A in place to SB A, then transpose SB A to SB A^T, and finally map SB A^T back to CM A^T. Piece 2 contains several slides on the Cell architecture as well as preparation for Piece 3. Based on Part 1, we need to consider matrix multiply on Cell if we want to be able to do DLAFA on Cell.
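The SB transpose is the fast middle step of that three-step recipe. A minimal C sketch (hypothetical routine names, square matrix, block size dividing n, blocks stored block-column by block-column): block (ib, jb) is swapped with block (jb, ib) and both are transposed in place, so every memory access touches a contiguous nb-by-nb block.

    #include <stddef.h>

    /* Transpose one contiguous nb-by-nb column-major block in place. */
    static void transpose_block(size_t nb, double *blk)
    {
        for (size_t j = 0; j < nb; ++j)
            for (size_t i = j + 1; i < nb; ++i) {
                double tmp = blk[i + j * nb];
                blk[i + j * nb] = blk[j + i * nb];
                blk[j + i * nb] = tmp;
            }
    }

    /* Transpose an n-by-n matrix held in Square Block format:
     * nblk*nblk blocks of size nb-by-nb, each contiguous. */
    void sb_transpose_inplace(size_t nblk, size_t nb, double *Asb)
    {
        size_t bsz = nb * nb;
        for (size_t jb = 0; jb < nblk; ++jb)
            for (size_t ib = jb; ib < nblk; ++ib) {
                double *lo = Asb + (jb * nblk + ib) * bsz;  /* block (ib, jb) */
                double *up = Asb + (ib * nblk + jb) * bsz;  /* block (jb, ib) */
                if (ib == jb) {
                    transpose_block(nb, lo);
                } else {
                    /* Swap the two blocks element-wise, then transpose both. */
                    for (size_t k = 0; k < bsz; ++k) {
                        double tmp = lo[k]; lo[k] = up[k]; up[k] = tmp;
                    }
                    transpose_block(nb, lo);
                    transpose_block(nb, up);
                }
            }
    }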
Piece 3 of Part 4 is on "A Cholesky Factorization on Cell".
Summary:
There is a lot of material to cover here. Fred Gustavson will review a paper on Cholesky factorization on Cell by Kurzak, Buttari, and Dongarra (LAPACK Working Note 184). He will also give improvements to this paper and relate it to his research with Umea and UNI-C / IMM over the last twelve years.