


Automatic Performance Tuning of Sparse Matrix Kernels

Observations and Experience
Performance tuning is tedious and time-consuming work.

Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida,
Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Context
The performance of many applications is dominated by a few computational kernels.

Needle in a haystack—Planar slice of a large space of mathematically equivalent dense matrix multiply implementations: Each square is an implementation color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.

Platform variability—Distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler): Performance tuning for any one platform must be redone for each new platform.

BeBOP
Berkeley Benchmarking and OPtimization Group
www.cs.berkeley.edu/~richie/bebop

An Approach to Automatic Tuning
For each kernel, identify and generate a space of implementations, and search for the best one.

Tuning Sparse Matrix-Vector Multiply
The SPARSITY system (Im & Yelick, 1999) applies the methodology to y = Ax, where A is sparse.

Extensions to New Kernels
Preliminary results for symmetric A, AᵀA, and triangular solve.

Future Work
Integrating with applications, new kernels, further automation, understanding architectural implications.

Cache blocking—Performance at various cache block sizes on a latent semantic indexing matrix.

Sparse triangular solve—Implementation design space includes SSE2 instructions and “switch-to-dense.”

AᵀA times a vector—The matrix A is brought through the memory hierarchy only once.

Applications need fast kernels
• Scientific computing; information retrieval
  o Dense and sparse linear algebra
• Multimedia; audio and image processing
  o Fast transforms
• Databases
  o Sorting
• Security
  o “Cryptokernels” (e.g., modular exponentiation)

Hardware complexity is increasing
• Microprocessor performance difficult to model
• Widening processor-memory gap; deep memory hierarchies

Implementation space
• Conceptually, the set of “interesting” implementations
• Depend on kernel and input
• May vary:
  o Instruction mix and order
  o Memory access patterns
  o Data structures and precisions
  o Mathematical formulation

Searching
• How?
  o Exhaustively (see the timing sketch below)
  o Heuristically, guided by models
• When?
  o Once per kernel and platform
  o At compile time
  o At run-time
  o Hybrid approaches
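For concreteness, the fragment below is a hedged sketch (not SPARSITY code) of the exhaustive approach: time every candidate implementation on a representative problem and keep the fastest. The kernel_fn type and the candidates array are hypothetical stand-ins for generated routines.

    #include <time.h>

    /* Hedged sketch of exhaustive search: time every candidate implementation
       and return the index of the fastest one. */
    typedef void (*kernel_fn)(void);   /* hypothetical generated routine */

    int search_exhaustive(kernel_fn candidates[], int ncand, int trials)
    {
        int best = 0;
        double best_time = 1e300;
        for (int i = 0; i < ncand; i++) {
            clock_t start = clock();
            for (int t = 0; t < trials; t++)
                candidates[i]();                        /* run candidate i */
            double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = i;
            }
        }
        return best;   /* index of the fastest implementation */
    }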

Sparse matrix data structures
• Store only non-zeros
• Data structure overhead + irregular memory access
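To make the data-structure overhead concrete, here is a minimal C sketch, ours rather than SPARSITY's code, of the common compressed sparse row (CSR) format and the reference y = Ax loop.

    /* Compressed sparse row (CSR): only the non-zeros of an m-by-n matrix are
       stored.  The extra index arrays and the indirection through col_ind are
       the data structure overhead and irregular memory access noted above. */
    typedef struct {
        int m, n, nnz;
        int *row_ptr;    /* length m+1: where each row starts in val/col_ind */
        int *col_ind;    /* length nnz: column index of each non-zero        */
        double *val;     /* length nnz: the non-zero values                  */
    } csr_t;

    /* Reference sparse matrix-vector multiply: y = A*x. */
    void spmv_csr(const csr_t *A, const double *x, double *y)
    {
        for (int i = 0; i < A->m; i++) {
            double yi = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                yi += A->val[k] * x[A->col_ind[k]];   /* irregular access to x */
            y[i] = yi;
        }
    }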

Implementation space
• Register blocking: exploit existing dense blocks
• Cache blocking: create better locality in x, y
• Multiple vectors: reuse elements of A
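As one illustration of register blocking, the following is a hedged sketch of a 2×2 block compressed sparse row (BCSR) multiply. SPARSITY generates an unrolled routine like this for each candidate r×c block size; the layout and names here are illustrative, not its generated code.

    /* Hypothetical 2x2 BCSR: brow_ptr/bcol_ind index 2x2 blocks; val stores each
       block's 4 entries contiguously (row-major).  Explicit zeros pad blocks
       that are not fully dense. */
    void spmv_bcsr_2x2(int mb,                 /* number of block rows      */
                       const int *brow_ptr,    /* length mb+1               */
                       const int *bcol_ind,    /* block column indices      */
                       const double *val,      /* 4 values per stored block */
                       const double *x, double *y)
    {
        for (int I = 0; I < mb; I++) {
            double y0 = 0.0, y1 = 0.0;              /* kept in registers */
            for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
                const double *b = val + 4 * k;
                double x0 = x[2 * bcol_ind[k]];
                double x1 = x[2 * bcol_ind[k] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * I]     = y0;
            y[2 * I + 1] = y1;
        }
    }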

Searching example: selecting a register block size
• Off-line: one-time characterization of performance on a dense matrix stored in sparse format, for all r, c
• At run-time: estimate the r×c fill, and choose r, c to maximize (as sketched below)

    Mflops_dense(r,c) / Fill(r,c)
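A hedged sketch of that run-time selection step follows; the profile array, the estimate_fill callback, and the row-major layout are assumptions made for illustration.

    /* Pick (r,c) maximizing Mflops_dense(r,c) / Fill(r,c).  mflops_dense holds
       the off-line machine profile (rmax*cmax entries, row-major);
       estimate_fill() stands for a routine that samples the sparse matrix at
       run time to estimate the fill an r-by-c blocking would introduce. */
    void choose_block_size(int rmax, int cmax, const double *mflops_dense,
                           double (*estimate_fill)(int r, int c),
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        *best_r = *best_c = 1;
        for (int r = 1; r <= rmax; r++)
            for (int c = 1; c <= cmax; c++) {
                double score = mflops_dense[(r - 1) * cmax + (c - 1)]
                               / estimate_fill(r, c);
                if (score > best) {
                    best = score;
                    *best_r = r;
                    *best_c = c;
                }
            }
    }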

This approach has been applied successfully in dense linear algebra (PHiPAC ‘97; ATLAS ‘98) and signal processing (FFTW ‘98; SPIRAL ‘00), among others.

Register blocking profile—One-time characterization of the machine (Mflop/s).

Exploiting symmetry—When A = Aᵀ, only half the matrix need be stored, and each element used twice.

Symmetric sparse matrix-vector multiply
• Only store half of the non-zeros
• Reuse each stored element twice
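A minimal sketch of the idea, assuming the lower triangle (with the diagonal) is stored in CSR; this is illustrative, not the tuned SPARSITY kernel.

    /* Symmetric SpMV with only the lower triangle stored: each stored
       off-diagonal a_ij contributes a_ij*x[j] to y[i] and a_ij*x[i] to y[j],
       so every stored element is reused for its transpose counterpart. */
    void spmv_csr_symmetric(int n, const int *row_ptr, const int *col_ind,
                            const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = 0.0;
        for (int i = 0; i < n; i++) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_ind[k];
                y[i] += val[k] * x[j];
                if (j != i)
                    y[j] += val[k] * x[i];   /* reuse the element twice */
            }
        }
    }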

Sparse triangular solve
• Compute T⁻¹x, where T is a sparse triangular matrix
• Exploit naturally occurring dense structure when T comes from certain applications (LU factorization)
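Below is a hedged sketch of the basic sparse forward substitution, assuming a lower triangular T in CSR with each row's diagonal stored last; the "switch-to-dense" optimization mentioned above would hand the dense trailing triangle that LU factors often have to a dense solver instead of this loop.

    /* Forward substitution x = T^(-1) b for sparse lower triangular T in CSR,
       assuming each row stores its diagonal entry last. */
    void sparse_trsv_lower(int n, const int *row_ptr, const int *col_ind,
                           const double *val, const double *b, double *x)
    {
        for (int i = 0; i < n; i++) {
            double xi = b[i];
            int k = row_ptr[i];
            for (; k < row_ptr[i + 1] - 1; k++)
                xi -= val[k] * x[col_ind[k]];   /* strictly lower part of row i */
            x[i] = xi / val[k];                 /* divide by the diagonal T(i,i) */
        }
    }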

Multiplying AᵀA by a vector
• A can be brought through the memory hierarchy once
• Arises in linear programming problems, among others

(SPARSITY system optimizations also apply to these kernels!)
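A minimal sketch, written for illustration, of the fused kernel y = Aᵀ(Ax) with A in CSR: each row of A is used for the dot product with x and immediately reused for the transpose update, so it passes through the memory hierarchy only once.

    /* Fused y = A^T (A x), with A an m-by-n matrix in CSR. */
    void ata_times_vector(int m, int n, const int *row_ptr, const int *col_ind,
                          const double *val, const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            y[j] = 0.0;
        for (int i = 0; i < m; i++) {
            double t = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                t += val[k] * x[col_ind[k]];        /* t = (A x)[i]       */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                y[col_ind[k]] += val[k] * t;        /* y += t * (row i)^T */
        }
    }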

Exploiting new structures—Symmetric matrix from a fluid flow modeling problem (left); triangular matrix from LU factorization (right).

Why do these four profiles look so different?—We hope to answer this question and understand the implications for current and future architectures and applications.

Future Work
• Integration with applications and existing software libraries
• Creating (via reordering) or exploiting other matrix structures
• New sparse kernels (e.g., powers Aᵏ, triple product AᵀMA)
• Further automation: generating implementation generators
• Understanding performance in terms of underlying architectures

SPARSITY—Performance improvement after run-time block-size selection.

Multiple vectors—Significant speedups are possible when multiplying by several vectors.
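As an illustration, here is a hedged sketch of CSR matrix-vector multiply with NV right-hand sides; NV and the row-major vector layout are assumptions, not SPARSITY's interface.

    #define NV 4   /* illustrative number of vectors */

    /* y = A*X for NV vectors at once: each non-zero of A is loaded once and
       reused across all NV vectors (X[j][v], Y[i][v] are stored row-major). */
    void spmv_csr_multivec(int m, const int *row_ptr, const int *col_ind,
                           const double *val,
                           const double X[][NV], double Y[][NV])
    {
        for (int i = 0; i < m; i++) {
            double acc[NV] = {0.0};
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                double a = val[k];                  /* load the element once */
                const double *xj = X[col_ind[k]];
                for (int v = 0; v < NV; v++)
                    acc[v] += a * xj[v];            /* reuse it NV times */
            }
            for (int v = 0; v < NV; v++)
                Y[i][v] = acc[v];
        }
    }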

Register blocking example—Portion of a sparse matrix with a 4×3 register block grid superimposed.