


Automatic Performance Tuning of Sparse Matrix Kernels

Observations and Experience
Performance tuning is tedious and time-consuming work.

Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida,
Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Context
The performance of many applications is dominated by a few computational kernels.

Needle in a haystack—Planar slice of a large space of mathematically equivalent dense matrix multiply implementations: Each square is an implementation color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.

Platform variability—Distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler): Performance tuning for any one platform must be redone for each new platform.

BeBOP
Berkeley Benchmarking and OPtimization Group
www.cs.berkeley.edu/~richie/bebop

An Approach to Automatic Tuning
For each kernel, identify and generate a space of implementations, and search for the best one.

Tuning Sparse Matrix-Vector Multiply
The SPARSITY system (Im & Yelick, 1999) applies the methodology to y = Ax, where A is sparse.

Extensions to New Kernels
Preliminary results for symmetric A, AᵀA, and triangular solve.

Future Work
Integrating with applications, new kernels, further automation, understanding architectural implications.

Cache blocking—Performance at various cache block sizes on a latent semantic indexing matrix.

Sparse triangular solve—Implementation design space includes SSE2 instructions and “switch-to-dense.”

AᵀA times a vector—The matrix A is brought through the memory hierarchy only once.

Applications need fast kernels
• Scientific computing; information retrieval
  o Dense and sparse linear algebra
• Multimedia; audio and image processing
  o Fast transforms
• Databases
  o Sorting
• Security
  o “Cryptokernels” (e.g., modular exponentiation)

Hardware complexity is increasing
• Microprocessor performance difficult to model
• Widening processor-memory gap; deep memory hierarchies

Implementation space
• Conceptually, the set of “interesting” implementations
• Depend on kernel and input
• May vary:
  o Instruction mix and order
  o Memory access patterns
  o Data structures and precisions
  o Mathematical formulation

Searching
• How?
  o Exhaustively (see the timing sketch below)
  o Heuristically, guided by models
• When?
  o Once per kernel and platform
  o At compile time
  o At run-time
  o Hybrid approaches
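For concreteness, the fragment below is a hedged sketch (not SPARSITY code) of the exhaustive approach: time every candidate implementation on a representative problem and keep the fastest. The kernel_fn type and the candidates array are hypothetical stand-ins for generated routines.

    #include <time.h>

    /* Hedged sketch of exhaustive search: time every candidate implementation
       and return the index of the fastest one. */
    typedef void (*kernel_fn)(void);   /* hypothetical generated routine */

    int search_exhaustive(kernel_fn candidates[], int ncand, int trials)
    {
        int best = 0;
        double best_time = 1e300;
        for (int i = 0; i < ncand; i++) {
            clock_t start = clock();
            for (int t = 0; t < trials; t++)
                candidates[i]();                        /* run candidate i */
            double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = i;
            }
        }
        return best;   /* index of the fastest implementation */
    }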

Sparse matrix data structures
• Store only non-zeros
• Data structure overhead + irregular memory access
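To make the data-structure overhead concrete, here is a minimal C sketch, ours rather than SPARSITY's code, of the common compressed sparse row (CSR) format and the reference y = Ax loop.

    /* Compressed sparse row (CSR): only the non-zeros of an m-by-n matrix are
       stored.  The extra index arrays and the indirection through col_ind are
       the data structure overhead and irregular memory access noted above. */
    typedef struct {
        int m, n, nnz;
        int *row_ptr;    /* length m+1: where each row starts in val/col_ind */
        int *col_ind;    /* length nnz: column index of each non-zero        */
        double *val;     /* length nnz: the non-zero values                  */
    } csr_t;

    /* Reference sparse matrix-vector multiply: y = A*x. */
    void spmv_csr(const csr_t *A, const double *x, double *y)
    {
        for (int i = 0; i < A->m; i++) {
            double yi = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                yi += A->val[k] * x[A->col_ind[k]];   /* irregular access to x */
            y[i] = yi;
        }
    }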

Implementation space
• Register blocking: exploit existing dense blocks
• Cache blocking: create better locality in x, y
• Multiple vectors: reuse elements of A
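As one illustration of register blocking, the following is a hedged sketch of a 2×2 block compressed sparse row (BCSR) multiply. SPARSITY generates an unrolled routine like this for each candidate r×c block size; the layout and names here are illustrative, not its generated code.

    /* Hypothetical 2x2 BCSR: brow_ptr/bcol_ind index 2x2 blocks; val stores each
       block's 4 entries contiguously (row-major).  Explicit zeros pad blocks
       that are not fully dense. */
    void spmv_bcsr_2x2(int mb,                 /* number of block rows      */
                       const int *brow_ptr,    /* length mb+1               */
                       const int *bcol_ind,    /* block column indices      */
                       const double *val,      /* 4 values per stored block */
                       const double *x, double *y)
    {
        for (int I = 0; I < mb; I++) {
            double y0 = 0.0, y1 = 0.0;              /* kept in registers */
            for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
                const double *b = val + 4 * k;
                double x0 = x[2 * bcol_ind[k]];
                double x1 = x[2 * bcol_ind[k] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * I]     = y0;
            y[2 * I + 1] = y1;
        }
    }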

Searching example: selecting a register block size
• Off-line: one-time characterization of performance on a dense matrix stored in sparse format, for all r, c
• At run-time: estimate the r×c fill, and choose r, c to maximize (as sketched below)

    Mflops_dense(r,c) / Fill(r,c)
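A hedged sketch of that run-time selection step follows; the profile array, the estimate_fill callback, and the row-major layout are assumptions made for illustration.

    /* Pick (r,c) maximizing Mflops_dense(r,c) / Fill(r,c).  mflops_dense holds
       the off-line machine profile (rmax*cmax entries, row-major);
       estimate_fill() stands for a routine that samples the sparse matrix at
       run time to estimate the fill an r-by-c blocking would introduce. */
    void choose_block_size(int rmax, int cmax, const double *mflops_dense,
                           double (*estimate_fill)(int r, int c),
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        *best_r = *best_c = 1;
        for (int r = 1; r <= rmax; r++)
            for (int c = 1; c <= cmax; c++) {
                double score = mflops_dense[(r - 1) * cmax + (c - 1)]
                               / estimate_fill(r, c);
                if (score > best) {
                    best = score;
                    *best_r = r;
                    *best_c = c;
                }
            }
    }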

This approach has been applied successfully in dense linear algebra (PHiPAC ‘97; ATLAS ‘98) and signal processing (FFTW ‘98; SPIRAL ‘00), among others.

Register blocking profile—One-time characterization of the machine (Mflop/s).

Exploiting symmetry—When A = Aᵀ, only half the matrix need be stored, and each element used twice.

Symmetric sparse matrix-vector multiply
• Only store half of the non-zeros
• Reuse each stored element twice
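A minimal sketch of the idea, assuming the lower triangle (with the diagonal) is stored in CSR; this is illustrative, not the tuned SPARSITY kernel.

    /* Symmetric SpMV with only the lower triangle stored: each stored
       off-diagonal a_ij contributes a_ij*x[j] to y[i] and a_ij*x[i] to y[j],
       so every stored element is reused for its transpose counterpart. */
    void spmv_csr_symmetric(int n, const int *row_ptr, const int *col_ind,
                            const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = 0.0;
        for (int i = 0; i < n; i++) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_ind[k];
                y[i] += val[k] * x[j];
                if (j != i)
                    y[j] += val[k] * x[i];   /* reuse the element twice */
            }
        }
    }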

Sparse triangular solve
• Compute T⁻¹x, where T is a sparse triangular matrix
• Exploit naturally occurring dense structure when T comes from certain applications (LU factorization)
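Below is a hedged sketch of the basic sparse forward substitution, assuming a lower triangular T in CSR with each row's diagonal stored last; the "switch-to-dense" optimization mentioned above would hand the dense trailing triangle that LU factors often have to a dense solver instead of this loop.

    /* Forward substitution x = T^(-1) b for sparse lower triangular T in CSR,
       assuming each row stores its diagonal entry last. */
    void sparse_trsv_lower(int n, const int *row_ptr, const int *col_ind,
                           const double *val, const double *b, double *x)
    {
        for (int i = 0; i < n; i++) {
            double xi = b[i];
            int k = row_ptr[i];
            for (; k < row_ptr[i + 1] - 1; k++)
                xi -= val[k] * x[col_ind[k]];   /* strictly lower part of row i */
            x[i] = xi / val[k];                 /* divide by the diagonal T(i,i) */
        }
    }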

Multiplying AᵀA by a vector
• A can be brought through the memory hierarchy once
• Arises in linear programming problems, among others

(SPARSITY system optimizations also apply to these kernels!)
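A minimal sketch, written for illustration, of the fused kernel y = Aᵀ(Ax) with A in CSR: each row of A is used for the dot product with x and immediately reused for the transpose update, so it passes through the memory hierarchy only once.

    /* Fused y = A^T (A x), with A an m-by-n matrix in CSR. */
    void ata_times_vector(int m, int n, const int *row_ptr, const int *col_ind,
                          const double *val, const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            y[j] = 0.0;
        for (int i = 0; i < m; i++) {
            double t = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                t += val[k] * x[col_ind[k]];        /* t = (A x)[i]       */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                y[col_ind[k]] += val[k] * t;        /* y += t * (row i)^T */
        }
    }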

Exploiting new structures—Symmetric matrix from a fluid flow modeling problem (left); triangular matrix from LU factorization (right).

Why do these four profiles look so different?—We hope to answer this question and understand the implications for current and future architectures and applications.

Future Work
• Integration with applications and existing software libraries
• Creating (via reordering) or exploiting other matrix structures
• New sparse kernels (e.g., powers Aᵏ, triple product AᵀMA)
• Further automation: generating implementation generators
• Understanding performance in terms of underlying architectures

SPARSITY—Performance improvement after run-time block-size selection.

Multiple vectors—Significant speedups are possible when multiplying by several vectors.
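As an illustration, here is a hedged sketch of CSR matrix-vector multiply with NV right-hand sides; NV and the row-major vector layout are assumptions, not SPARSITY's interface.

    #define NV 4   /* illustrative number of vectors */

    /* y = A*X for NV vectors at once: each non-zero of A is loaded once and
       reused across all NV vectors (X[j][v], Y[i][v] are stored row-major). */
    void spmv_csr_multivec(int m, const int *row_ptr, const int *col_ind,
                           const double *val,
                           const double X[][NV], double Y[][NV])
    {
        for (int i = 0; i < m; i++) {
            double acc[NV] = {0.0};
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                double a = val[k];                  /* load the element once */
                const double *xj = X[col_ind[k]];
                for (int v = 0; v < NV; v++)
                    acc[v] += a * xj[v];            /* reuse it NV times */
            }
            for (int v = 0; v < NV; v++)
                Y[i][v] = acc[v];
        }
    }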

Register blocking example—Portion of a sparse matrix with a 4×3 register block grid superimposed.