
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

Moritz Kreutzer, Georg Hager, Gerhard Wellein (Erlangen Regional Computing Center, Friedrich-Alexander University of Erlangen-Nuremberg); Andreas Pieper, Andreas Alvermann, Holger Fehske (Institute of Physics, Ernst Moritz Arndt University of Greifswald)

Platform for Advanced Scientific Computing Conference 2015 (PASC 2015), June 2, 2015, Minisymposium 16 "Software for Exascale Computing". This work was supported (in part) by the German Research Foundation (DFG) through the Priority Programs 1648 "Software for Exascale Computing" (SPPEXA), project ESSEX, and 1459 "Graphene".

Prologue (I) The ESSEX project

Holger Fehske (Institute for Physics, U Greifswald), Bruno Lang (Applied Computer Science, U Wuppertal), Achim Basermann (Simulation & SW Technology, DLR), Georg Hager (Erlangen Regional Computing Center), Gerhard Wellein (Computer Science, U Erlangen)

Holistic Performance Engineering across all layers: applications, computational algorithms, building blocks.

Equipping Sparse Solvers for Exascale: Motivation

Hardware: fault tolerance, energy efficiency, new levels of parallelism.

Quantum physics applications: extremely large sparse matrices; eigenvalues, spectral properties, time evolution.

ESSEX applications: graphene, topological insulators, ...

ESSEX scope: quantum physics/chemistry; sparse eigensolvers, preconditioners, spectral methods; fault-tolerance concepts and programming for extreme parallelism.

Goal: an Exascale Sparse Solver Repository (ESSR).


Prologue (II)

What is the Kernel Polynomial Method, and why heterogeneous computing?

The Kernel Polynomial Method (KPM)

Approximate the complete eigenvalue spectrum of a large sparse matrix


$H\,\mathbf{x} = \lambda\,\mathbf{x}$

Good approximation to the full spectrum (e.g., density of states)

Large, sparse matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k, \ldots, \lambda_{n-1}, \lambda_n$

A. Weiße, G. Wellein, A. Alvermann, H. Fehske, "The kernel polynomial method", Rev. Mod. Phys., vol. 78, p. 275, 2006.
E. Di Napoli, E. Polizzi, Y. Saad, "Efficient estimation of eigenvalue counts in an interval", preprint, http://arxiv.org/abs/1308.4275.
O. Bhardwaj, Y. Ineichen, C. Bekas, A. Curioni, "Highly scalable linear time estimation of spectrograms - a tool for very large scale data analysis", SC13 poster.
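For orientation, the "Chebyshev polynomials and moments" mentioned on the following slides refer to the standard KPM expansion of the density of states from the Weiße et al. review cited above (with the matrix rescaled so that its spectrum lies in $[-1,1]$, $T_m$ the Chebyshev polynomials, $g_m$ kernel damping factors, and $R$ random vectors $|r\rangle$):

$\rho(\lambda) \approx \dfrac{1}{\pi\sqrt{1-\lambda^2}} \left[ g_0\mu_0 + 2\sum_{m=1}^{M-1} g_m \mu_m T_m(\lambda) \right], \qquad \mu_m = \dfrac{1}{N}\,\mathrm{Tr}\left[T_m(H)\right] \approx \dfrac{1}{RN}\sum_{r=1}^{R} \langle r|T_m(H)|r\rangle .$

The moments $\mu_m$ are obtained from the three-term recurrence $|v_{m+1}\rangle = 2H|v_m\rangle - |v_{m-1}\rangle$, which is exactly the loop the algorithmic analysis below dissects.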

Why optimize for heterogeneous systems? One third of TOP500 performance stems from accelerators.

But: there is little truly heterogeneous software (using both CPUs and accelerators).


The Kernel Polynomial Method

Algorithmic Analysis

The Kernel Polynomial Method: compute Chebyshev polynomials and moments.

Kernels involved: sparse matrix-vector multiply, scaled vector addition, vector scale, scaled vector addition, vector norm, dot product → combined into one augmented sparse matrix multiple vector multiply.

Holistic view: optimize across all software layers.


Application: loop over random initial states.
Algorithm: loop over moments.
Building blocks: (sparse) linear algebra library; augmented sparse matrix vector multiply (see the sketch below).
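To make the fusion concrete, the following is a minimal single-threaded C sketch of one fused KPM step for a plain CRS matrix in real double precision. It only illustrates the idea; the kernel analyzed in this talk operates on complex values, block vectors, and SELL-C-σ storage, and all names here (crs_matrix, kpm_fused_step, eta0, eta1) are invented for the example.

#include <stddef.h>

/* Plain CRS matrix (illustrative; the real kernel uses SELL-C-sigma and complex values). */
typedef struct {
    int n;            /* number of rows           */
    int *rowptr;      /* row pointers, length n+1 */
    int *col;         /* column indices           */
    double *val;      /* nonzero values           */
} crs_matrix;

/*
 * One fused KPM step:  y = alpha*(A*x) + beta*x + gamma*y,
 * accumulating in the same sweep the two scalar products needed
 * for the Chebyshev moments (here called eta0 = <y,y>, eta1 = <y,x>).
 * Matrix and vectors are thus traversed only once per step.
 */
static void kpm_fused_step(const crs_matrix *A, const double *x, double *y,
                           double alpha, double beta, double gamma,
                           double *eta0, double *eta1)
{
    double d0 = 0.0, d1 = 0.0;
    for (int i = 0; i < A->n; ++i) {
        double tmp = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j)
            tmp += A->val[j] * x[A->col[j]];                  /* SpMV part        */
        double yi = alpha * tmp + beta * x[i] + gamma * y[i]; /* scaled additions */
        y[i] = yi;
        d0 += yi * yi;                                        /* <y,y>            */
        d1 += yi * x[i];                                      /* <y,x>            */
    }
    *eta0 += d0;
    *eta1 += d1;
}

In the vanilla formulation each of these operations is a separate library call and therefore a separate sweep over memory; fusing them is what reduces the code balance in the analysis on the next slide, and a block-vector version additionally amortizes the matrix traffic over R right-hand sides.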

Analysis of the Algorithmic Optimization

• Minimum code balance of the vanilla algorithm (complex double precision values, 32-bit indices, 13 non-zeros per row; application: topological insulators):
  $B_\mathrm{vanilla} = 3.33$ bytes/flop   (B = inverse computational intensity)

• Identified bottleneck: memory bandwidth → decrease memory transfers to alleviate the bottleneck.

• Algorithmic optimizations reduce the code balance:
  $B_\mathrm{aug\_spmv} = 2.23$ bytes/flop   (kernel fusion)
  $B_\mathrm{aug\_spmmv}(R) = 1.88/R + 0.33$ bytes/flop   (put R vectors in a block)


See also: W. D. Gropp, D. K. Kaushik, D. E. Keyes, B. F. Smith, "Towards realistic performance bounds for implicit CFD codes", in Proceedings of Parallel CFD'99, Elsevier, 1999, p. 233.
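The $R$-dependence of the blocked code balance simply reflects which traffic blocking amortizes. Schematically (a sketch under generic assumptions, not the exact derivation from the talk; $N_\mathrm{nz}$ = non-zeros per row, $F$ = flops per row and right-hand side, $V_\mathrm{m}$, $V_\mathrm{i}$, $V_\mathrm{v}$ = bytes per matrix value, index, and vector entry, $c$ = number of vector streams touched per row):

$B_\mathrm{aug\_spmmv}(R) \approx \dfrac{N_\mathrm{nz}(V_\mathrm{m}+V_\mathrm{i})}{R\,F} + \dfrac{c\,V_\mathrm{v}}{F}$

The first term is the matrix data, which is read once per block of $R$ vectors; the second is the vector traffic, which is paid for every right-hand side and therefore sets the 0.33 bytes/flop floor quoted above.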

Consequences of Algorithmic Optimization
• Mitigation of the relevant bottleneck → expected speedup.
• Other bottlenecks become relevant → the achieved speedup may fall short of $B_\mathrm{vanilla}/B_\mathrm{aug\_spmmv}$.
• Block vectors are best stored interleaved → may impose larger changes on the codebase.
• aug_spmmv() is not part of standard libraries → implementation by hand is necessary.


CPU roofline performance model


S. Williams, A. Waterman, D. Patterson, "Roofline: An insightful visual performance model for multicore architectures", Commun. ACM, vol. 52, p. 65, 2009.

$P = b/B$ [Gflop/s]: performance limit for bandwidth-bound code, with $b$ = maximum memory bandwidth (≈ 50 GB/s here) and $B$ = code balance. The traffic ratio $\Omega$ = (actual data transfers) / (minimal data transfers) measures how close the implementation comes to the minimum traffic assumed in $B$.
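Plugging the slide's numbers into the roofline limit gives a feel for the headroom; this is illustrative arithmetic only (with $b \approx 50$ GB/s and, for the blocked kernel, an arbitrarily chosen $R = 32$):

$P_\mathrm{vanilla} \approx 50/3.33 \approx 15$ Gflop/s
$P_\mathrm{aug\_spmv} \approx 50/2.23 \approx 22$ Gflop/s
$P_\mathrm{aug\_spmmv}(R{=}32) \approx 50/(1.88/32 + 0.33) \approx 129$ Gflop/s

At the last value the kernel is no longer strictly memory bound and in-core limits take over, which is the "decoupling from main memory bandwidth" referred to in the conclusions.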

Implementation

How to harness a heterogeneous machine in an efficient way?

Implementation: the algorithmic optimizations promise a speedup; we "merely" need an efficient implementation. Data parallelism or task parallelism?
• MAGMA: task parallelism between devices, http://icl.cs.utk.edu/magma
• Kernel fusion leaves little room for task parallelism across devices.
→ A data-parallel approach suits our needs.


Implementation: data-parallel heterogeneous work distribution (a minimal sketch follows).
• Static work distribution by matrix rows/entries.
• Device workload ∝ device performance.
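As an illustration of such a static, performance-weighted split (not the actual implementation; the function and names are invented for this example), the row range can be partitioned in proportion to a per-device performance estimate:

/* Split [0, nrows) into contiguous chunks proportional to per-device weights,
 * e.g. measured SpMV Gflop/s of the CPU socket and the GPU. Illustrative only. */
static void partition_rows(int nrows, const double *weight, int ndevices,
                           int *first_row /* length ndevices + 1 */)
{
    double total = 0.0;
    for (int d = 0; d < ndevices; ++d)
        total += weight[d];

    first_row[0] = 0;
    double assigned = 0.0;
    for (int d = 0; d < ndevices; ++d) {
        assigned += weight[d];
        /* cumulative share of the row range, rounded to an integer row index */
        first_row[d + 1] = (int)(assigned / total * nrows + 0.5);
    }
    first_row[ndevices] = nrows;  /* last chunk ends exactly at the matrix size */
}

Device d then owns rows [first_row[d], first_row[d+1]); a refinement would weight by non-zeros instead of rows, as the "rows/entries" above hints.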

SELL-C-σ sparse matrix storage format:
• Unified format for all relevant devices (CPU, GPU, Xeon Phi).
• Allows for runtime exchange of matrix data (dynamic load balancing: future work).


M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units", SIAM J. Sci. Comput., vol. 36, p. C401, 2014.
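For readers unfamiliar with the format, here is a stripped-down sketch of the SELL-C-σ idea: rows are grouped into chunks of C consecutive rows, each chunk is padded to its longest row, and values are stored column-major within the chunk, which is what makes the inner loop SIMD- and warp-friendly. The sorting scope σ is omitted, and this is not the library's actual data structure.

/* Simplified SELL-C-like storage; padding entries carry val = 0.0, col = 0. */
typedef struct {
    int nrows, nchunks, C;  /* C = chunk height (SIMD width or warp size)  */
    int *chunk_ptr;         /* start offset of each chunk in val[] / col[] */
    int *chunk_len;         /* padded row length of each chunk             */
    double *val;            /* values, column-major within a chunk         */
    int *col;               /* column indices, same layout                 */
} sell_c_matrix;

/* y += A*x for the layout above; y must be zero-initialized by the caller. */
static void sell_c_spmv(const sell_c_matrix *A, const double *x, double *y)
{
    for (int c = 0; c < A->nchunks; ++c) {
        int base = A->chunk_ptr[c];
        for (int j = 0; j < A->chunk_len[c]; ++j) {
            for (int r = 0; r < A->C; ++r) {      /* contiguous, vectorizable */
                int row = c * A->C + r;
                if (row < A->nrows) {
                    int idx = base + j * A->C + r;
                    y[row] += A->val[idx] * x[A->col[idx]];
                }
            }
        }
    }
}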

Performance results

Does all this really pay off?

Single-node Heterogeneous Performance


SNB: Intel Xeon "Sandy Bridge"; K20X: NVIDIA Tesla K20X. Complex double precision matrix/vectors (topological insulator).

Heterogeneous efficiency

Large-scale Heterogeneous Performance


Cray XC30 "Piz Daint"
• 5272 nodes, each with 1 Intel Sandy Bridge CPU and 1 NVIDIA K20X GPU
• Peak: 7.8 PF/s; LINPACK: 6.3 PF/s
• Largest system in Europe

Achieved: 0.53 PF/s (11% of LINPACK) on O(10^10) matrix rows.

Thanks to CSCS (O. Schenk, T. Schulthess) for granting access and compute time.

Only a single ALLREDUCE at the end
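That single collective is the global reduction of the locally accumulated moments. A minimal sketch of this step (assuming the moments were accumulated into a local array of doubles, which is an assumption of this example, not code from the talk):

#include <mpi.h>

/* Sum the locally accumulated KPM moments across all ranks in one collective;
 * afterwards every rank holds the global moments. */
static void reduce_moments(double *eta, int num_moments, MPI_Comm comm)
{
    MPI_Allreduce(MPI_IN_PLACE, eta, num_moments, MPI_DOUBLE, MPI_SUM, comm);
}

The slide's point is that this reduction is the only global synchronization in the entire computation.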

Epilogue

Try it out (If you want)

Download our building block library & KPM application: http://tiny.cc/ghost

• MPI + OpenMP + SIMD + CUDA
• Transparent data-parallel heterogeneous execution
• Affinity-aware task parallelism (checkpointing, communication hiding, etc.)
• Support for block vectors
• Automatic code generation for common block vector sizes
• Hand-tuned tall & skinny dense matrix kernels
• Fused kernels (arbitrarily "augmented" SpMMV)
• SELL-C-σ heterogeneous sparse matrix format
• Various sparse eigensolvers implemented and downloadable

ESSEX project webpage: http://blogs.fau.de/essex


GHOST: General, Hybrid, and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions
• Model-guided performance engineering of KPM on CPU and GPU
• Decoupling from main memory bandwidth
• Optimized node-level performance
• Embedding into massively parallel application code
• Fully heterogeneous petascale performance for a sparse solver


Outlook
• Applications besides KPM
• Automatic (and dynamic) load balancing
• Optimized GPU-CPU-MPI communication
• Further optimization techniques (cache blocking, ...)
• Performance engineering for Xeon Phi (already supported)



ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 9: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Analysis of the Algorithmic Optimization

bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators

119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)

bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck

bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block

June 2 2015 9 MS Software for Exascale Computing PASC 2015

See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233

Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup

bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959

bull Block vectors are best stored interleaved May impose larger changes to the codebase

bull aug_spmmv() no part of standard libraries Implementation by hand is necessary

June 2 2015 10 MS Software for Exascale Computing PASC 2015

CPU roofline performance model

June 2 2015 11 MS Software for Exascale Computing PASC 2015

S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009

119927 = 119939119913

Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913

119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913

Implementation

How to harness a heterogeneous machine in an efficient way

Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma

bull Kernel fusion Task parallelism

Data-parallel approach suits our needs

June 2 2015 13 MS Software for Exascale Computing PASC 2015

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 10: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup

bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959

bull Block vectors are best stored interleaved May impose larger changes to the codebase

bull aug_spmmv() no part of standard libraries Implementation by hand is necessary

June 2 2015 10 MS Software for Exascale Computing PASC 2015

CPU roofline performance model

June 2 2015 11 MS Software for Exascale Computing PASC 2015

S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009

119927 = 119939119913

Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913

119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913

Implementation

How to harness a heterogeneous machine in an efficient way

Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma

bull Kernel fusion Task parallelism

Data-parallel approach suits our needs

June 2 2015 13 MS Software for Exascale Computing PASC 2015

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 11: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

CPU roofline performance model

June 2 2015 11 MS Software for Exascale Computing PASC 2015

S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009

119927 = 119939119913

Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913

119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913

Implementation

How to harness a heterogeneous machine in an efficient way

Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma

bull Kernel fusion Task parallelism

Data-parallel approach suits our needs

June 2 2015 13 MS Software for Exascale Computing PASC 2015

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 12: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Implementation

How to harness a heterogeneous machine in an efficient way

Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma

bull Kernel fusion Task parallelism

Data-parallel approach suits our needs

June 2 2015 13 MS Software for Exascale Computing PASC 2015

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 13: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma

bull Kernel fusion Task parallelism

Data-parallel approach suits our needs

June 2 2015 13 MS Software for Exascale Computing PASC 2015

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 14: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance

SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices

(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data

(dynamic load balancing future work)

June 2 2015 14 MS Software for Exascale Computing PASC 2015

M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 15: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Performance results

Does all this really pay off

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 16: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Single-node Heterogeneous Performance

June 2 2015 16 MS Software for Exascale Computing PASC 2015

SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)

Heterogeneous efficiency

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 17: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Large-scale Heterogeneous Performance

June 2 2015 17 MS Software for Exascale Computing PASC 2015

CRAY XC30 ndash Piz Daint

bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x

bull Peak 78 PFs bull LINPACK 63 PFs

bull Largest system in Europe

053 PFs (11 of LINPACK)

O(1010) matrix rows

Thanks to CSCSO SchenkT Schulthess for granting access and compute time

Only a single ALLREDUCE at the end

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 18: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Epilogue

Try it out (If you want)

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 19: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Download our building block library amp KPM application httptinyccghost

bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors

bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels

bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable

ESSEX project webpage httpblogsfaudeessex

June 2 2015 19 MS Software for Exascale Computing PASC 2015

General Hybrid and Optimized Sparse Toolkit

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 20: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Backup Slides

Only in the unlikely case I was too fast

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 21: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Conclusions bull Model-guided performance engineering of KPM on

CPU and GPU

bull Decoupling from main memory bandwidth

bull Optimized node-level performance

bull Embedding into massively-parallel application code

bull Fully heterogeneous peta-scale performance for a sparse solver

June 2 2015 21 MS Software for Exascale Computing PASC 2015

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22
Page 22: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,

Outlook bull Applications besides KPM

bull Automatic (amp dynamic) load balancing

bull Optimized GPU-CPU-MPI communication

bull Further optimization techniques (cache blocking )

bull Performance engineering for Xeon Phi (already

supported

June 2 2015 22 MS Software for Exascale Computing PASC 2015

  • Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
  • Prologue (I)
  • Equipping Sparse Solvers for Exascale Motivation
  • Prologue (II)
  • Foliennummer 5
  • Foliennummer 6
  • The Kernel Polynomial Method
  • Foliennummer 8
  • Foliennummer 9
  • Foliennummer 10
  • Foliennummer 11
  • Implementation
  • Foliennummer 13
  • Foliennummer 14
  • Performance results
  • Foliennummer 16
  • Foliennummer 17
  • Epilogue
  • Foliennummer 19
  • Backup Slides
  • Foliennummer 21
  • Foliennummer 22