TRANSCRIPT
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
Moritz Kreutzer, Georg Hager, Gerhard Wellein (Erlangen Regional Computing Center, Friedrich-Alexander University of Erlangen-Nuremberg); Andreas Pieper, Andreas Alvermann, Holger Fehske (Institute of Physics, Ernst Moritz Arndt University of Greifswald)
Platform for Advanced Scientific Computing Conference 2015 (PASC 2015), Minisymposium 16 "Software for Exascale Computing". This work was supported (in part) by the German Research Foundation (DFG) through the Priority Programs 1648 "Software for Exascale Computing" (SPPEXA), under project ESSEX, and 1459 "Graphene".
Prologue (I): The ESSEX project
Holger Fehske (Institute for Physics, U Greifswald), Bruno Lang (Applied Computer Science, U Wuppertal), Achim Basermann (Simulation & SW Technology, DLR), Georg Hager (Erlangen Regional Computing Center), Gerhard Wellein (Computer Science, U Erlangen)
Holistic Performance Engineering across all layers: Applications / Computational Algorithms / Building Blocks
Equipping Sparse Solvers for Exascale: Motivation
• Hardware: fault tolerance, energy efficiency, new levels of parallelism
• Quantum physics applications: extremely large sparse matrices; eigenvalues, spectral properties, time evolution
→ Exascale Sparse Solver Repository (ESSR)
ESSEX work areas:
• Applications: graphene, topological insulators, …; quantum physics & chemistry
• Computational algorithms: sparse eigensolvers, preconditioners, spectral methods
• Building blocks: FT concepts, programming for extreme parallelism
Prologue (II)
What is the Kernel Polynomial Method, and why heterogeneous computing?
The Kernel Polynomial Method (KPM)
Approximate the complete eigenvalue spectrum of a large sparse matrix
$H\,\mathbf{x} = \lambda\,\mathbf{x}$, with $H$ large and sparse
Good approximation to the full spectrum $\lambda_1, \lambda_2, \ldots, \lambda_k, \ldots, \lambda_{n-1}, \lambda_n$ (e.g., density of states)
A. Weiße, G. Wellein, A. Alvermann, H. Fehske, "The kernel polynomial method," Rev. Mod. Phys., vol. 78, p. 275, 2006.
E. di Napoli, E. Polizzi, Y. Saad, "Efficient estimation of eigenvalue counts in an interval," preprint, http://arxiv.org/abs/1308.4275.
O. Bhardwaj, Y. Ineichen, C. Bekas, and A. Curioni, "Highly scalable linear time estimation of spectrograms - a tool for very large scale data analysis," SC13 poster.
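"Approximate the complete eigenvalue spectrum" concretely means the density of states. Following the cited review by Weiße et al. (the notation here is mine, not from the slide): with $\widetilde{H}$ the Hamiltonian rescaled so its spectrum lies in $[-1,1]$, $T_m$ the Chebyshev polynomials, and $|r\rangle$ random vectors,

```latex
\rho(\lambda) = \frac{1}{n}\sum_{k=1}^{n}\delta(\lambda-\lambda_k),
\qquad
\mu_m = \int_{-1}^{1}\rho(\lambda)\,T_m(\lambda)\,\mathrm{d}\lambda
      = \frac{1}{n}\,\mathrm{Tr}\bigl[T_m(\widetilde{H})\bigr]
      \approx \frac{1}{n\,R}\sum_{r=1}^{R}\langle r|\,T_m(\widetilde{H})\,|r\rangle
```

The stochastic trace estimate is what makes the method a loop over random initial states with a sparse matrix-vector multiply at its core.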
Why optimize for heterogeneous systems? One third of TOP500 performance stems from accelerators.
But: few truly heterogeneous codes exist (using both CPUs and accelerators).
The Kernel Polynomial Method: Algorithmic Analysis
Kernels in the naive formulation: sparse matrix-vector multiply, scaled vector addition, vector scale, scaled vector addition, vector norm, dot product
→ fused into one augmented sparse matrix multiple-vector multiply
The Kernel Polynomial Method: compute Chebyshev polynomials and moments
Holistic view: optimize across all software layers
• Application: loop over random initial states
• Algorithm: loop over moments
• Building blocks: (sparse) linear algebra library → augmented sparse matrix (multiple) vector multiply
(a sketch of the fused building block follows below)
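To make the fusion concrete, here is a minimal sketch of one Chebyshev iteration with the reductions folded into the matrix-vector sweep. It is not the authors' GHOST code: it uses plain CSR instead of SELL-C-σ and a single vector instead of an R-vector block, and all names are mine.

```c
#include <complex.h>
#include <stddef.h>

/* Minimal CSR matrix with complex double entries and 32-bit column
 * indices, matching the data types quoted on the code-balance slide. */
typedef struct {
    size_t n;                  /* number of rows  */
    const size_t *rowptr;      /* n+1 row offsets */
    const int *col;            /* column indices  */
    const double complex *val; /* non-zero values */
} csr_t;

/* One fused ("augmented") KPM sweep: y <- 2*H*x - y, while accumulating
 * <y|x> and <y|y> in the same pass, so H, x, and y are streamed through
 * memory only once instead of once per BLAS-1 kernel. The two scalars
 * feed the Chebyshev moments eta_{2m+1} and eta_{2m}. */
static void kpm_fused_sweep(const csr_t *H,
                            const double complex *x, double complex *y,
                            double complex *dot_yx, double *dot_yy)
{
    double complex d = 0.0;
    double s = 0.0;
    for (size_t i = 0; i < H->n; i++) {
        double complex tmp = 0.0;
        for (size_t j = H->rowptr[i]; j < H->rowptr[i + 1]; j++)
            tmp += H->val[j] * x[H->col[j]];  /* sparse MVM row       */
        y[i] = 2.0 * tmp - y[i];              /* Chebyshev recurrence */
        d += conj(y[i]) * x[i];               /* fused dot product    */
        s += creal(conj(y[i]) * y[i]);        /* fused squared norm   */
    }
    *dot_yx = d;
    *dot_yy = s;
}
```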
Analysis of the Algorithmic Optimization
• Minimum code balance of the vanilla algorithm (complex double precision values, 32-bit indices, 13 non-zeros per row; application: topological insulators):
  $B_{\mathrm{vanilla}} = 3.33\,\mathrm{B/F}$ (B = code balance = inverse computational intensity)
• Identified bottleneck: memory bandwidth → decrease memory transfers to alleviate the bottleneck
• Algorithmic optimizations reduce the code balance:
  $B_{\mathrm{aug\_spmv}} = 2.23\,\mathrm{B/F}$ (kernel fusion)
  $B_{\mathrm{aug\_spmmv}}(R) = (1.88/R + 0.33)\,\mathrm{B/F}$ (put R vectors in a block)
See also: W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Towards realistic performance bounds for implicit CFD codes," in Proceedings of Parallel CFD'99, Elsevier, 1999, p. 233.
Consequences of Algorithmic Optimization
• Mitigation of the relevant bottleneck → expected speedup
• Other bottlenecks become relevant → the achieved speedup may fall short of $B_{\mathrm{vanilla}}/B_{\mathrm{aug\_spmmv}}$
• Block vectors are best stored interleaved → may impose larger changes to the codebase
• aug_spmmv() is not part of standard libraries → implementation by hand is necessary
CPU roofline performance model
S. Williams, A. Waterman, D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, p. 65, 2009.
$P = b/B$ [Gflop/s]: performance limit for bandwidth-bound code
b = max. bandwidth = 50 GB/s; B = code balance; $\Omega = \dfrac{\text{actual data transfers}}{\text{minimum data transfers}}$
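Plugging the slide's numbers into the model (b = 50 GB/s, the code balances from the previous slide, and assuming Ω = 1, i.e., minimal data traffic) gives the bandwidth-limited performance bounds; a few lines of C to tabulate them:

```c
#include <stdio.h>

/* Roofline bounds P = b/B for the code balances quoted above,
 * assuming Omega = 1 (actual traffic equals the minimum). */
int main(void)
{
    const double b = 50.0;            /* GB/s, max. memory bandwidth */
    const double B_vanilla = 3.33;    /* B/F, vanilla algorithm      */
    const double B_fused   = 2.23;    /* B/F, after kernel fusion    */
    printf("vanilla: %.1f Gflop/s, fused: %.1f Gflop/s\n",
           b / B_vanilla, b / B_fused);
    for (int R = 1; R <= 8; R *= 2) { /* blocks of R vectors         */
        double B_block = 1.88 / R + 0.33;
        printf("R=%d: %.1f Gflop/s\n", R, b / B_block);
    }
    return 0;
}
```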
Implementation
How to harness a heterogeneous machine in an efficient way?
Algorithmic optimizations lead to a potential speedup; we "merely" need an efficient implementation. Data or task parallelism?
• MAGMA: task parallelism between devices, http://icl.cs.utk.edu/magma
• Kernel fusion vs. task parallelism
→ the data-parallel approach suits our needs
Implementation: data-parallel heterogeneous work distribution
• Static work distribution by matrix rows/entries
• Device workload ∝ device performance (see the sketch below)
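A hedged sketch of what such a static split could look like; the device weights are assumptions (e.g., measured per-device SpMV bandwidth), since the slide only states that workload should be proportional to performance:

```c
#include <stdio.h>

/* Static, performance-weighted partitioning of matrix rows between one
 * CPU and one GPU; the weights are assumed measured device bandwidths. */
int main(void)
{
    long   n_rows = 100000000L;         /* global number of matrix rows */
    double w_cpu  = 50.0;               /* GB/s, assumed CPU weight     */
    double w_gpu  = 170.0;              /* GB/s, assumed GPU weight     */

    long cpu_rows = (long)(n_rows * (w_cpu / (w_cpu + w_gpu)));
    long gpu_rows = n_rows - cpu_rows;  /* GPU takes the remainder      */

    printf("CPU: %ld rows, GPU: %ld rows\n", cpu_rows, gpu_rows);
    return 0;
}
```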
SELL-C-σ sparse matrix storage format (illustrated after the reference below):
• Unified format for all relevant devices (CPU, GPU, Xeon Phi)
• Allows for runtime exchange of matrix data (dynamic load balancing: future work)
M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM J. Sci. Comput., vol. 36, p. C401, 2014.
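The gist of SELL-C-σ, as described in the cited paper, in a stripped-down sketch: real-valued for brevity (the application uses complex), without the σ-scope sorting step, which is assumed to have permuted the rows already; the names are mine.

```c
/* SELL-C-sigma: rows are sorted by length within scopes of sigma rows,
 * packed into chunks of C consecutive rows, padded to the longest row
 * of each chunk, and stored chunk-column-major. The inner i-loop below
 * then runs over C rows in lockstep, which maps to SIMD lanes on CPUs
 * and to threads of a warp on GPUs; one format serves all devices. */
typedef struct {
    int C;                  /* chunk height (SIMD width / warp size) */
    int n_chunks;           /* number of chunks (rows padded to C)   */
    const int *chunk_ptr;   /* start offset of each chunk in val/col */
    const int *chunk_len;   /* padded row length of each chunk       */
    const int *col;         /* column indices; padding points to 0   */
    const double *val;      /* values; padding entries are 0.0       */
} sell_t;

/* y += A*x; y must be initialized by the caller. */
static void sell_spmv(const sell_t *A, const double *x, double *y)
{
    for (int c = 0; c < A->n_chunks; c++)
        for (int j = 0; j < A->chunk_len[c]; j++)
            for (int i = 0; i < A->C; i++) {  /* lockstep over C rows */
                int idx = A->chunk_ptr[c] + j * A->C + i;
                y[c * A->C + i] += A->val[idx] * x[A->col[idx]];
            }
}
```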
Performance results
Does all this really pay off?
Single-node Heterogeneous Performance
SNB: Intel Xeon Sandy Bridge; K20X: Nvidia Tesla K20X. Complex double precision matrix/vectors (topological insulator).
Heterogeneous efficiency
Large-scale Heterogeneous Performance
CRAY XC30 "Piz Daint"
• 5272 nodes, each with 1 Intel Sandy Bridge CPU and 1 NVIDIA K20X GPU
• Peak: 7.8 PF/s; LINPACK: 6.3 PF/s
• Largest system in Europe
Achieved: 0.53 PF/s (11% of LINPACK) on a problem with O(10^10) matrix rows
Thanks to CSCS / O. Schenk / T. Schulthess for granting access and compute time.
Only a single ALLREDUCE at the end
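That communication structure is easy to sketch: besides the point-to-point halo exchanges inside the SpMV, the Chebyshev moments are accumulated locally and reduced globally exactly once. A minimal MPI sketch (the moment count M is an assumption):

```c
#include <mpi.h>

#define M 2048                        /* assumed number of moments */

int main(int argc, char **argv)
{
    static double eta[M];             /* locally accumulated moments */
    MPI_Init(&argc, &argv);

    /* ... KPM iterations: SpMV halo exchanges + local updates of eta ... */

    MPI_Allreduce(MPI_IN_PLACE, eta, M, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);    /* the single ALLREDUCE at the end */

    MPI_Finalize();
    return 0;
}
```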
Epilogue
Try it out! (If you want.)
Download our building-block library GHOST (General, Hybrid, and Optimized Sparse Toolkit) & KPM application: http://tiny.cc/ghost
• MPI + OpenMP + SIMD + CUDA
• Transparent data-parallel heterogeneous execution
• Affinity-aware task parallelism (checkpointing, communication hiding, etc.)
• Support for block vectors
• Automatic code generation for common block vector sizes
• Hand-tuned tall skinny dense matrix kernels
• Fused kernels (arbitrarily "augmented" SpMMV)
• SELL-C-σ heterogeneous sparse matrix format
• Various sparse eigensolvers implemented and downloadable
ESSEX project webpage: http://blogs.fau.de/essex
Backup Slides
Only in the unlikely case I was too fast
Conclusions
• Model-guided performance engineering of KPM on CPU and GPU
• Decoupling from main memory bandwidth
• Optimized node-level performance
• Embedding into massively parallel application code
• Fully heterogeneous petascale performance for a sparse solver
Outlook
• Applications besides KPM
• Automatic (& dynamic) load balancing
• Optimized GPU-CPU-MPI communication
• Further optimization techniques (cache blocking, …)
• Performance engineering for Xeon Phi (already supported)
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale: Motivation
- Prologue (II)
- Slide 5
- Slide 6
- The Kernel Polynomial Method
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Implementation
- Slide 13
- Slide 14
- Performance results
- Slide 16
- Slide 17
- Epilogue
- Slide 19
- Backup Slides
- Slide 21
- Slide 22
![Page 2: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/2.jpg)
Prologue (I) The ESSEX project
Holger Fehske Institute for Physics U Greifswald Bruno Lang Applied Computer Science U Wuppertal Achim Basermann Simulation amp SW Technology DLR Georg Hager Erlangen Regional Computing Center Gerhard Wellein Computer Science U Erlangen
Holistic Performance Engineering
Applications
Computational Algorithms
Building Blocks
Equipping Sparse Solvers for Exascale Motivation
Hardware Fault tolerance
Energy efficiency New levels of parallelism
Quantum Physics Applications Extremely large sparse matrices eigenvalues spectral properties
time evolution
Exascale Sparse Solver Repository (ESSR)
ESSEX applications Graphene
topological insulators hellip
Quantum physics chemistry
Sparse eigensolvers preconditioners spectral methods
FT concepts programming for
extreme parallelism
ESSEX
June 2 2015 MS Software for Exascale Computing PASC 2015 3
Prologue (II)
What is the Kernel Polynomial Method and why heterogeneous computing
The Kernel Polynomial Method (KPM)
Approximate the complete eigenvalue spectrum of a large sparse matrix
June 2 2015 5 MS Software for Exascale Computing PASC 2015
119919 119961 = 120640 119961
Good approximation to full spectrum (eg Density of States)
Large Sparse
120640120783120640120784 hellip 120640119948 hellip 120640119951minus120783120640119951
A Weiszlige G Wellein A Alvermann H Fehske ldquoThe kernel polynomial methodrdquo Rev Mod Phys vol 78 p 275 2006 E di Napoli E Polizzi Y Saad ldquoEfficient estimation of eigenvalue counts in an intervalrdquo Preprint httparxivorgabs13084275 O Bhardwaj Y Ineichen C Bekas and A Curioni ldquoHighly scalable linear time estimation of spectrograms - a tool for very large scale data analysisrdquo SC13 Poster
Why optimize for heterogeneous systems One third of TOP500 perfor- mance stems from accelerators
But Few truly heterogeneous software (Using both CPUs and accelerators)
June 2 2015 6 MS Software for Exascale Computing PASC 2015
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 3: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/3.jpg)
Equipping Sparse Solvers for Exascale Motivation
Hardware Fault tolerance
Energy efficiency New levels of parallelism
Quantum Physics Applications Extremely large sparse matrices eigenvalues spectral properties
time evolution
Exascale Sparse Solver Repository (ESSR)
ESSEX applications Graphene
topological insulators hellip
Quantum physics chemistry
Sparse eigensolvers preconditioners spectral methods
FT concepts programming for
extreme parallelism
ESSEX
June 2 2015 MS Software for Exascale Computing PASC 2015 3
Prologue (II)
What is the Kernel Polynomial Method and why heterogeneous computing
The Kernel Polynomial Method (KPM)
Approximate the complete eigenvalue spectrum of a large sparse matrix
June 2 2015 5 MS Software for Exascale Computing PASC 2015
119919 119961 = 120640 119961
Good approximation to full spectrum (eg Density of States)
Large Sparse
120640120783120640120784 hellip 120640119948 hellip 120640119951minus120783120640119951
A Weiszlige G Wellein A Alvermann H Fehske ldquoThe kernel polynomial methodrdquo Rev Mod Phys vol 78 p 275 2006 E di Napoli E Polizzi Y Saad ldquoEfficient estimation of eigenvalue counts in an intervalrdquo Preprint httparxivorgabs13084275 O Bhardwaj Y Ineichen C Bekas and A Curioni ldquoHighly scalable linear time estimation of spectrograms - a tool for very large scale data analysisrdquo SC13 Poster
Why optimize for heterogeneous systems One third of TOP500 perfor- mance stems from accelerators
But Few truly heterogeneous software (Using both CPUs and accelerators)
June 2 2015 6 MS Software for Exascale Computing PASC 2015
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 4: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/4.jpg)
Prologue (II)
What is the Kernel Polynomial Method and why heterogeneous computing
The Kernel Polynomial Method (KPM)
Approximate the complete eigenvalue spectrum of a large sparse matrix
June 2 2015 5 MS Software for Exascale Computing PASC 2015
119919 119961 = 120640 119961
Good approximation to full spectrum (eg Density of States)
Large Sparse
120640120783120640120784 hellip 120640119948 hellip 120640119951minus120783120640119951
A Weiszlige G Wellein A Alvermann H Fehske ldquoThe kernel polynomial methodrdquo Rev Mod Phys vol 78 p 275 2006 E di Napoli E Polizzi Y Saad ldquoEfficient estimation of eigenvalue counts in an intervalrdquo Preprint httparxivorgabs13084275 O Bhardwaj Y Ineichen C Bekas and A Curioni ldquoHighly scalable linear time estimation of spectrograms - a tool for very large scale data analysisrdquo SC13 Poster
Why optimize for heterogeneous systems One third of TOP500 perfor- mance stems from accelerators
But Few truly heterogeneous software (Using both CPUs and accelerators)
June 2 2015 6 MS Software for Exascale Computing PASC 2015
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 5: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/5.jpg)
The Kernel Polynomial Method (KPM)
Approximate the complete eigenvalue spectrum of a large sparse matrix
June 2 2015 5 MS Software for Exascale Computing PASC 2015
119919 119961 = 120640 119961
Good approximation to full spectrum (eg Density of States)
Large Sparse
120640120783120640120784 hellip 120640119948 hellip 120640119951minus120783120640119951
A Weiszlige G Wellein A Alvermann H Fehske ldquoThe kernel polynomial methodrdquo Rev Mod Phys vol 78 p 275 2006 E di Napoli E Polizzi Y Saad ldquoEfficient estimation of eigenvalue counts in an intervalrdquo Preprint httparxivorgabs13084275 O Bhardwaj Y Ineichen C Bekas and A Curioni ldquoHighly scalable linear time estimation of spectrograms - a tool for very large scale data analysisrdquo SC13 Poster
Why optimize for heterogeneous systems One third of TOP500 perfor- mance stems from accelerators
But Few truly heterogeneous software (Using both CPUs and accelerators)
June 2 2015 6 MS Software for Exascale Computing PASC 2015
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 6: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/6.jpg)
Why optimize for heterogeneous systems One third of TOP500 perfor- mance stems from accelerators
But Few truly heterogeneous software (Using both CPUs and accelerators)
June 2 2015 6 MS Software for Exascale Computing PASC 2015
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 7: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/7.jpg)
The Kernel Polynomial Method
Algorithmic Analysis
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization bull Mitigation of the relevant bottleneck Expected speedup
bull Other bottlenecks become relevant Achieved speedup may not be 119913119959119959119951119959119959119959119959119913119959119938119938_119913119917119956119956119959
bull Block vectors are best stored interleaved May impose larger changes to the codebase
bull aug_spmmv() no part of standard libraries Implementation by hand is necessary
June 2 2015 10 MS Software for Exascale Computing PASC 2015
CPU roofline performance model
June 2 2015 11 MS Software for Exascale Computing PASC 2015
S Williams A Waterman D Patterson ldquoRoofline An insightful visual performance model for multicore architecturesrdquo Commun ACM vol 52 p 65 2009
119927 = 119939119913
Gflops Performance limit for bandwidth-bound code b = max bandwidth = 50 GBs B = code balance Ω = 119912119912119913119938119959119959 119941119959119913119959 119913119957119959119951119913119957119913119957119913
119924119959119951119959119956119938119956 119941119959119913119959 119913119957119959119951119913119957119913119957119913
Implementation
How to harness a heterogeneous machine in an efficient way
Implementation Algorithmic optimizations lead to a potential speedup We ldquomerelyrdquo need an efficient implementation Data or task parallelism bull MAGMA task parallelism between devices httpiclcsutkedumagma
bull Kernel fusion Task parallelism
Data-parallel approach suits our needs
June 2 2015 13 MS Software for Exascale Computing PASC 2015
Implementation Data-parallel heterogeneous work distribution bull Static work-distribution by matrix rowsentries bull Device workload device performance
SELL-C-σ sparse matrix storage format bull Unified format for all relevant devices
(CPU GPU Xeon Phi) bull Allows for runtime-exchange of matrix data
(dynamic load balancing future work)
June 2 2015 14 MS Software for Exascale Computing PASC 2015
M Kreutzer G Hager G Wellein H Fehske A R Bishop ldquoA unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD unitsrdquo SIAM J Sci Comput vol 36 p C401 2014
Performance results
Does all this really pay off
Single-node Heterogeneous Performance
June 2 2015 16 MS Software for Exascale Computing PASC 2015
SNB Intel Xeon Sandy Bridge K20X Nvidia Tesla K20X Complex double precision matrixvectors (topological insulator)
Heterogeneous efficiency
Large-scale Heterogeneous Performance
June 2 2015 17 MS Software for Exascale Computing PASC 2015
CRAY XC30 ndash Piz Daint
bull 5272 nodes each w bull 1 Intel Sandy Bridge bull 1 NVIDIA K20x
bull Peak 78 PFs bull LINPACK 63 PFs
bull Largest system in Europe
053 PFs (11 of LINPACK)
O(1010) matrix rows
Thanks to CSCSO SchenkT Schulthess for granting access and compute time
Only a single ALLREDUCE at the end
Epilogue
Try it out (If you want)
Download our building block library amp KPM application httptinyccghost
bull MPI + OpenMP + SIMD + CUDA bull Transparent data-parallel heterogeneous execution bull Affinity-aware task parallelism (checkpointing comm hiding etc) bull Support for block vectors
bull Automatic code generation for common block vector sizes bull Hand-tuned tall skinny dense matrix kernels
bull Fused kernels (arbitrarily ldquoaugmented SpMMVrdquo) bull SELL-C-σ heterogeneous sparse matrix format bull Various sparse eigensolvers implemented and downloadable
ESSEX project webpage httpblogsfaudeessex
June 2 2015 19 MS Software for Exascale Computing PASC 2015
General Hybrid and Optimized Sparse Toolkit
Backup Slides
Only in the unlikely case I was too fast
Conclusions bull Model-guided performance engineering of KPM on
CPU and GPU
bull Decoupling from main memory bandwidth
bull Optimized node-level performance
bull Embedding into massively-parallel application code
bull Fully heterogeneous peta-scale performance for a sparse solver
June 2 2015 21 MS Software for Exascale Computing PASC 2015
Outlook bull Applications besides KPM
bull Automatic (amp dynamic) load balancing
bull Optimized GPU-CPU-MPI communication
bull Further optimization techniques (cache blocking )
bull Performance engineering for Xeon Phi (already
supported
June 2 2015 22 MS Software for Exascale Computing PASC 2015
- Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
- Prologue (I)
- Equipping Sparse Solvers for Exascale Motivation
- Prologue (II)
- Foliennummer 5
- Foliennummer 6
- The Kernel Polynomial Method
- Foliennummer 8
- Foliennummer 9
- Foliennummer 10
- Foliennummer 11
- Implementation
- Foliennummer 13
- Foliennummer 14
- Performance results
- Foliennummer 16
- Foliennummer 17
- Epilogue
- Foliennummer 19
- Backup Slides
- Foliennummer 21
- Foliennummer 22
-
![Page 8: Performance Engineering of the Kernel Polynomial Method on … · 2015-12-08 · Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems Moritz Kreutzer*,](https://reader034.vdocuments.net/reader034/viewer/2022050516/5f9fed6f0d1e331bae7cdf21/html5/thumbnails/8.jpg)
Sparse matrix vector multiply Scaled vector addition Vector scale Scaled vector addition Vector norm Dot Product Augmented Sparse Matrix
Multiple Vector Multiply
The Kernel Polynomial Method Compute Chebyshev polynomials and moments
Holistic view Optimize across all software layers
June 2 2015 8 MS Software for Exascale Computing PASC 2015
Application Loop over random initial states Building blocks (Sparse) linear algebra library
Algorithm Loop over moments
Augmented Sparse Matrix Vector Multiply
Analysis of the Algorithmic Optimization
bull Minimum code balance of vanilla algorithm complex double precision values 32-bit indices 13 non-zeros per row application topological insulators
119913119959119959119951119959119959119959119959 = 120785120785120785 119913119913119913119913119913119917119959119917119917 (B = inverse computational intensity)
bull Identified bottleneck Memory bandwidth Decrease memory transfers to alleviate bottleneck
bull Algorithmic optimizations reduce code balance 119913119959119938119938_119913119917119956119959 = 120784120784120785 119913119917 kernel fusion 119913119959119938119938_119913119917119956119956119959 119929 = 120783120790120790119929 + 120782120785120785 119913119917 put R vectors in block
June 2 2015 9 MS Software for Exascale Computing PASC 2015
See also W D Gropp D K Kaushik D E Keyes and B F Smith ldquoTowards realistic performance bounds for implicit CFD codesrdquo in Proceedings of Parallel CFD99 Elsevier 1999 p 233
Consequences of Algorithmic Optimization
• Mitigation of the relevant bottleneck → expected speedup
• Other bottlenecks become relevant → the achieved speedup may be less than $B_{\mathrm{vanilla}}/B_{\mathrm{aug\_spmmv}}$
• Block vectors are best stored interleaved → may impose larger changes on the codebase (see the layout sketch below)
• aug_spmmv() is not part of standard libraries → implementation by hand is necessary
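"Interleaved" here means row-major block vector storage: the $R$ entries belonging to one matrix row sit next to each other, so the matrix non-zeros are loaded once per $R$ vectors and every access streams full cache lines. A small illustrative sketch (layout macro and function names are assumptions, not prescribed by the talk):

```c
#include <stddef.h>

/* Interleaved (row-major) block vector: entry r of row i */
#define BV(x, i, r, R) ((x)[(size_t)(i) * (R) + (r)])

/* y(:,r) += alpha * x(:,r) for all R vectors in one streaming pass */
static void block_axpy(size_t n, int R, double alpha,
                       const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        for (int r = 0; r < R; ++r)
            BV(y, i, r, R) += alpha * BV(x, i, r, R);
}
```

Code written for separate column vectors indexes them as x[r][i]; moving to the interleaved layout changes every such access pattern, which is why larger changes to the codebase may be needed.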
CPU roofline performance model

$P = b/B$ Gflop/s: the performance limit for bandwidth-bound code, with
• $b$ = maximum memory bandwidth (50 GB/s here),
• $B$ = code balance,
• $\Omega$ = actual data traffic / minimum data traffic.

S. Williams, A. Waterman, D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, p. 65, 2009.
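Plugging the code balances from above into $P = b/B$ makes the optimization potential concrete; a small worked example with $b = 50$ GB/s as on this slide:

```c
#include <stdio.h>

int main(void)
{
    const double b = 50.0;  /* GB/s, max memory bandwidth */
    printf("vanilla        (B = 3.33): P <= %5.1f Gflop/s\n", b / 3.33);
    printf("aug_spmv       (B = 2.23): P <= %5.1f Gflop/s\n", b / 2.23);
    for (int R = 1; R <= 8; R *= 2)  /* blocked variant, R vectors */
        printf("aug_spmmv(R=%d) (B = %.2f): P <= %5.1f Gflop/s\n",
               R, 1.88 / R + 0.33, b / (1.88 / R + 0.33));
    return 0;
}
```

The bandwidth ceiling rises from about 15 Gflop/s (vanilla) to roughly 88 Gflop/s at R = 8, which is why blocking eventually decouples the kernel from main memory bandwidth.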
Implementation

How to harness a heterogeneous machine in an efficient way?

Algorithmic optimizations lead to a potential speedup; we "merely" need an efficient implementation. Data or task parallelism?
• MAGMA uses task parallelism between devices: http://icl.cs.utk.edu/magma
• Kernel fusion vs. task parallelism?
→ A data-parallel approach suits our needs (see the partitioning sketch below).
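Data parallelism here means each device works on its own contiguous share of the matrix rows. A minimal sketch of the static split, with the device performance weights as illustrative placeholders:

```c
#include <stddef.h>

/* Rows assigned to the CPU when splitting nrows between a CPU and a
 * GPU in proportion to their measured kernel performance. */
static size_t cpu_share(size_t nrows, double perf_cpu, double perf_gpu)
{
    return (size_t)((double)nrows * perf_cpu / (perf_cpu + perf_gpu));
}

/* Example: perf_cpu = 20 Gflop/s, perf_gpu = 40 Gflop/s
 * -> the CPU gets one third of the rows, the GPU the rest. */
```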
Implementation: data-parallel heterogeneous work distribution
• Static work distribution by matrix rows/entries
• Device workload ∝ device performance

SELL-C-σ sparse matrix storage format (sketched below):
• Unified format for all relevant devices (CPU, GPU, Xeon Phi)
• Allows for runtime exchange of matrix data (dynamic load balancing: future work)

M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM J. Sci. Comput., vol. 36, p. C401, 2014.
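In SELL-C-σ (described in the paper cited above), rows are sorted by length within windows of σ rows, grouped into chunks of C consecutive rows, and each chunk is padded to its longest row and stored column-major, so C rows advance in lockstep across SIMD lanes or GPU threads. A condensed sketch of the layout and the matching SpMV loop (field names are illustrative; padding entries are assumed to hold value 0 and a valid column index):

```c
#include <stddef.h>

typedef struct {
    size_t        nchunks;   /* number of chunks of C rows each */
    int           C;         /* chunk height (SIMD width / warp size) */
    const size_t *chunkptr;  /* offset of each chunk in val[]/col[] */
    const int    *chunklen;  /* padded row length per chunk */
    const int    *col;
    const double *val;
} sellcs_t;

/* y = A*x; y must be zero-initialized. Element (i,j) of chunk c lives
 * at chunkptr[c] + j*C + i (column-major within the chunk). */
static void sellcs_spmv(const sellcs_t *A, const double *x, double *y)
{
    for (size_t c = 0; c < A->nchunks; ++c) {
        const size_t base = A->chunkptr[c];
        for (int j = 0; j < A->chunklen[c]; ++j)
            for (int i = 0; i < A->C; ++i) {   /* vectorizes over i */
                size_t k = base + (size_t)j * A->C + i;
                y[c * A->C + i] += A->val[k] * x[A->col[k]];
            }
    }
}
```

Choosing C to suit each device (SIMD width on the CPU, a warp-friendly value on the GPU) while sorting with a common σ is what lets the one format serve all devices.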
Performance results

Does all this really pay off?

Single-node heterogeneous performance

SNB: Intel Xeon Sandy Bridge; K20X: NVIDIA Tesla K20X. Complex double-precision matrix/vectors (topological insulator). [Figure: single-node performance and heterogeneous efficiency.]
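The "heterogeneous efficiency" shown on this slide is naturally read as the combined performance relative to the sum of the stand-alone device performances; a hedged formalization (the transcript itself only preserves the label):

```latex
\varepsilon_{\mathrm{het}} \;=\; \frac{P_{\mathrm{CPU+GPU}}}{P_{\mathrm{CPU}} + P_{\mathrm{GPU}}}
```

An efficiency near 1 means the data-parallel split adds the devices' throughputs with negligible coordination overhead.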
Large-scale heterogeneous performance

CRAY XC30 "Piz Daint"
• 5,272 nodes, each with 1 Intel Sandy Bridge CPU and 1 NVIDIA K20X GPU
• Peak: 7.8 PF/s; LINPACK: 6.3 PF/s
• Largest system in Europe

Result: 0.53 PF/s (11% of LINPACK) on a problem with O(10^10) matrix rows; only a single ALLREDUCE at the end (see the sketch below).

Thanks to CSCS, O. Schenk, and T. Schulthess for granting access and compute time.
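Because each node owns a contiguous block of rows and the moments are plain sums, every rank can accumulate its local contributions to all M moments and combine them in one collective at the very end. A minimal sketch of that communication pattern (real-valued moments for brevity; buffer names are illustrative):

```c
#include <mpi.h>

/* Sum the locally accumulated Chebyshev moments across all ranks with
 * a single collective, instead of one reduction per moment. */
void reduce_moments(const double *mu_local, double *mu_global, int M)
{
    MPI_Allreduce(mu_local, mu_global, M, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
}
```

At over 5,000 nodes this is essentially the only global synchronization of the whole run.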
Epilogue

Try it out! (If you want.)

Download our building block library & KPM application: http://tiny.cc/ghost
GHOST: the General, Hybrid, and Optimized Sparse Toolkit
• MPI + OpenMP + SIMD + CUDA
• Transparent data-parallel heterogeneous execution
• Affinity-aware task parallelism (checkpointing, communication hiding, etc.)
• Support for block vectors
• Automatic code generation for common block vector sizes
• Hand-tuned tall & skinny dense matrix kernels
• Fused kernels (arbitrarily "augmented" SpMMV)
• SELL-C-σ heterogeneous sparse matrix format
• Various sparse eigensolvers implemented and downloadable

ESSEX project webpage: http://blogs.fau.de/essex
Backup Slides
Only in the unlikely case I was too fast
Conclusions
• Model-guided performance engineering of KPM on CPU and GPU
• Decoupling from main memory bandwidth
• Optimized node-level performance
• Embedding into a massively parallel application code
• Fully heterogeneous petascale performance for a sparse solver
Outlook
• Applications besides KPM
• Automatic (& dynamic) load balancing
• Optimized GPU-CPU MPI communication
• Further optimization techniques (cache blocking, ...)
• Performance engineering for Xeon Phi (already supported)