a prediction framework for fast sparse triangular solves
TRANSCRIPT
*Koç University, Istanbul, Turkey
parcorelab.com
A Prediction Framework for Fast Sparse Triangular Solves
Najeeb Ahmad*, Buse Yilmaz*, Didem Unat*
Best Artifact Awardee
Outline
• Part I: Main Topic
– Introduction
– Background and Motivation
– Prediction Framework
– Evaluation
– Related Work
– Conclusion
• Part II: Artifact Evaluation
– Our Artifact Evaluation Experience
Introduction
• Sparse Triangular Solve (SpTRSV)
– an important computational kernel
– most time-consuming part of an application in many cases
• e.g. ILU-preconditioned GMRES solvers [1]
• Many CPU, GPU SpTRSV algorithms available
– CPU: Intel MKL, Park et al. [2]
– GPU: NVIDIA cuSPARSE library, Liu et al. [3], Li et al. [4]
• No single algorithm/platform performs best for all input matrices
– Algorithm performance varies with matrix sparsity pattern
Selecting the fastest SpTRSV algorithm is a non-trivial task!!!
Contributions
• A machine learning-based framework
– predicts the fastest SpTRSV algorithm on heterogeneous CPU-GPU systems
– automated feature extraction, performance data collection and model training
– Extensible with new training datasets, algorithms
• Performance, accuracy, and overhead evaluation of the framework on a state-of-the-art CPU-GPU system
• Performance study of six SpTRSV algorithms (CPU & GPU)
• Identification of matrix sparsity features for SpTRSV performance prediction
Background and Motivation
• Sparse triangular systems
Ly = b or Ux = y
[Figure: sparsity patterns of a lower triangular matrix L and an upper triangular matrix U]
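As a point of reference for Ly = b, the following is a minimal sequential sketch of forward substitution, assuming L is stored in CSR format with a non-zero diagonal. The function name and the small example matrix are illustrative only and are not taken from the talk.

```python
def sptrsv_lower_csr(row_ptr, col_idx, vals, b):
    """Solve L y = b by forward substitution, with L lower triangular in CSR form."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                # y[j] was already computed in an earlier iteration.
                acc -= vals[k] * y[j]
            elif j == i:
                diag = vals[k]
        if diag is None or diag == 0.0:
            raise ZeroDivisionError(f"zero or missing diagonal in row {i}")
        y[i] = acc / diag
    return y


# Illustrative 3x3 system: L = [[2, 0, 0], [1, 1, 0], [0, 3, 4]], b = [2, 3, 11].
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals = [2.0, 1.0, 1.0, 3.0, 4.0]
b = [2.0, 3.0, 11.0]
y = sptrsv_lower_csr(row_ptr, col_idx, vals, b)
print(y)
```

Row i can only be finished once every y[j] it references is known, which is the dependency that parallel SpTRSV algorithms have to work around.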
Background and Motivation
• SpTRSV characteristics
[Figure: sparsity pattern of L and the dependency graph for L]
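The dependency graph can be turned into levels, where rows in the same level do not depend on each other. The sketch below assumes CSR storage and defines the level of a row as one more than the highest level among the rows it depends on; the helper names and the example matrix are illustrative, not from the talk.

```python
def compute_levels(row_ptr, col_idx):
    """Return the level of each row of a lower triangular CSR matrix."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                # Row i needs y[j], so it must sit at least one level after row j.
                level[i] = max(level[i], level[j] + 1)
    return level


def group_by_level(level):
    """Group row indices by level; rows inside one group can be solved in parallel."""
    groups = [[] for _ in range(max(level) + 1)] if level else []
    for row, lv in enumerate(level):
        groups[lv].append(row)
    return groups


# Illustrative 4x4 pattern: row 2 depends on row 0, row 3 depends on rows 1 and 2.
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
levels = compute_levels(row_ptr, col_idx)
groups = group_by_level(levels)
print(levels, groups)
```

The number of groups and their sizes indicate how much parallelism a given sparsity pattern exposes.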
Background and Motivation
• SpTRSV Algorithms
– Level-scheduling
– Synchronization-free
• SpTRSV performance
– CPU
• Intel MKL library
– MKL(sequential)
– MKL(parallel)
– GPU
• NVIDIA cuSPARSE library
– CUS1
– CUS2(level)
– CUS2(no level)
• Sync-Free [3]
Fastest SpTRSV on Intel Gold (6148) + NVIDIA V100 GPU for 37 sparse matrices (from SuiteSparse collection):
MKL(seq) 30%, MKL(par) 5%, CUS1 19%, CUS2(lvl) 19%, CUS2(no lvl) 5%, Sync-Free 22%
Fastest platform on Intel Gold (6148) + NVIDIA V100 GPU for 37 sparse matrices (from SuiteSparse collection): CPU 35%, GPU 65%
How to select the fastest SpTRSV algorithm for a given input matrix on a CPU-GPU platform?
SpTRSV Prediction Framework
• A machine learning-based framework for the fastest SpTRSV prediction on a CPU-GPU machine
– Based on features and SpTRSV performance data of a pre-selected matrix set
• Framework overview
– Five components
• 1: Sparse Matrix Dataset
• 2: Matrix Feature Extractor
• 3: SpTRSV Algorithm Repository
• 4: Performance Data Collector
• 5: Model Trainer and Tester
[Diagram: the five components produce a Prediction Model, which takes an Input Sparse Matrix and outputs the Predicted SpTRSV Algorithm]
Feature Selection
• Structural or sparsity features
– Started with ~50 features, 30 features finalized
• Based on feature scores
Feature Selection
• Selected Features
No.    Features                               Description                    Score rank
1      nnzs                                   # of non-zeros                 1
2-4    <max,mean,std>_nnz_pl_rw               nnz per level row wise stats   2,4,5
5      max_nnz_pl_cw                          nnz per level col wise stats   3
6      m                                      Number of rows/columns         6
7-10   <max,mean,median,std>_rpl              Rows per level stats           7,12,13,16
11-12  <min,max>_cl_cnt                       Columns with max/min length    8,10
13-14  <max,min>_rl_cnt                       Rows with max/min lengths      9,11
15-17  <max,std,median>_cl                    Column-length stats            14,22,29
18     lvls                                   # of levels                    15
19-21  mean_<max,mean,std>_cl_pl              Column-length per level stats  17,18,20
22-25  <max,mean,median,std>_rl               Row-length stats               19,27,28,30
26-30  mean_<max,std,mean,median,min>_rl_pl   Row-length per level stats     21,23,24,25,26
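To make a few of the listed features concrete, here is a small sketch that computes nnzs, m, lvls, the rows-per-level statistics and some row-length statistics from a CSR matrix. It is an illustration under assumptions: the function name is hypothetical, the standard deviation is taken as the population standard deviation, and the exact definitions used by the framework may differ.

```python
import statistics


def extract_features(row_ptr, col_idx):
    """Compute a subset of the sparsity features for a non-empty lower triangular CSR matrix."""
    m = len(row_ptr) - 1
    nnzs = row_ptr[m]

    # Level of each row: one more than the highest level it depends on.
    level = [0] * m
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    lvls = max(level) + 1

    rows_per_level = [0] * lvls
    for lv in level:
        rows_per_level[lv] += 1
    row_len = [row_ptr[i + 1] - row_ptr[i] for i in range(m)]

    return {
        "nnzs": nnzs,
        "m": m,
        "lvls": lvls,
        "max_rpl": max(rows_per_level),
        "mean_rpl": statistics.mean(rows_per_level),
        "median_rpl": statistics.median(rows_per_level),
        "std_rpl": statistics.pstdev(rows_per_level),  # assumption: population std
        "max_rl": max(row_len),
        "mean_rl": statistics.mean(row_len),
        "median_rl": statistics.median(row_len),
    }


# Illustrative 4x4 pattern with levels [0, 0, 1, 2].
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
feats = extract_features(row_ptr, col_idx)
print(feats)
```

Most of the selected features are per-level statistics, so a single level computation pass can feed many of them.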
SpTRSV Prediction Framework
• Matrix Feature Extractor
– A C++/CUDA tool
• Uses CPU, GPU for efficient feature extraction
• Performance data collector
– For each input matrix, collects performance data for all algorithms
– Reports ID of the fastest algorithm
• Model Trainer and Tester
• Model Selection
– Scikit-learn library for model selection and evaluation
– Supervised machine learning
• Deep learning requires large data sets and long training times
– Evaluated classification models:
• Decision trees
• Random Forests
• Support Vector Machine (with grid-search)
• K-Nearest Neighbors
• Multi-Layer Perceptron
[Diagram: matrix features and IDs of the fastest algorithm are the inputs to the Model Trainer and Tester]
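The performance data collector step can be pictured as follows: time every algorithm on a matrix and report the ID of the fastest one as the training label. This is a hedged sketch only. The algorithm names come from the slides, but the ID order, the helper names, the use of the best of several runs, and the example timings are all assumptions; timing real GPU kernels would additionally need device synchronization.

```python
import time

# Algorithm names from the slides; the ID order is an assumption for illustration.
ALGORITHMS = ["MKL(seq)", "MKL(par)", "CUS1", "CUS2(lvl)", "CUS2(no lvl)", "Sync-Free"]


def time_solver(solver, args, repeats=5):
    """Return the best wall-clock time in seconds of solver(*args) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        solver(*args)
        best = min(best, time.perf_counter() - start)
    return best


def fastest_algorithm_id(timings):
    """Return the index of the smallest entry in timings (one entry per algorithm)."""
    return min(range(len(timings)), key=lambda i: timings[i])


# Hypothetical timings in milliseconds for one matrix, in ALGORITHMS order.
example_timings = [4.1, 6.3, 2.9, 2.2, 3.5, 2.6]
label = fastest_algorithm_id(example_timings)
print(label, ALGORITHMS[label])

# time_solver works with any callable; `sorted` stands in for a real SpTRSV routine here.
measured = time_solver(sorted, ([3, 1, 2],))
```

The resulting label, paired with the matrix features, forms one training sample for the classifier.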
Model Training and Testing
[Diagram: training and testing pipeline with the stages Sparse Matrix Dataset, Matrix Feature Extraction, Algorithm Performance Data, Feature Scaling, Training Set (75%), Testing Set (25%), Classifier Training, 10-fold cross validation, Trained Model, Predicted Algorithm]
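The training and testing flow can be sketched with scikit-learn roughly as below. This is not the authors' script: the data is synthetic (998 samples, 30 features, 6 classes, standing in for the matrix features and fastest-algorithm IDs), and the choice of StandardScaler, a random forest with 50 trees, and the fixed random seeds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (matrix features, ID of fastest algorithm).
X, y = make_classification(
    n_samples=998, n_features=30, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# 75% training set, 25% testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature scaling followed by a classifier; the scaler is fit on training data only.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# 10-fold cross validation on the training set.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Final fit and evaluation on the held-out testing set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
predicted = model.predict(X_test[:5])

print(cv_scores.mean(), test_accuracy, predicted)
```

Wrapping the scaler and classifier in one pipeline keeps the scaling statistics from leaking between cross-validation folds; swapping in another of the evaluated classifiers only changes the second pipeline step.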
Effects of CPU-GPU Data Transfers
• Computations before/after SpTRSV on different platforms
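[Diagram: a numerical application solving Ly = b, with SpTRSV between computations on different platforms]
– Computations on CPU before SpTRSV, computations on GPU after
• SpTRSV on CPU: y is transferred
• SpTRSV on GPU: b is transferred
– Computations on GPU before SpTRSV, computations on CPU after
• SpTRSV on GPU: y is transferred
• SpTRSV on CPU: b is transferred
• Data transfers have no impact on algorithm selection in this case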
Effects of CPU-GPU Data Transfers
• Computations before/after SpTRSV on same platform
[Diagram: a numerical application solving Ly = b, with SpTRSV between computations on the same platform]
– Computations on CPU before and after SpTRSV
• SpTRSV on CPU: no transfer
• SpTRSV on GPU: y and b are transferred
– Computations on GPU before and after SpTRSV
• SpTRSV on GPU: no transfer
• SpTRSV on CPU: y and b are transferred
• Data transfers may impact algorithm selection in this case
• Framework accounts for data transfer costs
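One way to picture how transfer costs can change the selection is to add a transfer penalty whenever SpTRSV runs on a different platform than the surrounding computation. The sketch below is an assumption about the idea, not the framework's actual mechanism: the function name, the single per-vector transfer time, and the timings are hypothetical.

```python
def select_with_transfers(exec_time, platform, before, after, transfer_time):
    """Pick the algorithm with the lowest execution time plus transfer time.

    exec_time: algorithm name -> SpTRSV time in seconds
    platform:  algorithm name -> "CPU" or "GPU"
    before/after: platform of the computation before/after SpTRSV
    transfer_time: assumed cost of moving one vector (b or y) between CPU and GPU
    """
    total = {}
    for algo, t in exec_time.items():
        p = platform[algo]
        # b must be moved in if the producer is elsewhere, y moved out if the consumer is.
        n_transfers = int(before != p) + int(after != p)
        total[algo] = t + n_transfers * transfer_time
    best = min(total, key=total.get)
    return best, total


# Hypothetical timings (seconds) for one matrix.
exec_time = {"MKL(seq)": 5.0e-3, "Sync-Free": 3.0e-3}
platform = {"MKL(seq)": "CPU", "Sync-Free": "GPU"}

# Different platforms before/after: both candidates pay one transfer, ranking unchanged.
best_diff, _ = select_with_transfers(exec_time, platform, "CPU", "GPU", 1.5e-3)

# Same platform before/after: only the GPU candidate pays two transfers.
best_same, _ = select_with_transfers(exec_time, platform, "CPU", "CPU", 1.5e-3)

print(best_diff, best_same)
```

With these numbers the GPU algorithm wins when the surrounding computations sit on different platforms, but the CPU algorithm wins when everything else runs on the CPU, which mirrors the two cases on the slides.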
Evaluation
• Hardware platform
– CPU: Intel Gold (6148)
• 40 cores (20 cores/socket)
– GPU: NVIDIA Tesla V100
• Software configuration
– Intel Parallel Studio 2019
– NVIDIA CUDA 10.1
– Compiler options:
• -O3
• -gencode arch=compute_70,code=sm_70
• Sparse Matrix Dataset
– 998 real square matrices from SuiteSparse matrix collection
• 1K to 16.24M rows
• 1.074K to 232M nnzs
– Extensible with new matrices
• Performance of SpTRSV Algorithms (998 matrices)
– By algorithm: MKL(seq) 41%, MKL(par) 1%, CUS1 11%, CUS2(lvl) 6%, CUS2(no lvl) 2%, Sync-Free 39%
– By platform: CPU 42%, GPU 58%
Evaluation: Prediction Accuracy
– 10-fold cross validation
– With 30 features
– With top 10 features
Mean (with 30 features): ~87% ~89% ~87% ~87%
Mean (with top 10 features): ~80% ~81% ~80% ~79%
Evaluation: Speedup Gain
• Framework achieves significant speedups over arbitrary algorithm choice
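A speedup over a fixed algorithm choice can be computed per matrix as the time of the fixed algorithm divided by the time of the predicted one, then averaged. The sketch below assumes an arithmetic mean and uses hypothetical timings; the talk does not state the exact definition behind the reported numbers.

```python
def mean_speedup(times, predicted, baseline):
    """Mean over matrices of time(baseline) / time(predicted algorithm).

    times:     list of dicts, one per matrix, algorithm name -> time
    predicted: list of predicted algorithm names, one per matrix
    baseline:  algorithm name that would otherwise be used for every matrix
    """
    ratios = [t[baseline] / t[p] for t, p in zip(times, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical timings (milliseconds) for two matrices.
times = [
    {"MKL(seq)": 4.0, "Sync-Free": 2.0},
    {"MKL(seq)": 1.0, "Sync-Free": 3.0},
]
predicted = ["Sync-Free", "MKL(seq)"]

over_mkl = mean_speedup(times, predicted, "MKL(seq)")
over_syncfree = mean_speedup(times, predicted, "Sync-Free")
print(over_mkl, over_syncfree)
```

A correct prediction never yields a ratio below 1 for that matrix, so the mean reflects how often and by how much the fixed choice is suboptimal.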
Evaluation: Overheads
• Acceptable overheads, especially for large matrices
Related Work
• OSKI library [5]
– Runtime autotuning of SpTRSV
• PetaBricks [6]
– Algorithm selection based on data size
• Nitro [7]
– Algorithm selection through user-guided machine learning
• MAPS simulation framework [8]
– Heuristics-based SpTRSV algo. selection on CPU/GPU
– Limited to reservoir simulation
• SpTRSV Algo. Selection on GPUs [9]
– Machine learning-based approach
Conclusions
• We use a supervised machine learning approach for SpTRSV algorithm selection on CPU-GPU systems
• We implemented the approach as an automated, extensible framework for prediction model training and fastest SpTRSV prediction
• Framework evaluation on Intel Gold CPU + NVIDIA V100 GPU shows 87% model accuracy and 1.4-2.7x mean SpTRSV speedups
Paper Artifacts
• Artifacts
– Support materials (source code, tools, benchmarks, datasets, models) required for reproducibility of claimed experimental results¹
• Artifact Evaluation Process (AEP) at Euro-Par¹
– Completely optional but highly recommended
– Evaluated by an independent committee of experts
• Our motivation for Artifact Evaluation
– Making our research more accessible
– Documenting and organizing the research for future reference/extension
– Enhancing the credibility of the research claims
¹ Euro-Par 2020 Call for Artifact Evaluation
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 22
A selection criterionArtifact should be self-contained
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 23
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 23
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 23
Artifact Preparation for Evaluation
• File organization
A Prediction Framework for Fast Sparse Triangular Solves 23
A selection criterionArtifacts with long running time will not be evaluated
– A selection criterion: ease of use
Artifact Preparation for Evaluation
• The Overview Document
Artifact Preparation for Evaluation
• Reproducing Paper Results
Artifact Preparation for Evaluation
• Dataset Generation Guide
Artifact Evaluation
• Final Remarks
– Artifact Evaluation, a time-consuming but rewarding process
– Artifact mirror: https://github.com/ParCoreLab/SpTRSV_Framework
– For questions, queries, suggestions, contact: [email protected]
THANK YOU
References
[1] A. Jamal et al., "A Hybrid CPU/GPU Approach for the Parallel Algebraic Recursive Multilevel Solver pARMS," 2016 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, 2016
[2] Park, J. et al., "Sparsifying synchronization for high-performance shared-memory sparse triangular solver," ISC, 2014
[3] Liu, W. et al., "Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides," Concurrency Computat: Pract Exper., 2017
[4] Li et al., "Efficient parallel implementations of sparse triangular solves for GPU architectures," SIAM Conference on Parallel Processing for Scientific Computing, 2020
[5] Vuduc et al., "OSKI: A library of automatically tuned sparse matrix kernels," Journal of Physics: Conference Series, 2005
[6] Ansel, J. et al., "PetaBricks: A language and compiler for algorithmic choice," SIGPLAN, 2009
[7] Muralidharan, S. et al., "Nitro: A framework for adaptive code variant tuning," IEEE 28th IPDPS, 2014
[8] Klie, H. et al., "Exploiting capabilities of many core platforms in reservoir simulation," SPE Reservoir Simulation Symposium, 2011
[9] Dufrechou, E. et al., "Automatic selection of sparse triangular linear system solvers on GPUs through machine learning techniques," International Symposium on Computer Architecture and High Performance Computing, 2019