Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
2nd International Workshop on GPUs and Scientific Applications
Galveston Island, TX
Iterative Statistical Applications
• Consist of iterative computation and communication steps
• Growing set of applications
  – Clustering, data mining, machine learning & dimension reduction applications
  – Driven by the data deluge & emerging computation fields
[Figure: Compute → Communicate → Reduce/barrier → New Iteration]
Iterative Statistical Applications
• Data intensive
• Larger loop-invariant data
• Smaller loop-variant delta between iterations
  – Result of an iteration
  – Broadcast to all the workers of the next iteration
• High memory access to floating point operations ratio
Motivation
• Important set of applications
• Increasing power and availability of GPGPU computing
• Cloud Computing
  – Iterative MapReduce technologies
  – GPGPU computing in clouds
(Image from http://aws.amazon.com/ec2/)
Motivation
• A sample bioinformatics pipeline
[Pipeline diagram: Gene Sequences → Pairwise Alignment & Distance Calculation (O(N×N)) → Distance Matrix → Clustering (O(N×N)) → Cluster Indices, and Distance Matrix → Multi-Dimensional Scaling (O(N×N)) → Coordinates → Visualization → 3D Plot]
http://salsahpc.indiana.edu/
Overview
• Three iterative statistical kernels implemented using OpenCL
  – KMeans Clustering
  – Multi-Dimensional Scaling
  – PageRank
• Optimized by
  – Reusing loop-invariant data
  – Utilizing different memory levels
  – Rearranging data storage layouts
  – Dividing work between CPU and GPU
OpenCL
• Cross-platform, vendor-neutral, open standard
  – GPGPU, multi-core CPU, FPGA…
• Supports parallel programming in heterogeneous environments
• Compute kernels
  – Based on C99
  – Basic unit of executable code (a minimal sketch follows)
• Work items
  – Single element of the execution domain
  – Grouped into work groups
    • Communication & synchronization within a work group
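To make the terminology concrete, here is a minimal, hypothetical kernel (not one of the kernels studied in this work; the names scale, in, out, and factor are illustrative). Each work item processes one element of the execution domain, and barrier() is the work-group-level synchronization primitive referred to above.

    // One work item per vector element
    __kernel void scale(__global const float *in,
                        __global float *out,
                        const float factor,
                        const int n)
    {
        int i = get_global_id(0);       // this work item's position in the execution domain
        if (i < n)
            out[i] = factor * in[i];
        // work items in the same work group can synchronize with each other:
        barrier(CLK_LOCAL_MEM_FENCE);   // a no-op here, shown only to illustrate the primitive
    }

The host enqueues this kernel over an n-element global range, optionally split into work groups of a chosen local size.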
OpenCL Memory Hierarchy
[Diagram: each compute unit has its own local memory shared by its work items; each work item additionally has private memory; all compute units share the global GPU memory and the constant memory; the CPU (host) sits outside the device memory hierarchy]
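A short, hypothetical kernel sketch (names are illustrative) showing how the OpenCL address-space qualifiers map onto the hierarchy in the diagram:

    __kernel void memory_spaces(__global const float *data,   // global GPU memory
                                __constant float *params,     // constant memory (small, read-only)
                                __local float *scratch,       // local memory, shared by a work group
                                __global float *out,
                                const int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float x = 0.0f;                    // private memory (per work item)
        if (gid < n)
            x = data[gid] * params[0];
        scratch[lid] = x;                  // stage the value in local memory
        barrier(CLK_LOCAL_MEM_FENCE);      // make it visible to the whole work group
        if (gid < n)
            out[gid] = scratch[lid];
    }

The host sizes the __local buffer with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL); local memory is fast but small (16 KB per group of 8 cores on the Tesla C1060 used in this work).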
Environment
• NVIDIA Tesla C1060
  – 240 scalar processors
  – 4 GB global memory
  – 102 GB/sec peak memory bandwidth
  – 16 KB shared memory per 8 cores
  – CUDA compute capability 1.3
  – Peak performance
    • 933 GFLOPS single precision (with SF)
    • 622 GFLOPS single precision (MAD)
    • 77.7 GFLOPS double precision
KMeans Clustering
• Partition a given data set into disjoint clusters
• Each iteration
  – Cluster assignment step
  – Centroid update step
• Flops per work item: 3DM + M (see the assignment-step sketch below)
  – D : number of dimensions
  – M : number of centroids
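A hedged sketch of the cluster-assignment step (an illustrative version, not the exact kernel used for the measurements; the names points, centroids, and membership are assumptions). Each work item owns one data point and computes the squared distance to each of the M centroids, 3 flops per dimension (subtract, multiply, add) plus one comparison per centroid, which is where the 3DM + M flops per work item come from.

    __kernel void kmeans_assign(__global const float *points,    // N x D data points (row-major)
                                __global const float *centroids, // M x D centroids
                                __global int *membership,        // index of nearest centroid per point
                                const int n, const int d, const int m)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        float best_dist = FLT_MAX;
        int best = 0;
        for (int c = 0; c < m; c++) {
            float dist = 0.0f;
            for (int k = 0; k < d; k++) {
                float diff = points[i * d + k] - centroids[c * d + k];
                dist += diff * diff;           // 3 flops per dimension
            }
            if (dist < best_dist) {            // 1 comparison per centroid
                best_dist = dist;
                best = c;
            }
        }
        membership[i] = best;
    }

The centroid-update step, which averages the points assigned to each centroid, completes the iteration.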
Re-using loop-invariant data
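Because the data points never change between iterations, they only need to cross the PCIe bus once; only the small loop-variant delta (the centroids) is re-sent each iteration. A hedged host-side fragment (assuming a context, queue, and the kmeans_assign kernel sketched above already exist; variable names are illustrative):

    // Loop-invariant data: copied to the GPU once, before the iterations start
    cl_mem d_points = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     n * d * sizeof(float), points, &err);
    // Loop-variant data: small buffers rewritten / read back every iteration
    cl_mem d_centroids  = clCreateBuffer(context, CL_MEM_READ_ONLY,  m * d * sizeof(float), NULL, &err);
    cl_mem d_membership = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(int),       NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_points);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_centroids);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_membership);
    clSetKernelArg(kernel, 3, sizeof(int), &n);
    clSetKernelArg(kernel, 4, sizeof(int), &d);
    clSetKernelArg(kernel, 5, sizeof(int), &m);

    for (int iter = 0; iter < max_iters; iter++) {
        // only the loop-variant centroids are transferred each iteration
        clEnqueueWriteBuffer(queue, d_centroids, CL_TRUE, 0, m * d * sizeof(float),
                             centroids, 0, NULL, NULL);
        size_t global = n;  // in practice rounded up to a multiple of the work-group size
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_membership, CL_TRUE, 0, n * sizeof(int),
                            membership, 0, NULL, NULL);
        /* ... update the centroids on the host from the new membership ... */
    }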
KMeans Clustering Optimizations
• Naïve (with data re-use)
[Chart: GFLOPS (0–120) vs. number of data points (256 to 256,000,000)]
KMeans Clustering Optimizations
• Data points copied to local memory
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A) and Data in Local Memory (B)]
KMeans Clustering Optimizations
• Cluster centroid points copied to local memory (see the sketch below)
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), and C + Data Coalescing (D)]
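A hedged sketch of the centroids-in-local-memory variant (again illustrative, not the exact production kernel): the work group first stages the M x D centroid block into __local memory cooperatively, then every work item reads the centroids from local memory instead of global memory.

    __kernel void kmeans_assign_local(__global const float *points,
                                      __global const float *centroids,
                                      __local  float *local_centroids,  // m * d floats, sized by the host
                                      __global int *membership,
                                      const int n, const int d, const int m)
    {
        int i   = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        // cooperative copy of all centroids into local memory
        for (int idx = lid; idx < m * d; idx += lsz)
            local_centroids[idx] = centroids[idx];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (i >= n) return;
        float best_dist = FLT_MAX;
        int best = 0;
        for (int c = 0; c < m; c++) {
            float dist = 0.0f;
            for (int k = 0; k < d; k++) {
                float diff = points[i * d + k] - local_centroids[c * d + k];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        membership[i] = best;
    }

Since every work item scans all M centroids in every iteration, serving those reads from local memory removes a large fraction of the global-memory traffic.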
KMeans Clustering Optimizations
• Local-memory data points in column-major order (see the fragment below)
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D), and D + Local Data Points Column Major]
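The layout optimizations are essentially an indexing change. A hedged fragment of the distance loop (i is the data point handled by this work item, c the centroid being tested):

    for (int k = 0; k < d; k++) {
        // row-major layout:    points[i * d + k]  -- neighbouring work items access
        //                                            addresses d floats apart
        // column-major layout: points[k * n + i]  -- neighbouring work items access
        //                                            consecutive addresses
        float diff = points[k * n + i] - local_centroids[c * d + k];
        dist += diff * diff;
    }

With the column-major order, consecutive work items touch consecutive addresses, which gives coalesced global-memory loads (the coalescing optimization) and, for the copy of the data points kept in local memory, helps avoid bank conflicts.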
KMeans Clustering Performance
• Varying number of clusters (centroids)
[Chart: GFLOPS (0–140) vs. number of data points for 50, 100, 200, 300, and 360 centroids]
KMeans Clustering Performance
• Varying number of dimensions
[Chart: speedup (GPU vs. single-core CPU) vs. number of data points for 2D and 4D data with 100 and 300 centroids (2D-100, 2D-300, 4D-100, 4D-300)]
KMeans Clustering Performance
• Increasing number of iterations
[Charts: GFLOPS vs. number of data points, and time per iteration (ms, log scale) vs. number of data points, for 5, 10, 15, and 20 iterations]
KMeans Clustering Overhead
[Chart: Double Compute, Regular (Single Compute), and Compute Only times (log scale) vs. number of data points, with the relative overhead (%) on a secondary axis]
Multi-Dimensional Scaling
• Map a data set in a high-dimensional space to a data set in a lower-dimensional space
• Uses an N×N dissimilarity matrix as the input
  – Output usually in 3D (N×3) or 2D (N×2) space
• Flops per work item: 8DN + 7N + 3D + 1
  – D : target dimension
  – N : number of data points
• SMACOF MDS algorithm (a kernel sketch follows)
http://salsahpc.indiana.edu/
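As a concrete illustration, here is a hedged per-work-item sketch of the unweighted SMACOF update (the Guttman transform X(k+1) = (1/N) B(X(k)) X(k)); it is not the paper's exact kernel, the stress computation is omitted, and the names delta, x, and x_new are assumptions. Work item i produces row i of X(k+1), and the N×N dissimilarity matrix delta is exactly the loop-invariant data whose caching is evaluated on the next slide. The target dimension d is assumed to be at most 3.

    __kernel void smacof_update(__global const float *delta,  // N x N dissimilarities (loop-invariant)
                                __global const float *x,      // N x D current embedding X(k)
                                __global float *x_new,        // N x D updated embedding X(k+1)
                                const int n, const int d)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        float acc[3] = {0.0f, 0.0f, 0.0f};   // assumes target dimension d <= 3
        float diag = 0.0f;                    // accumulates b_ii = -sum_{j != i} b_ij

        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float dist = 0.0f;                // distance between points i and j in the target space
            for (int k = 0; k < d; k++) {
                float diff = x[i * d + k] - x[j * d + k];
                dist += diff * diff;
            }
            dist = sqrt(dist);
            float b_ij = (dist > 1e-7f) ? -delta[i * n + j] / dist : 0.0f;
            diag -= b_ij;
            for (int k = 0; k < d; k++)
                acc[k] += b_ij * x[j * d + k];
        }
        for (int k = 0; k < d; k++)           // add the diagonal term and scale by 1/N
            x_new[i * d + k] = (acc[k] + diag * x[i * d + k]) / (float)n;
    }

Each work item streams over an entire row of the dissimilarity matrix and over the whole current embedding, which is why the memory-access-to-flop ratio is high and the caching and local-memory optimizations on the following slides matter.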
MDS Optimizations
• Re-using loop-invariant data
[Chart: speedup from caching the loop-invariant data (%, up to ~90%) vs. number of data points N (0–20,000)]
MDS Optimizations
• Naïve (with loop-invariant data reuse)
[Chart: performance (GFLOPS, 0–70) vs. number of data points N (0–20,000)]
MDS Optimizations
• Results stored in shared (local) memory
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve and Results in Shared Mem]
MDS Optimizations
• X(k) stored in shared (local) memory
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve, Results in Shared Mem, and X(k) in Shared Mem]
MDS Optimizations
• Data points coalesced
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve, Results in Shared Mem, X(k) in Shared Mem, and Data Points Coalesced]
MDS Performance
• Increasing number of iterations
[Charts: GPU speedup vs. N, and performance (GFLOPS, 0–70) vs. N, for 10, 25, 50, and 100 iterations]
MDS Overhead
[Chart: Double Compute, Regular (Single Compute), and Compute Only times (log scale) vs. number of data points N, with the relative overhead (%) on a secondary axis]
PageRank
• Analyzes linkage information to measure the relative importance of web pages
• Core computation is a sparse matrix-vector multiplication
• Web graph
  – Very sparse
  – Power-law distribution
Sparse Matrix Representations
• ELLPACK
• Compressed Sparse Row (CSR), as sketched below
http://www.nvidia.com/docs/IO/66889/nvr-2008-004.pdf
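For reference, a minimal scalar CSR sparse matrix-vector multiply kernel (one row per work item), in the spirit of the kernels described in the NVIDIA report linked above; this is an illustrative sketch, not the exact PageRank kernel used here. For PageRank, x would hold the current rank vector and y the accumulated in-link contributions, with the damping/teleport term applied separately.

    __kernel void spmv_csr_scalar(__global const int   *row_ptr,  // n_rows + 1 row offsets
                                  __global const int   *col_idx,  // column index of each non-zero
                                  __global const float *values,   // non-zero values
                                  __global const float *x,        // input vector
                                  __global float       *y,        // output vector
                                  const int n_rows)
    {
        int row = get_global_id(0);          // one work item per matrix row
        if (row >= n_rows) return;

        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; jj++)
            dot += values[jj] * x[col_idx[jj]];
        y[row] = dot;
    }

ELLPACK instead pads every row to a fixed number of non-zeros, which suits the GPU but wastes space on a power-law graph; that is why the hybrid versions on the next slide keep the short rows (K(i) below a threshold) in ELLPACK on the GPU and hand the few long rows to the CPU.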
PageRank Implementations
[Chart: time (ms) vs. number of iterations (10–150) for CPU only, and for hybrid versions that process rows with K(i) < 4, K(i) < 7, or K(i) < 16 non-zeros in ELLPACK on the GPU and the remaining rows on the CPU]
Lessons
• Reusing loop-invariant data
• Leveraging local memory
• Optimizing data layout
• Sharing work between CPU & GPU
OpenCL experience
• Flexible programming environment
• Support for work-group-level synchronization primitives
• Lack of debugging support
• Lack of dynamic memory allocation
• More of a compilation target than a user programming environment?
Future Work
• Extending kernels to distributed environments
• Comparing with CUDA implementations
• Exploring more aggressive CPU/GPU sharing
• Studying more application kernels
• Data reuse in the pipeline
Acknowledgements
• This work was started as a class project for CSCI-B649: Parallel Architectures (Spring 2010) at the IU School of Informatics and Computing.
• Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.
• We thank Seung-Hee Bae, Bingjing Zhang, Hui Li, and the SALSA group (http://salsahpc.indiana.edu/) for their algorithmic insights.
Questions
Thank You!
KMeans Clustering Optimizations
• Data in global memory coalesced
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), and C + Data Coalescing (D)]