Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
2nd International Workshop on GPUs and Scientific Applications
Galveston Island, TX
Iterative Statistical Applications
• Consist of iterative computation and communication steps
• Growing set of applications
  – Clustering, data mining, machine learning & dimension reduction applications
  – Driven by the data deluge & emerging computation fields
[Figure: Compute → Communicate → Reduce/barrier → New Iteration]
Iterative Statistical Applications
• Data intensive
• Larger loop-invariant data
• Smaller loop-variant delta between iterations
  – Result of an iteration
  – Broadcast to all the workers of the next iteration
• High memory access to floating point operations ratio
Motivation
• Important set of applications
• Increasing power and availability of GPGPU computing
• Cloud Computing
  – Iterative MapReduce technologies
  – GPGPU computing in clouds
(Image from http://aws.amazon.com/ec2/)
Motivation
• A sample bioinformatics pipeline
[Pipeline diagram: Gene Sequences → Pairwise Alignment & Distance Calculation (O(N×N)) → Distance Matrix → Clustering (O(N×N)) → Cluster Indices, and Distance Matrix → Multi-Dimensional Scaling (O(N×N)) → Coordinates → Visualization → 3D Plot]
http://salsahpc.indiana.edu/
Overview
• Three iterative statistical kernels implemented using OpenCL
  – KMeans Clustering
  – Multi-Dimensional Scaling
  – PageRank
• Optimized by
  – Reusing loop-invariant data
  – Utilizing different memory levels
  – Rearranging data storage layouts
  – Dividing work between CPU and GPU
OpenCL
• Cross-platform, vendor-neutral, open standard
  – GPGPU, multi-core CPU, FPGA…
• Supports parallel programming in heterogeneous environments
• Compute kernels
  – Based on C99
  – Basic unit of executable code (a minimal sketch follows)
• Work items
  – Single element of the execution domain
  – Grouped into work groups
    • Communication & synchronization within a work group
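To make the terminology concrete, here is a minimal, hypothetical kernel (not one of the kernels studied in this work; the names scale, in, out, and factor are illustrative). Each work item processes one element of the execution domain, and barrier() is the work-group-level synchronization primitive referred to above.

    // One work item per vector element
    __kernel void scale(__global const float *in,
                        __global float *out,
                        const float factor,
                        const int n)
    {
        int i = get_global_id(0);       // this work item's position in the execution domain
        if (i < n)
            out[i] = factor * in[i];
        // work items in the same work group can synchronize with each other:
        barrier(CLK_LOCAL_MEM_FENCE);   // a no-op here, shown only to illustrate the primitive
    }

The host enqueues this kernel over an n-element global range, optionally split into work groups of a chosen local size.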
OpenCL Memory Hierarchy
[Diagram: each compute unit has its own local memory shared by its work items; each work item additionally has private memory; all compute units share the global GPU memory and the constant memory; the CPU (host) sits outside the device memory hierarchy]
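A short, hypothetical kernel sketch (names are illustrative) showing how the OpenCL address-space qualifiers map onto the hierarchy in the diagram:

    __kernel void memory_spaces(__global const float *data,   // global GPU memory
                                __constant float *params,     // constant memory (small, read-only)
                                __local float *scratch,       // local memory, shared by a work group
                                __global float *out,
                                const int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float x = 0.0f;                    // private memory (per work item)
        if (gid < n)
            x = data[gid] * params[0];
        scratch[lid] = x;                  // stage the value in local memory
        barrier(CLK_LOCAL_MEM_FENCE);      // make it visible to the whole work group
        if (gid < n)
            out[gid] = scratch[lid];
    }

The host sizes the __local buffer with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL); local memory is fast but small (16 KB per group of 8 cores on the Tesla C1060 used in this work).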
Environment
• NVIDIA Tesla C1060
  – 240 scalar processors
  – 4 GB global memory
  – 102 GB/sec peak memory bandwidth
  – 16 KB shared memory per 8 cores
  – CUDA compute capability 1.3
  – Peak performance
    • 933 GFLOPS single precision (with SF)
    • 622 GFLOPS single precision (MAD)
    • 77.7 GFLOPS double precision
KMeans Clustering
• Partition a given data set into disjoint clusters
• Each iteration
  – Cluster assignment step
  – Centroid update step
• Flops per work item: 3DM + M (see the assignment-step sketch below)
  – D : number of dimensions
  – M : number of centroids
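A hedged sketch of the cluster-assignment step (an illustrative version, not the exact kernel used for the measurements; the names points, centroids, and membership are assumptions). Each work item owns one data point and computes the squared distance to each of the M centroids, 3 flops per dimension (subtract, multiply, add) plus one comparison per centroid, which is where the 3DM + M flops per work item come from.

    __kernel void kmeans_assign(__global const float *points,    // N x D data points (row-major)
                                __global const float *centroids, // M x D centroids
                                __global int *membership,        // index of nearest centroid per point
                                const int n, const int d, const int m)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        float best_dist = FLT_MAX;
        int best = 0;
        for (int c = 0; c < m; c++) {
            float dist = 0.0f;
            for (int k = 0; k < d; k++) {
                float diff = points[i * d + k] - centroids[c * d + k];
                dist += diff * diff;           // 3 flops per dimension
            }
            if (dist < best_dist) {            // 1 comparison per centroid
                best_dist = dist;
                best = c;
            }
        }
        membership[i] = best;
    }

The centroid-update step, which averages the points assigned to each centroid, completes the iteration.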
Re-using loop-invariant data
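Because the data points never change between iterations, they only need to cross the PCIe bus once; only the small loop-variant delta (the centroids) is re-sent each iteration. A hedged host-side fragment (assuming a context, queue, and the kmeans_assign kernel sketched above already exist; variable names are illustrative):

    // Loop-invariant data: copied to the GPU once, before the iterations start
    cl_mem d_points = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     n * d * sizeof(float), points, &err);
    // Loop-variant data: small buffers rewritten / read back every iteration
    cl_mem d_centroids  = clCreateBuffer(context, CL_MEM_READ_ONLY,  m * d * sizeof(float), NULL, &err);
    cl_mem d_membership = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(int),       NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_points);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_centroids);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_membership);
    clSetKernelArg(kernel, 3, sizeof(int), &n);
    clSetKernelArg(kernel, 4, sizeof(int), &d);
    clSetKernelArg(kernel, 5, sizeof(int), &m);

    for (int iter = 0; iter < max_iters; iter++) {
        // only the loop-variant centroids are transferred each iteration
        clEnqueueWriteBuffer(queue, d_centroids, CL_TRUE, 0, m * d * sizeof(float),
                             centroids, 0, NULL, NULL);
        size_t global = n;  // in practice rounded up to a multiple of the work-group size
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_membership, CL_TRUE, 0, n * sizeof(int),
                            membership, 0, NULL, NULL);
        /* ... update the centroids on the host from the new membership ... */
    }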
KMeans Clustering Optimizations
• Naïve (with data re-use)
[Chart: GFLOPS (0–120) vs. number of data points (256 to 256,000,000)]
KMeans Clustering Optimizations
• Data points copied to local memory
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A) and Data in Local Memory (B)]
KMeans Clustering Optimizations
• Cluster centroid points copied to local memory (see the sketch below)
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), and C + Data Coalescing (D)]
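A hedged sketch of the centroids-in-local-memory variant (again illustrative, not the exact production kernel): the work group first stages the M x D centroid block into __local memory cooperatively, then every work item reads the centroids from local memory instead of global memory.

    __kernel void kmeans_assign_local(__global const float *points,
                                      __global const float *centroids,
                                      __local  float *local_centroids,  // m * d floats, sized by the host
                                      __global int *membership,
                                      const int n, const int d, const int m)
    {
        int i   = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        // cooperative copy of all centroids into local memory
        for (int idx = lid; idx < m * d; idx += lsz)
            local_centroids[idx] = centroids[idx];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (i >= n) return;
        float best_dist = FLT_MAX;
        int best = 0;
        for (int c = 0; c < m; c++) {
            float dist = 0.0f;
            for (int k = 0; k < d; k++) {
                float diff = points[i * d + k] - local_centroids[c * d + k];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        membership[i] = best;
    }

Since every work item scans all M centroids in every iteration, serving those reads from local memory removes a large fraction of the global-memory traffic.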
KMeans Clustering Optimizations
• Local-memory data points in column-major order (see the fragment below)
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D), and D + Local Data Points Column Major]
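The layout optimizations are essentially an indexing change. A hedged fragment of the distance loop (i is the data point handled by this work item, c the centroid being tested):

    for (int k = 0; k < d; k++) {
        // row-major layout:    points[i * d + k]  -- neighbouring work items access
        //                                            addresses d floats apart
        // column-major layout: points[k * n + i]  -- neighbouring work items access
        //                                            consecutive addresses
        float diff = points[k * n + i] - local_centroids[c * d + k];
        dist += diff * diff;
    }

With the column-major order, consecutive work items touch consecutive addresses, which gives coalesced global-memory loads (the coalescing optimization) and, for the copy of the data points kept in local memory, helps avoid bank conflicts.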
KMeans Clustering Performance
• Varying number of clusters (centroids)
[Chart: GFLOPS (0–140) vs. number of data points for 50, 100, 200, 300, and 360 centroids]
KMeans Clustering Performance
• Varying number of dimensions
[Chart: speedup (GPU vs. single-core CPU) vs. number of data points for 2D and 4D data with 100 and 300 centroids (2D-100, 2D-300, 4D-100, 4D-300)]
KMeans Clustering Performance
• Increasing number of iterations
[Charts: GFLOPS vs. number of data points, and time per iteration (ms, log scale) vs. number of data points, for 5, 10, 15, and 20 iterations]
KMeans Clustering Overhead
[Chart: Double Compute, Regular (Single Compute), and Compute Only times (log scale) vs. number of data points, with the relative overhead (%) on a secondary axis]
Multi-Dimensional Scaling
• Map a data set in a high-dimensional space to a data set in a lower-dimensional space
• Uses an N×N dissimilarity matrix as the input
  – Output usually in 3D (N×3) or 2D (N×2) space
• Flops per work item: 8DN + 7N + 3D + 1
  – D : target dimension
  – N : number of data points
• SMACOF MDS algorithm (a kernel sketch follows)
http://salsahpc.indiana.edu/
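As a concrete illustration, here is a hedged per-work-item sketch of the unweighted SMACOF update (the Guttman transform X(k+1) = (1/N) B(X(k)) X(k)); it is not the paper's exact kernel, the stress computation is omitted, and the names delta, x, and x_new are assumptions. Work item i produces row i of X(k+1), and the N×N dissimilarity matrix delta is exactly the loop-invariant data whose caching is evaluated on the next slide. The target dimension d is assumed to be at most 3.

    __kernel void smacof_update(__global const float *delta,  // N x N dissimilarities (loop-invariant)
                                __global const float *x,      // N x D current embedding X(k)
                                __global float *x_new,        // N x D updated embedding X(k+1)
                                const int n, const int d)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        float acc[3] = {0.0f, 0.0f, 0.0f};   // assumes target dimension d <= 3
        float diag = 0.0f;                    // accumulates b_ii = -sum_{j != i} b_ij

        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float dist = 0.0f;                // distance between points i and j in the target space
            for (int k = 0; k < d; k++) {
                float diff = x[i * d + k] - x[j * d + k];
                dist += diff * diff;
            }
            dist = sqrt(dist);
            float b_ij = (dist > 1e-7f) ? -delta[i * n + j] / dist : 0.0f;
            diag -= b_ij;
            for (int k = 0; k < d; k++)
                acc[k] += b_ij * x[j * d + k];
        }
        for (int k = 0; k < d; k++)           // add the diagonal term and scale by 1/N
            x_new[i * d + k] = (acc[k] + diag * x[i * d + k]) / (float)n;
    }

Each work item streams over an entire row of the dissimilarity matrix and over the whole current embedding, which is why the memory-access-to-flop ratio is high and the caching and local-memory optimizations on the following slides matter.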
MDS Optimizations
• Re-using loop-invariant data
[Chart: speedup from caching the loop-invariant data (%, up to ~90%) vs. number of data points N (0–20,000)]
MDS Optimizations
• Naïve (with loop-invariant data reuse)
[Chart: performance (GFLOPS, 0–70) vs. number of data points N (0–20,000)]
MDS Optimizations
• Results stored in shared (local) memory
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve and Results in Shared Mem]
MDS Optimizations
• X(k) stored in shared (local) memory
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve, Results in Shared Mem, and X(k) in Shared Mem]
MDS Optimizations
• Data points coalesced
[Chart: performance (GFLOPS, 0–70) vs. N for Naïve, Results in Shared Mem, X(k) in Shared Mem, and Data Points Coalesced]
MDS Performance
• Increasing number of iterations
[Charts: GPU speedup vs. N, and performance (GFLOPS, 0–70) vs. N, for 10, 25, 50, and 100 iterations]
MDS Overhead
[Chart: Double Compute, Regular (Single Compute), and Compute Only times (log scale) vs. number of data points N, with the relative overhead (%) on a secondary axis]
PageRank
• Analyzes linkage information to measure the relative importance of web pages
• Core computation is a sparse matrix-vector multiplication
• Web graph
  – Very sparse
  – Power-law distribution
Sparse Matrix Representations
• ELLPACK
• Compressed Sparse Row (CSR), as sketched below
http://www.nvidia.com/docs/IO/66889/nvr-2008-004.pdf
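For reference, a minimal scalar CSR sparse matrix-vector multiply kernel (one row per work item), in the spirit of the kernels described in the NVIDIA report linked above; this is an illustrative sketch, not the exact PageRank kernel used here. For PageRank, x would hold the current rank vector and y the accumulated in-link contributions, with the damping/teleport term applied separately.

    __kernel void spmv_csr_scalar(__global const int   *row_ptr,  // n_rows + 1 row offsets
                                  __global const int   *col_idx,  // column index of each non-zero
                                  __global const float *values,   // non-zero values
                                  __global const float *x,        // input vector
                                  __global float       *y,        // output vector
                                  const int n_rows)
    {
        int row = get_global_id(0);          // one work item per matrix row
        if (row >= n_rows) return;

        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; jj++)
            dot += values[jj] * x[col_idx[jj]];
        y[row] = dot;
    }

ELLPACK instead pads every row to a fixed number of non-zeros, which suits the GPU but wastes space on a power-law graph; that is why the hybrid versions on the next slide keep the short rows (K(i) below a threshold) in ELLPACK on the GPU and hand the few long rows to the CPU.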
PageRank Implementations
[Chart: time (ms) vs. number of iterations (10–150) for CPU only, and for hybrid versions that process rows with K(i) < 4, K(i) < 7, or K(i) < 16 non-zeros in ELLPACK on the GPU and the remaining rows on the CPU]
Lessons
• Reusing loop-invariant data
• Leveraging local memory
• Optimizing data layout
• Sharing work between CPU & GPU
OpenCL experience
• Flexible programming environment
• Support for work-group-level synchronization primitives
• Lack of debugging support
• Lack of dynamic memory allocation
• More of a compilation target than a user programming environment?
Future Work
• Extending kernels to distributed environments
• Comparing with CUDA implementations
• Exploring more aggressive CPU/GPU sharing
• Studying more application kernels
• Data reuse in the pipeline
Acknowledgements
• This work was started as a class project for CSCI-B649: Parallel Architectures (Spring 2010) at the IU School of Informatics and Computing.
• Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.
• We thank Seung-Hee Bae, Bingjing Zhang, Hui Li, and the SALSA group (http://salsahpc.indiana.edu/) for their algorithmic insights.
Questions
Thank You!
KMeans Clustering Optimizations
• Data in global memory coalesced
[Chart: GFLOPS (0–120) vs. number of data points for Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), and C + Data Coalescing (D)]