TRANSCRIPT
Euro-Par, 2006 / ICS 2009
A Translation System for Enabling Data Mining Applications on GPUs
Wenjing Ma Gagan Agrawal
The Ohio State University
Motivation and Overview
• Two Popular Trends
– Data-intensive computing
– GPU programming
• Seems like a good match
• Can we ease use of GPGPUs ?
– Domain-specific Programming Tool
– Can exploit common programming structure
– Enable good speedups
Context
• Many years of work on compiler and runtime support for data-intensive applications
– Clusters, SMPs, clusters of SMPs
– FREERIDE and language front-ends
• Similar to map-reduce, but …
– Predates it and performs better!
• Recent work on
– (Clusters of) multi-cores, incorporating RSTM
– GPUs: C and Matlab front-ends
– Clusters of GPUs, multi-core and GPUs
Outline
• Background
– GPU computing
– Parallel data mining
• Challenges of data mining on GPUs
• Architecture of the system
– Sequential code analysis
– Generation of CUDA programs
– Optimization techniques
• Experimental results
– k-means, EM, PCA
• Related and future work
Background - GPU Computing
• Many-core architectures/accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over the use of graphics libraries
• But …
– Requires detailed knowledge of the GPU architecture and a new language
– Must specify the grid configuration
– Must deal with memory allocation and movement
– Explicit management of the memory hierarchy
Parallel Data mining
• Common structure of data mining applications (FREERIDE)
    /* outer sequential loop */
    while() {
        /* Reduction loop */
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
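The generic structure above can be made concrete with a toy instance. A minimal C sketch, assuming a histogram-style reduction where op is addition; the element encoding and the names NUM_BINS and reduction_loop are illustrative, not from the talk:

```c
#define NUM_BINS 4

/* Illustrative process(): map element e to a (slot index, value) pair. */
static void process(int e, int *i, int *val) {
    *i = e % NUM_BINS;   /* which reduction slot */
    *val = e;            /* contribution to that slot */
}

/* The FREERIDE reduction loop, with op = +, over data[0..n-1]. */
void reduction_loop(const int *data, int n, int Reduc[NUM_BINS]) {
    for (int b = 0; b < NUM_BINS; b++)
        Reduc[b] = 0;
    for (int k = 0; k < n; k++) {
        int i, val;
        process(data[k], &i, &val);
        Reduc[i] += val;   /* Reduc(i) = Reduc(i) op val */
    }
}
```

Because every update goes through the associative, commutative op, the iterations of the Foreach loop can be partitioned among threads with private copies of Reduc and combined afterward.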
Porting on GPUs
• High-level parallelization is straightforward
• Details of data movement
• Impact of thread count on reduction time
• Use of shared memory
Architecture of the System
[System architecture diagram: user input (variable information, reduction functions, optional functions) feeds the Code Analyzer (in LLVM) and the Variable Analyzer; variable access patterns and combination operations flow to the Code Generator, which emits the host program (grid configuration and kernel invocation) and the kernel functions, compiled together into the executable.]
User Input
• A sequential reduction function
• Optional functions (initialization function, combination function, …)
• Values of each variable, or the size of each array
• Variables to be used in the reduction function
Analysis of Sequential Code
• Get the access features of each variable
• Determine the data to be replicated
• Get the operator for global combination
• Identify variables for shared memory
Memory Allocation and Copy
• Need a copy for each thread
• Copy the updates back to host memory after the kernel reduction function returns
[Diagram: per-thread copies of the reduction arrays A, B, and C, laid out across threads T0 … T63.]
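The per-thread replication above implies a global combination step once the kernel returns. A minimal CPU sketch, assuming the private copies are laid out contiguously (copy t starting at offset t*n) and op is addition; the name global_combine is illustrative:

```c
/* Combine num_threads private copies of a length-n reduction array
 * (copy t occupies copies[t*n .. t*n + n-1]) into out[0..n-1]. */
void global_combine(const int *copies, int num_threads, int n, int *out) {
    for (int j = 0; j < n; j++)
        out[j] = 0;
    for (int t = 0; t < num_threads; t++)
        for (int j = 0; j < n; j++)
            out[j] += copies[t * n + j];   /* op = + */
}
```

In the generated code this combination can run on the device or, as pictured, after the updates are copied back to host memory.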
Extracting Variable Access Information
[Diagram: the variable analyzer takes the IR from LLVM and the user input, and extracts the variables to be written, the read-only variables, and the temporary variables, producing the argument list.]
Generating CUDA Code and the C/C++ Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
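The grid-configuration step above amounts to choosing block and thread counts that cover the data. A minimal sketch of the standard ceiling-division rule, assuming a fixed threads_per_block; the function name is illustrative, and the real system also weighs thread count against reduction time:

```c
/* Number of blocks of `threads_per_block` threads needed to cover
 * n elements: ceiling of n / threads_per_block. */
int num_blocks(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, 1000 elements with 256-thread blocks need 4 blocks, the last of which is only partially occupied, so the generated kernel still guards against out-of-range indices.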
Kernel Reduction Function
• Generated from the original sequential code
• Divide the main loop by block_number and thread_number
• Replace the access offsets with appropriate indices
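The loop division above can be illustrated on the CPU. A sketch assuming a strided partition, one plausible way to divide the main loop (the talk does not specify the exact scheme): thread t of block b in a grid of B blocks of T threads handles elements tid, tid + B*T, tid + 2*B*T, …; the function name is illustrative:

```c
/* Simulate the generated kernel's index transformation: count how many
 * of the n loop iterations thread (b, t) executes under a strided
 * partition over B blocks of T threads. */
int elements_for_thread(int n, int B, int T, int b, int t) {
    int tid = b * T + t;         /* flattened global thread id */
    int count = 0;
    for (int e = tid; e < n; e += B * T)   /* this thread's share */
        count++;
    return count;
}
```

Summing over all (b, t) pairs covers every iteration exactly once, which is what lets the generated kernel preserve the sequential loop's semantics.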
Optimizations
• Using shared memory
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated only once
Dealing with Shared memory
• Size = length * sizeof(type) * thread_info
– length: size of the array
– type: char, int, or float
– thread_info: whether the array is copied to each thread
• Mark each array as shared until the size exceeds the limit of shared memory
Shared memory layout Strategies
• No sorting
• Greedy sorting
• Write-first sorting
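The greedy strategy can be sketched on top of the size formula above. A minimal sketch under an assumption the slides do not spell out: arrays are considered in ascending order of effective size (length * sizeof(type), times the thread count if replicated per thread), and each is marked shared while the running total fits; the name place_greedy is illustrative:

```c
/* Greedy shared-memory placement: given effective sizes in ascending
 * order, mark each array shared (in_shared[k] = 1) while the total
 * stays within `limit`; return how many arrays were placed. */
int place_greedy(const long *sizes, int n, long limit, int *in_shared) {
    long used = 0;
    int placed = 0;
    for (int k = 0; k < n; k++) {
        if (used + sizes[k] <= limit) {
            used += sizes[k];
            in_shared[k] = 1;
            placed++;
        } else {
            in_shared[k] = 0;   /* stays in device (global) memory */
        }
    }
    return placed;
}
```

Sorting before placing is what distinguishes greedy sorting from the no-sorting strategy, which considers arrays in declaration order; write-first sorting instead prioritizes arrays that are written.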
No sorting
[Diagram: arrays A, B, C, D placed in shared memory in their original order.]
Greedy sorting
[Diagram: arrays A, B, C, D reordered by the greedy strategy before placement in shared memory.]
Other Optimizations
• Reducing memory allocation and copy overhead
– Arrays shared by multiple iterations can be allocated and copied only once
• User-defined combination function
Applications
• K-means clustering
• EM clustering
• PCA
Experimental Results
[Figure: Speedup of k-means]
[Figure: Speedup of EM]
[Figure: Speedup of PCA]
Related Work
• OpenMP to CUDA (Purdue)
• Domain-specific operators to CUDA (NEC)
• CUDA-lite, etc. (Illinois)
• Various application studies
Conclusions
• Automatic CUDA Code Generation and Optimization is feasible
• Restricting to a domain / communication style helps
• Interesting new compiler optimizations