TRANSCRIPT
Euro-Par, 2006 / ICS 2009
A Translation System for Enabling Data Mining Applications on GPUs
Wenjing Ma Gagan Agrawal
The Ohio State University
Motivation and Overview
• Two Popular Trends
– Data-intensive computing
– GPU programming
• Seems like a good match
• Can we ease use of GPGPUs ?
– Domain-specific Programming Tool
– Can exploit common programming structure
– Enable good speedups
Context
• Many years of work on compiler and runtime support for data-intensive applications
– Clusters, SMPs, clusters of SMPs
– FREERIDE and language front-ends
• Similar to map-reduce, but …
– Predates it and performs better!
• Recent work on
– (Clusters of) multi-cores, incorporating RSTM
– GPUs: C and Matlab front-ends
– Clusters of GPUs, multi-core and GPUs
Outline
• Background
– GPU computing
– Parallel data mining
• Challenges of data mining on GPUs
• Architecture of the system
– Sequential code analysis
– Generation of CUDA programs
– Optimization techniques
• Experimental results
– k-means, EM, PCA
• Related and future work
Background - GPU Computing
• Many-core architectures/accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over the use of graphics libraries
• But …
– Requires detailed knowledge of the GPU architecture and a new language
– Must specify the grid configuration
– Must deal with memory allocation and movement
– Explicit management of the memory hierarchy
Parallel Data mining
• Common structure of data mining applications (FREERIDE)
    /* outer sequential loop */
    while() {
        /* Reduction loop */
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
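The generic structure above can be made concrete with a toy instance. A minimal C sketch, assuming a histogram-style reduction where op is addition; the element encoding and the names NUM_BINS and reduction_loop are illustrative, not from the talk:

```c
#define NUM_BINS 4

/* Illustrative process(): map element e to a (slot index, value) pair. */
static void process(int e, int *i, int *val) {
    *i = e % NUM_BINS;   /* which reduction slot */
    *val = e;            /* contribution to that slot */
}

/* The FREERIDE reduction loop, with op = +, over data[0..n-1]. */
void reduction_loop(const int *data, int n, int Reduc[NUM_BINS]) {
    for (int b = 0; b < NUM_BINS; b++)
        Reduc[b] = 0;
    for (int k = 0; k < n; k++) {
        int i, val;
        process(data[k], &i, &val);
        Reduc[i] += val;   /* Reduc(i) = Reduc(i) op val */
    }
}
```

Because every update goes through the associative, commutative op, the iterations of the Foreach loop can be partitioned among threads with private copies of Reduc and combined afterward.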
Porting on GPUs
• High-level parallelization is straightforward
• Details of data movement
• Impact of thread count on reduction time
• Use of shared memory
Architecture of the System
[System architecture diagram: user input (variable information, reduction functions, optional functions) feeds the Code Analyzer (in LLVM) and the Variable Analyzer; variable access patterns and combination operations flow to the Code Generator, which emits the host program (grid configuration and kernel invocation) and the kernel functions, compiled together into the executable.]
User Input
• A sequential reduction function
• Optional functions (initialization function, combination function, …)
• Values of each variable, or the size of each array
• Variables to be used in the reduction function
Analysis of Sequential Code
• Get the access features of each variable
• Determine the data to be replicated
• Get the operator for global combination
• Identify variables for shared memory
Memory Allocation and Copy
• Need a copy for each thread
• Copy the updates back to host memory after the kernel reduction function returns
[Diagram: per-thread copies of the reduction arrays A, B, and C, laid out across threads T0 … T63.]
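The per-thread replication above implies a global combination step once the kernel returns. A minimal CPU sketch, assuming the private copies are laid out contiguously (copy t starting at offset t*n) and op is addition; the name global_combine is illustrative:

```c
/* Combine num_threads private copies of a length-n reduction array
 * (copy t occupies copies[t*n .. t*n + n-1]) into out[0..n-1]. */
void global_combine(const int *copies, int num_threads, int n, int *out) {
    for (int j = 0; j < n; j++)
        out[j] = 0;
    for (int t = 0; t < num_threads; t++)
        for (int j = 0; j < n; j++)
            out[j] += copies[t * n + j];   /* op = + */
}
```

In the generated code this combination can run on the device or, as pictured, after the updates are copied back to host memory.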
Extracting Variable Access Information
[Diagram: the variable analyzer takes the IR from LLVM and the user input, and extracts the variables to be written, the read-only variables, and the temporary variables, producing the argument list.]
Generating CUDA Code and the C/C++ Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
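The grid-configuration step above amounts to choosing block and thread counts that cover the data. A minimal sketch of the standard ceiling-division rule, assuming a fixed threads_per_block; the function name is illustrative, and the real system also weighs thread count against reduction time:

```c
/* Number of blocks of `threads_per_block` threads needed to cover
 * n elements: ceiling of n / threads_per_block. */
int num_blocks(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, 1000 elements with 256-thread blocks need 4 blocks, the last of which is only partially occupied, so the generated kernel still guards against out-of-range indices.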
Kernel Reduction Function
• Generated from the original sequential code
• Divide the main loop by block_number and thread_number
• Replace the access offsets with appropriate indices
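The loop division above can be illustrated on the CPU. A sketch assuming a strided partition, one plausible way to divide the main loop (the talk does not specify the exact scheme): thread t of block b in a grid of B blocks of T threads handles elements tid, tid + B*T, tid + 2*B*T, …; the function name is illustrative:

```c
/* Simulate the generated kernel's index transformation: count how many
 * of the n loop iterations thread (b, t) executes under a strided
 * partition over B blocks of T threads. */
int elements_for_thread(int n, int B, int T, int b, int t) {
    int tid = b * T + t;         /* flattened global thread id */
    int count = 0;
    for (int e = tid; e < n; e += B * T)   /* this thread's share */
        count++;
    return count;
}
```

Summing over all (b, t) pairs covers every iteration exactly once, which is what lets the generated kernel preserve the sequential loop's semantics.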
Optimizations
• Using shared memory
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated only once
Dealing with Shared memory
• Size = length * sizeof(type) * thread_info
– length: size of the array
– type: char, int, or float
– thread_info: whether the array is copied to each thread
• Mark each array as shared until the size exceeds the limit of shared memory
Shared memory layout Strategies
• No sorting
• Greedy sorting
• Write-first sorting
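The greedy strategy can be sketched on top of the size formula above. A minimal sketch under an assumption the slides do not spell out: arrays are considered in ascending order of effective size (length * sizeof(type), times the thread count if replicated per thread), and each is marked shared while the running total fits; the name place_greedy is illustrative:

```c
/* Greedy shared-memory placement: given effective sizes in ascending
 * order, mark each array shared (in_shared[k] = 1) while the total
 * stays within `limit`; return how many arrays were placed. */
int place_greedy(const long *sizes, int n, long limit, int *in_shared) {
    long used = 0;
    int placed = 0;
    for (int k = 0; k < n; k++) {
        if (used + sizes[k] <= limit) {
            used += sizes[k];
            in_shared[k] = 1;
            placed++;
        } else {
            in_shared[k] = 0;   /* stays in device (global) memory */
        }
    }
    return placed;
}
```

Sorting before placing is what distinguishes greedy sorting from the no-sorting strategy, which considers arrays in declaration order; write-first sorting instead prioritizes arrays that are written.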
No sorting
[Diagram: arrays A, B, C, D placed in shared memory in their original order.]
Greedy sorting
[Diagram: arrays A, B, C, D reordered by the greedy strategy before placement in shared memory.]
Other Optimizations
• Reducing memory allocation and copy overhead
– Arrays shared by multiple iterations can be allocated and copied only once
• User-defined combination function
Applications
• K-means clustering
• EM clustering
• PCA
Experimental Results
[Figure: Speedup of k-means]
[Figure: Speedup of EM]
[Figure: Speedup of PCA]
Related Work
• OpenMP to CUDA (Purdue)
• Domain-specific operators to CUDA (NEC)
• CUDA-lite, etc. (Illinois)
• Various application studies
Conclusions
• Automatic CUDA Code Generation and Optimization is feasible
• Restricting to a domain / communication style helps
• Interesting new compiler optimizations