Sponge: Portable Stream Programming on Graphics Engines

Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science

Why GPUs?
• Every mobile and desktop system will have one
• Affordable and high performance
• Over-provisioned
• Programmable
• Example: Sony PlayStation Phone
[Chart: theoretical GFLOP/s from 2002 to 2011 — NVIDIA GPUs (GeForce 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) versus Intel CPUs; the GPU curve approaches 1500 GFLOP/s while the CPU curve stays far below.]

GPU Architecture
[Diagram: the CPU launches Kernel 1 and then Kernel 2 over time onto the GPU. The GPU contains 30 streaming multiprocessors (SM 0 – SM 29), each with 8 scalar cores, a register file, and shared memory, connected through an interconnection network to global memory (device memory).]

GPU Programming Model
[Diagram: grids of thread blocks run in sequence (Grid 0, then Grid 1); each block has per-block shared memory, each thread has per-thread registers and per-thread local memory, all above per-application device global memory.]
• __device__ int GlobalVar → per-application device global memory
• __shared__ int SharedVar → per-block shared memory
• int RegisterVar → per-thread register
• int LocalVarArray[10] → per-thread local memory
• Threads are grouped into blocks, and blocks into a grid
• All the threads run one kernel
• Registers are private to each thread
• Registers spill to local memory
• Shared memory is shared between the threads of a block
• Global memory is shared between all blocks
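To make the mapping concrete, here is a minimal CUDA sketch (the kernel name and indices are hypothetical, for illustration only):

    __device__ int GlobalVar;                    // per-application device global memory

    __global__ void memory_spaces_example(int *out) {
        __shared__ int SharedVar;                // per-block shared memory
        int RegisterVar;                         // per-thread register
        int LocalVarArray[10];                   // per-thread local memory (spilled registers)

        if (threadIdx.x == 0) SharedVar = GlobalVar;   // one thread fills the shared variable
        __syncthreads();                         // make it visible to the whole block

        RegisterVar = SharedVar + threadIdx.x;
        LocalVarArray[threadIdx.x % 10] = RegisterVar;
        out[blockIdx.x * blockDim.x + threadIdx.x] = LocalVarArray[threadIdx.x % 10];
    }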

GPU Execution Model
[Diagram: the thread blocks of Grid 1 are distributed across the GPU's streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, …, SM 30 in the figure), each with its own registers, shared memory, and 8 cores.]

GPU Execution Model
[Diagram: Blocks 0–3 are assigned to SM 0, which has 8 cores, registers, and shared memory. Within a block, threads are grouped into 32-thread warps: Warp 0 holds thread IDs 0–31 and Warp 1 holds thread IDs 32–63.]
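A small sketch of how a thread finds its warp (hypothetical kernel; warpSize is 32 on these GPUs):

    __global__ void warp_mapping_example(int *warp_of_thread) {
        int tid  = threadIdx.x;                  // thread ID within the block
        int warp = tid / warpSize;               // threads 0-31 -> warp 0, 32-63 -> warp 1
        warp_of_thread[blockIdx.x * blockDim.x + tid] = warp;
    }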

GPU Programming Challenges
[Chart: execution time (ms) versus number of registers per thread (64, 48, 32, 16, 8) on a high-performance desktop GPU and a mobile GPU; the code optimized for the GeForce GTX 285 and the code optimized for the GeForce 8400 GS each perform poorly on the other device.]
• Restructuring data efficiently for a complex memory hierarchy: global memory, shared memory, registers
• Partitioning work between the CPU and the GPU
• Lack of portability between different generations of GPUs: number of registers, active warps, size of global memory, size of shared memory
• These parameters will vary even more: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources

Nonlinear Optimization Space
[Figure: the SAD kernel's optimization space, 908 configurations, with performance varying nonlinearly across them. [Ryoo et al., CGO '08]]
We need a higher level of abstraction!

Goals
• Write-once parallel software
• Free the programmer from low-level details
[Diagram: one parallel specification maps to many targets — (C + Pthreads) shared-memory processors, (C + intrinsics) SIMD engines, (Verilog/VHDL) FPGAs, (CUDA/OpenCL) GPUs.]

Streaming
• Higher level of abstraction
• Decouples computation and memory accesses
• Coarse-grained exposed parallelism, exposed communication
• Programmers can focus on the algorithm instead of low-level details
• Streaming actors use buffers to communicate
• Much recent work on extending the portability of streaming applications
[Diagram: an example stream graph — Actor 1 feeds a splitter, Actors 2–5 run in parallel branches, and a joiner merges the branches into Actor 6.]

Sponge
• Generates optimized CUDA for a wide variety of GPU targets
• Performs an array of optimizations on stream graphs
• Optimizes and ports across different GPU generations
• Utilizes the memory hierarchy (registers, shared memory, coalescing)
• Efficiently utilizes the streaming cores
Optimization flow: reorganization and classification, memory layout, graph restructuring, register optimization, shared/global memory, helper threads, bank-conflict resolution, loop unrolling, software prefetching

GPU Performance Model
(M = memory instructions, C = computation instructions)
• Memory-bound kernels: memory instructions fill the timeline and the computation hides beneath them, so execution time ≈ memory time
• Computation-bound kernels: memory accesses overlap with computation, so execution time ≈ computation time
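Stated as a single formula (our paraphrase of the slide's model, not one taken from the paper): because memory and computation overlap, T_kernel ≈ max(T_memory, T_computation); a kernel is memory bound when T_memory > T_computation and computation bound otherwise.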

Actor Classification
• High-Traffic actors (HiT)
  – Large number of memory accesses per actor
  – Fewer threads when shared memory is used
  – Using shared memory underutilizes the processors
• Low-Traffic actors (LoT)
  – Fewer memory accesses per actor
  – More threads
  – Using shared memory increases performance

Global Memory Accesses
(A[i, j] denotes an actor A with i pops and j pushes)
[Diagram: four threads each run one A[4,4] actor directly out of global memory; thread t reads elements 4t through 4t+3, so at each step the four threads together touch words 0, 4, 8, 12, then 1, 5, 9, 13, and so on.]
• Large access latency
• The threads do not access the words in sequence
• No coalescing
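A minimal CUDA sketch of the pattern on this slide (hypothetical kernel; POPS = 4 as in A[4,4]): each thread reads a contiguous chunk, so at any instant the threads of a warp touch addresses strided by POPS, which the hardware cannot coalesce.

    #define POPS 4

    __global__ void uncoalesced_actor(const float *in, float *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int k = 0; k < POPS; ++k)
            acc += in[tid * POPS + k];   // step k: threads 0,1,2,3 hit words k, 4+k, 8+k, 12+k
        out[tid] = acc;                  // toy "work": the actor just sums its pops
    }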

Shared Memory
[Diagram: the four threads first cooperatively copy the input from global memory to shared memory in a coalesced pattern (together they move words 0–3, then 4–7, then 8–11, then 12–15), each runs its A[4,4] actor out of shared memory, and then they copy the results back from shared to global memory the same way.]

Using Shared Memory
• Shared memory is 100x faster than global memory
• Coalesces all global memory accesses
• The number of threads is limited by the size of the shared memory

Original kernel:
Begin Kernel <<<Blocks, Threads>>>:
  For number of iterations
    Work
End Kernel

With shared-memory staging:
Begin Kernel <<<Blocks, Threads>>>:
  For number of iterations
    For number of pops
      Shared ← Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global ← Shared
End Kernel
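A minimal CUDA sketch of the staged version (hypothetical kernel, POPS = PUSHES = 4, toy actor body; launch with blockDim.x * POPS * sizeof(float) bytes of dynamic shared memory). This illustrates the transformation, not Sponge's generated code:

    #define POPS   4
    #define PUSHES 4

    __global__ void staged_actor(const float *in, float *out, int iters) {
        extern __shared__ float buf[];                      // blockDim.x * POPS floats
        int tid = threadIdx.x;
        for (int it = 0; it < iters; ++it) {
            int base = (blockIdx.x * iters + it) * blockDim.x * POPS;
            for (int k = 0; k < POPS; ++k)                  // Shared <- Global, coalesced:
                buf[k * blockDim.x + tid] =                 // consecutive threads read
                    in[base + k * blockDim.x + tid];        // consecutive global words
            __syncthreads();
            float acc = 0.0f;                               // Work: pop 4 values, push 4
            for (int k = 0; k < POPS; ++k)
                acc += buf[tid * POPS + k];
            for (int k = 0; k < PUSHES; ++k)
                buf[tid * PUSHES + k] = acc;                // pushes go into shared memory
            __syncthreads();
            for (int k = 0; k < PUSHES; ++k)                // Global <- Shared, coalesced
                out[base + k * blockDim.x + tid] = buf[k * blockDim.x + tid];
            __syncthreads();                                // buf is reused next iteration
        }
    }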

Helper Threads
• Shared memory limits the number of threads
• Underutilized processors can fetch data
• All the helper threads are in one warp (no control-flow divergence)

Original kernel:
Begin Kernel <<<Blocks, Threads>>>:
  For number of iterations
    Work
End Kernel

With helper threads:
Begin Kernel <<<Blocks, Threads + Helpers>>>:
  For number of iterations
    If helper thread
      Shared ← Global
    syncthreads
    If worker thread
      Work
    syncthreads
    If helper thread
      Global ← Shared
End Kernel
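A minimal CUDA sketch of the split (hypothetical kernel; WORKERS, HELPERS, and the toy actor body are illustrative). One extra warp of helper threads does all the copying while the worker threads only compute; launch with <<<Blocks, WORKERS + HELPERS>>>:

    #define WORKERS 128
    #define HELPERS 32    // exactly one warp of helpers: no divergence inside a warp

    __global__ void helper_thread_actor(const float *in, float *out, int iters) {
        __shared__ float buf[WORKERS];
        int  tid    = threadIdx.x;
        bool helper = (tid >= WORKERS);                 // the last warp holds the helpers
        for (int it = 0; it < iters; ++it) {
            int base = (blockIdx.x * iters + it) * WORKERS;
            if (helper)                                 // Shared <- Global
                for (int j = tid - WORKERS; j < WORKERS; j += HELPERS)
                    buf[j] = in[base + j];
            __syncthreads();
            if (!helper)                                // Work (toy actor body)
                buf[tid] = 2.0f * buf[tid];
            __syncthreads();
            if (helper)                                 // Global <- Shared
                for (int j = tid - WORKERS; j < WORKERS; j += HELPERS)
                    out[base + j] = buf[j];
            __syncthreads();                            // buf is reused next iteration
        }
    }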

Data Prefetch
• Better register utilization
• Data for iteration i+1 is moved into registers
• Data for iteration i is moved from registers to shared memory
• Allows the GPU to overlap instructions

Without prefetching:
Begin Kernel <<<Blocks, Threads>>>:
  For number of iterations
    For number of pops
      Shared ← Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global ← Shared
End Kernel

With prefetching:
Begin Kernel <<<Blocks, Threads>>>:
  For number of pops
    Regs ← Global
  For number of iterations
    For number of pops
      Shared ← Regs
    If not the last iteration
      For number of pops
        Regs ← Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global ← Shared
End Kernel
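A minimal CUDA sketch of the double buffering (hypothetical kernel, POPS = 4, toy actor body): iteration i+1's pops are loaded into registers while iteration i is still in flight.

    #define POPS 4

    __global__ void prefetch_actor(const float *in, float *out, int iters) {
        extern __shared__ float buf[];                   // blockDim.x * POPS floats
        int tid   = threadIdx.x;
        int chunk = blockDim.x * POPS;                   // elements consumed per iteration
        float regs[POPS];                                // prefetch buffer held in registers

        for (int k = 0; k < POPS; ++k)                   // prologue: Regs <- Global
            regs[k] = in[blockIdx.x * iters * chunk + k * blockDim.x + tid];

        for (int it = 0; it < iters; ++it) {
            int base = (blockIdx.x * iters + it) * chunk;
            for (int k = 0; k < POPS; ++k)               // Shared <- Regs
                buf[k * blockDim.x + tid] = regs[k];
            if (it + 1 < iters)                          // Regs <- Global (iteration i+1);
                for (int k = 0; k < POPS; ++k)           // these loads overlap the work below
                    regs[k] = in[base + chunk + k * blockDim.x + tid];
            __syncthreads();
            float acc = 0.0f;                            // Work: toy actor body
            for (int k = 0; k < POPS; ++k)
                acc += buf[tid * POPS + k];
            __syncthreads();
            out[(blockIdx.x * iters + it) * blockDim.x + tid] = acc;  // push (direct, for brevity)
        }
    }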

Loop Unrolling
• Similar to traditional loop unrolling
• Allows the GPU to overlap instructions
• Better register utilization
• Less loop-control overhead
• Can also be applied to the memory-transfer loops

Unrolled by a factor of two:
Begin Kernel <<<Blocks, Threads>>>:
  For number of iterations/2
    For number of pops
      Shared ← Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global ← Shared
    For number of pops
      Shared ← Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global ← Shared
End Kernel
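A minimal CUDA sketch (hypothetical kernel; the body is a stand-in for the staged work above). The body is duplicated so each trip of the loop covers two iterations; nvcc's #pragma unroll can achieve the same effect automatically:

    __global__ void unrolled_actor(const float *in, float *out, int iters) {
        int stride = gridDim.x * blockDim.x;
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        // Unrolled by 2; iters is assumed even (a remainder loop would handle odd counts)
        for (int it = 0; it < iters; it += 2) {
            out[it * stride + tid]       = 2.0f * in[it * stride + tid];        // iteration i
            out[(it + 1) * stride + tid] = 2.0f * in[(it + 1) * stride + tid];  // iteration i+1
        }
    }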

Methodology
• Set of benchmarks from the StreamIt suite
• 3 GHz Intel Core 2 Duo CPU with 6 GB RAM
• NVIDIA GeForce GTX 285
  – Stream processors: 240
  – Processor clock: 1476 MHz
  – Memory configuration: 2 GB DDR3
  – Memory bandwidth: 159.0 GB/s

Results (Baseline: CPU)
[Chart: speedup (x) over the CPU, with and without data-transfer time, for DCT, FFT, Matrix Multiply, Matrix Multiply Block, Bitonic, Batcher, Radix, Merge Sort, Comparison Counting, Vector Add, Histogram, and the average. The average speedup is 24x without transfer time and 10x with it.]

Results (Baseline: GPU)
[Chart: speedup (x) over the baseline GPU code for the same benchmarks, broken down by optimization. On average, the shared/global memory optimizations contribute 64% of the benefit, prefetch/unrolling 3%, helper threads 16%, and graph restructuring 16%.]

Conclusion
• Future systems will be heterogeneous
• GPUs are an important part of such systems
• Programming complexity is a significant challenge
• Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
• It provides portability by performing an array of optimizations on stream graphs

Questions

Spatial Intermediate Representation
• StreamIt
• Main constructs:
  – Filter: encapsulates computation
  – Pipeline: expresses pipeline parallelism
  – Splitjoin: expresses task-level parallelism
  – Other constructs are not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Stateful and stateless filters
[Diagram: a hierarchical stream graph built from pipelines, filters, and splitjoins.]

Nonlinear Optimization Space
[Figure: the SAD kernel's optimization space, 908 configurations. [Ryoo et al., CGO '08]]

Bank Conflict
data = buffer[BaseAddress + s * ThreadId]
[Diagram: three threads each run an A[8,8] actor out of shared memory with stride s = 8; at each step, threads 0, 1, and 2 access words 0, 8, and 16, and since words 0 and 16 fall in the same bank, threads 0 and 2 conflict.]

Removing Bank Conflict
data = buffer[BaseAddress + s * ThreadId]
[Diagram: with stride s = 9, threads 0, 1, and 2 access words 0, 9, and 18, which map to distinct banks, so there is no conflict.]
• If GCD(number of banks, s) is 1, there will be no bank conflict; since the number of banks is a power of two, s must be odd.
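A minimal CUDA sketch of the fix (hypothetical kernel; assumes the 16 shared-memory banks of this GPU generation): padding each thread's row from 8 to 9 elements makes the stride odd, so GCD(16, s) = 1 and the threads of a half-warp hit 16 distinct banks.

    #define POPS   8
    #define STRIDE (POPS + 1)   // pad the stride 8 -> 9: odd, so no bank conflicts

    __global__ void padded_actor(const float *in, float *out) {
        extern __shared__ float buf[];              // blockDim.x * STRIDE floats
        int tid = threadIdx.x;
        for (int k = 0; k < POPS; ++k)              // stage this thread's 8 pops
            buf[tid * STRIDE + k] = in[(blockIdx.x * blockDim.x + tid) * POPS + k];
        __syncthreads();
        float acc = 0.0f;
        for (int k = 0; k < POPS; ++k)              // step k touches buf[9*tid + k]:
            acc += buf[tid * STRIDE + k];           // 9*tid mod 16 is distinct per thread
        out[blockIdx.x * blockDim.x + tid] = acc;
    }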