TRANSCRIPT
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs
Yanhao Chen1, Ari B. Hayes1, Chi Zhang2, Timothy Salmon1, Eddy Z. Zhang1
1. Rutgers University  2. University of Pittsburgh
2018 USENIX Annual Technical Conference
Sparse Matrix
• Sparse Linear Systems
  – CG
  – GMRES
  – …
• Physics-Based Simulations
  – CFD
• Deep Learning Optimizations
  – Pruned Neural Networks
• …
Sparse Matrix Operation
y = A x
A: Sparse Matrix, x: Input Vector, y: Output Vector

y_i = reduce{ a_{i,j} ⊙ x_j }
⊙: Binary Operator, i ∈ [1, …, m], j ∈ [1, …, n]
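The generalized form above (a reduction over a binary operator applied to nonzeros) can be sketched in a few lines of Python. The helper name `sparse_op`, its argument names, and the dict-of-nonzeros layout are illustrative choices for exposition, not the paper's implementation:

```python
def sparse_op(entries, x, combine, reduce_fn, identity):
    """Generalized sparse matrix operation: y_i = reduce_j{ A[i,j] (.) x[j] }.

    entries:   dict mapping (i, j) -> A[i,j], nonzero entries only
    combine:   the binary operator (.)  (e.g. * for SpMV, + for SSSP)
    reduce_fn: pairwise reduction (e.g. addition for sum, min for SSSP)
    identity:  identity element of reduce_fn (0 for sum, inf for min)
    """
    y = {}
    for (i, j), a in entries.items():
        val = combine(a, x[j])                    # A[i,j] (.) x[j]
        y[i] = reduce_fn(y.get(i, identity), val)  # fold into y_i
    return y
```

Instantiating `combine = *` and `reduce_fn = +` gives SpMV; `combine = +` and `reduce_fn = min` gives the SSSP step shown later in the talk.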
Sparse Matrix Vector Multiplication (SpMV)
!"#$%" = sum⨀ = ∗
,- = ./0{2-,4 ∗ 54}
78,8 78,9 78,: 78,;
< < < <
7:,8 < 7:,: <
< 7;,9 < 7;,;
=8=9=:=;
>8>9>:>;
* =
A x y
,- = !"#$%"{2-4⨀54}
,: = 2:,8 ∗ 58 + 2:,: ∗ 5:@ ∈ 1, … ,D , E ∈ [1, … , G]
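SpMV over a compressed sparse row (CSR) matrix can be sketched as below; CSR is a standard layout (the slide does not name one), so this is an illustrative sketch rather than the talk's actual kernel:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A in CSR form: row i's nonzeros live in
    vals[row_ptr[i]:row_ptr[i+1]], with column indices col_idx[...]."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]   # a_{i,j} * x_j
        y.append(acc)
    return y
```

With the 4×4 example matrix above (row 2 empty, rows 3 and 4 holding two nonzeros each), row 3 computes exactly y_3 = a_{3,1}·x_1 + a_{3,3}·x_3.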
Single Source Shortest Path (SSSP)
!"#$%" = min⨀ = +
,- = ./0{23- + 43}
,- = !"#$%"{2-3⨀43}
S
2
1
3
4
5
S 1 2 3 4 5S12345
67 = 89:{;7 , 2=,7+;= , 2>,7+;> }
Adjacent Matrix A
y: distance vector of j th iteration
x ∶ (previous) distance vector of j-1 th iterationA ∈ 1, … ,E , F ∈ [1, … , H]
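One SSSP iteration in this min-plus form (the relaxation step of Bellman-Ford) can be sketched as follows; this is a CPU sketch with an illustrative edge-dict layout, not GPU code:

```python
INF = float("inf")

def sssp_step(adj, x):
    """One SSSP iteration in the y = reduce{A (+) x} form, with
    reduce = min and (.) = +.  adj[(u, v)] = weight of edge u -> v;
    x[v] = distance estimate from the previous iteration."""
    y = list(x)                       # y_v starts at x_v (keep old distance)
    for (u, v), w in adj.items():
        y[v] = min(y[v], x[u] + w)    # relax edge u -> v
    return y
```

Iterating `sssp_step` until the vector stops changing yields the shortest distances from the source.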
Problem with Sparsity on GPUs
• Low data reuse is always a big problem
• e.g. SpMV
  – The input vector and the output vector can be reused a lot
  – They are usually too large to fit into the GPU's cache
  – The sparsity of the matrix causes irregular accesses to the vectors
  – This means low reuse of the data in the cache

y_i = sum{ a_{i,j} ∗ x_j }
Existing Methods to Improve Data Reuse on GPUs
• Warp Scheduling Policy
  – Throttling concurrent threads
  – Limits the number of active warps [Rogers+, MICRO'12]
  – DYNCTA: controls the number of CTAs [Kayiran+, PACT'13]
• Computation and Data Layout Transformation
  – Reduce irregular memory accesses
  – Improve Memory Coalescing [Zhang+, ICS'10]
Need Hardware Modification!
Only focus on Spatial Data Reuse!
Our Software Throttling Framework:
• Is the first Software Throttling implementation
• Is focused on Temporal Data Reuse
• Exploits the trade-off between throttling performance and GPU throughput
SpMV with Software Throttling

    | a_{1,1} a_{1,2} a_{1,3} a_{1,4} |   | x_1 |   | y_1 |
    |   0       0       0       0     |   | x_2 |   | y_2 |
    | a_{3,1}   0     a_{3,3}   0     | × | x_3 | = | y_3 |
    |   0     a_{4,2}   0     a_{4,4} |   | x_4 |   | y_4 |
                 A                        X         Y
SpMV with Software Throttling
[Figure: same A × X = Y example]
Matrix A is bypassing the cache
SpMV with Software Throttling
{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }
Original Case
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
Matrix A is bypassing the cache
Running at one time
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }
Original Case
Cache Capacity: 4
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
!" #" !$ !%!& #% #&
Matrix A is bypassing the cache
Running at one time
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
< !% #" > < !% #& > < !& #" > < !& #& >
Tim
e
Throttling Phase 1
Cache Capacity: 4
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 1
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!" #" !$ #$
All Data Can Fit into the Cache
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 2
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 2
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!% #" !& #&
All Data Can Fit into the Cache
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
2018 USENIX Annual Technical Conference
What We Need for Software Throttling
• An effective partitioning algorithm
  – All data items in one partition can fit into the cache
  – The interaction between different partitions is minimized
• Applicable scheduling methods
  – Handle the trade-off between throttling and throughput
Graph Representation
• Graph Edge Partition Model
  – Places an emphasis on Data
  – Node → Data object
  – Edge → Interaction between data

[Figure: the A × X = Y example redrawn as a bipartite graph: nodes X1–X4 and Y1–Y4, with one edge per nonzero a_{i,j}]
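The node-per-data-object, edge-per-interaction view can be built mechanically from the matrix nonzeros. The helper name `matrix_to_edge_graph` and the tagged-tuple node encoding are illustrative, not the paper's code:

```python
def matrix_to_edge_graph(entries):
    """Edge-partition view of y = A x: one node per data object
    (x_j or y_i), one edge per nonzero A[i,j] connecting them."""
    nodes, edges = set(), []
    for (i, j) in entries:
        xj, yi = ("x", j), ("y", i)     # tag nodes by which vector they belong to
        nodes.update([xj, yi])
        edges.append((xj, yi))          # nonzero a_{i,j} = interaction x_j -> y_i
    return nodes, edges
```

For the running 4×4 example (8 nonzeros, y_2 untouched), this yields 8 edges over 7 nodes.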
Why Edge Partition Model?
1. Better load balancing
   – PowerGraph [OSDI'12], Streaming Edge Partition [KDD'14], SPAC [SIGMETRICS'17]
   – A balanced vertex partition is sometimes NOT equal to a balanced workload
2. Quantifying the communication cost
3. Applies to a large class of parallel applications
   – N-body, CFD, Sparse Linear Algebra, Graph Analytics, …
Different Edge Partition Models
Load Balanced Partition vs. Data Balanced Partition

[Figure: the bipartite graph (X1–X4, Y1–Y4) split into Partition 1 and Partition 2 under each model; both partitions have # Nodes: 4 and # Edges: 3]
Data Balanced Partition → Cache-Fit Partition
Data Balanced Partition
• Recursive Bisection Framework
  – 2-way Load Balanced Edge Partition
    • SPAC [Li+, SIGMETRICS'17]
  – Minimize vertex replicas (data copies)

[Figure: bipartite graph X1–X4 / Y1–Y4; Cache Capacity: 4]
Data Balanced Partition
[Figure: the first bisection splits the edges into Partition A and Partition B; Cache Capacity: 4]
Data Balanced Partition
[Figure: Partition B is bisected again, into Partition B and Partition C over X2–X4 / Y2–Y4; Cache Capacity: 4]
Data Balanced Partition
[Figure: final result: three cache-fit partitions A, B, and C over X1–X4 / Y1–Y4; Cache Capacity: 4]
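The recursive bisection idea can be sketched as below. This is a minimal sketch: it splits the edge list at its midpoint, whereas the paper's framework uses a 2-way load-balanced, replica-minimizing edge partitioner (SPAC) at each level; the stopping criterion (distinct data endpoints fit in cache) is the same:

```python
def cache_fit_partitions(edges, capacity):
    """Recursively bisect an edge list until the data touched by each
    part (its distinct endpoints) fits within `capacity`.  The naive
    midpoint split stands in for a real min-replica edge partitioner."""
    footprint = {v for e in edges for v in e}   # distinct data objects touched
    if len(footprint) <= capacity or len(edges) <= 1:
        return [edges]                          # cache-fit: stop splitting
    mid = len(edges) // 2
    return (cache_fit_partitions(edges[:mid], capacity) +
            cache_fit_partitions(edges[mid:], capacity))
```

On the running example with capacity 4, every resulting partition's footprint fits in the cache, at the cost of replicating some shared vertices (here y1) across partitions.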
What We Need for Software Throttling
• A good partitioning algorithm
  – All data items in one partition can fit into the cache
  – The interaction between different partitions is minimized
• Applicable scheduling methods
  – Handle the trade-off between throttling and throughput
Effective Scheduling Methods
• Four different scheduling methods
  – Cache-Fit (CF)
  – Cache-Fit Queue (CF-Q)
  – Split-Join (SJ)
  – Split-Join Queue (SJ-Q)
Cache-Fit (CF) Scheduling
• Isolate the computation of different Cache-Fit Partitions
• Run one Cache-Fit Partition at a time

[Pseudocode legend: TL: tuple list; N: # of tuples; TL': new tuple list; Pi: # of tuples in TL[i]; strict barriers between CUDA kernel calls]
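CF scheduling can be simulated on the CPU as below; the real system launches one CUDA kernel per partition, with the kernel boundary acting as the strict barrier. The tuple layout ⟨x_j, y_i⟩ and names here are illustrative, not the paper's code:

```python
def cf_spmv(partitions, A, x):
    """Cache-Fit scheduling sketch for SpMV: process one cache-fit
    partition of <x_j, y_i> tuples at a time; the sequential loop
    stands in for the strict barrier between kernel launches."""
    y = {}
    for part in partitions:                       # one "kernel" per partition
        for (xj, yi) in part:                     # all tuples touch cached data
            y[yi] = y.get(yi, 0.0) + A[(yi, xj)] * x[xj]
        # strict barrier: next partition starts only after this one finishes
    return y
```

Isolation guarantees each phase's working set fits in cache, but small partitions may leave compute units idle, which is exactly the low-pipeline-utilization problem shown next.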
Low Pipeline Utilization
Kernel 1: Partition A (Cache Capacity: 4)
[Figure: cache-fit partitions A, B, C over X1–X4 / Y1–Y4; all 4 working threads are busy processing Partition A's tuples]
Low Throughput
Kernel 2: Partition B (Cache Capacity: 4)
[Figure: of the 4 working threads, some are IDLE while the rest process Partition B's tuples]
Low Pipeline Utilization
Kernel 3: Partition C (Cache Capacity: 4)
[Figure: of the 4 working threads, some are IDLE while the rest process Partition C's tuples → low pipeline utilization]
Cache-Fit Queue (CF-Q) Scheduling
• Invoke a single kernel call but still enable throttling
• Set up a FIFO queue
• Each entry corresponds to a chunk
  – A chunk is part of a cache-fit partition
  – Adjacent chunks are from the same Cache-Fit Partition
• Each warp fetches a chunk from the queue and processes it

[Figure: the queue holds Chunk 1 … Chunk N of Cache-Fit Partition 1, followed by Chunk 1 … Chunk M of Cache-Fit Partition 2, …]
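The CF-Q queue can be mimicked on the CPU with threads standing in for warps. In this sketch a lock plays the role of an atomic fetch-and-add on the queue head; names and structure are illustrative, not the paper's kernel:

```python
import threading

def run_cfq(chunks, n_warps, process):
    """CF-Q sketch: a shared FIFO of chunks; each 'warp' (thread here)
    repeatedly dequeues the next chunk and processes it.  Because
    adjacent chunks come from the same cache-fit partition, ordering
    is only loosely enforced (a relaxed barrier between partitions)."""
    head, lock = [0], threading.Lock()

    def warp():
        while True:
            with lock:               # stands in for atomicAdd on the queue head
                i = head[0]
                head[0] += 1
            if i >= len(chunks):
                return               # queue drained
            process(chunks[i])

    threads = [threading.Thread(target=warp) for _ in range(n_warps)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because all warps live in a single kernel launch, the pipeline stays full across partition boundaries, trading the strict barrier of CF for a relaxed one.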
Cache-Fit Queue (CF-Q) Scheduling cont.
[Figure: over time, Warp 1 and Warp 2 each dequeue chunks (Chunk 1, Chunk 2, Chunk 3, Chunk 4, … Chunk N) from Partition 1's portion of the queue, then cross a relaxed barrier into Partition 2's chunks]
Split-Join (SJ) Scheduling
• Dynamically merge Cache-Fit Partitions
• Perform Online Profiling to decide which partitions should be merged
• Use the Tree Representation of the data-balanced partition to help the Online Profiling
Split-Join (SJ) Scheduling
[Figure: the cache-fit partitions A, B, and C from the bisection example; Cache Capacity: 4]
Split-Join (SJ) Scheduling
stime: measured standalone running time
btime: optimal running time on this node

                All Edges
              /           \
    Partition A        Partitions B, C
                        /          \
               Partition B      Partition C

A.stime: 0.5, A.btime: 0.5
B.stime: 0.2, B.btime: 0.2
C.stime: 0.2, C.btime: 0.2
Split-Join (SJ) Scheduling
stime: measured standalone running time
btime: optimal running time on this node

                All Edges              All.stime: 1.2, All.btime: 0.8
              /           \
    Partition A        Partitions B, C   BC.stime: 0.3, BC.btime: 0.3
                        /          \
               Partition B      Partition C

A.stime: 0.5, A.btime: 0.5; B.stime: 0.2, B.btime: 0.2; C.stime: 0.2, C.btime: 0.2
BC.stime < B.btime + C.btime → merge B and C
All.stime > A.btime + BC.btime → do not merge A with B, C
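The merge decision on the profiling tree can be sketched as a bottom-up recursion; the tuple encoding of tree nodes is an illustrative choice, not the paper's data structure:

```python
def btime(node):
    """Split-Join profiling sketch.  Each node carries its measured
    standalone running time `stime`; btime(node) is the best achievable
    time for its subtree: run the node as one merged partition, or run
    its two children separately, whichever is faster.
    node = (stime, left, right), with left = right = None at a leaf."""
    stime, left, right = node
    if left is None:
        return stime                  # leaf: a single cache-fit partition
    split = btime(left) + btime(right)
    return min(stime, split)          # merge the children iff stime < split
```

With the slide's numbers, btime(B,C) = min(0.3, 0.2 + 0.2) = 0.3 (merge B and C) and btime(All Edges) = min(1.2, 0.5 + 0.3) = 0.8 (keep A separate).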
Split-Join Queue (SJ-Q) Scheduling
• Provide strict barriers between different merged partitions
• No barriers inside a merged partition in SJ
  – No guarantee of the execution order
• Set up one FIFO queue for each merged partition
  – Provide relaxed barriers between cache-fit partitions from the same merged partition
Four Scheduling Methods Summary

Methods | Pipeline Utilization | Profiling | Barriers         | Queue | Code Change
CF      | Low                  | No        | Strict           | No    | No
CF-Q    | High                 | No        | Relaxed          | Yes   | Yes
SJ      | High                 | Yes       | Strict           | No    | No
SJ-Q    | High                 | Yes       | Strict / Relaxed | Yes   | Yes
Software Throttling Performance
• Experiment Settings

GPU Model    | Titan X            | GTX 745
Architecture | Pascal             | Maxwell
Core #       | 5376               | 576
L2 Cache     | 3 MB               | 2 MB
CUDA Version | CUDA 8.0           | CUDA 8.0
CPU          | Intel Xeon E5-2620 | Intel Core i7-4790
Benchmarks
• Sparse Linear Algebra Workloads
  – Sparse Matrix Vector Multiplication (SPMV)
• Graph Processing Workloads
  – Bellman-Ford (BMF)
  – PageRank (PR)
• Neural Network Optimization
  – Deep Compression: Pruned AlexNet
Sparse Matrix Vector Multiplication
• Baseline: CUSP
• Matrices: Florida Sparse Matrix Collection
  – Focus on large matrices: working set cannot fit into the L2 cache
  – 228 large matrices on GTX 745
  – 192 large matrices on Titan X
Overall SPMV Performance
[Figure: average normalized speedup for SPMV on GTX 745 and Titan X, comparing Org + R, CF, SJ, CF-Q, and SJ-Q; bars range roughly 0.8×–2.0×, with a peak of 2.018×]
Effect of Cache Hit Rate
[Figure: normalized speedup on Titan X for Org + R, CF, SJ, CF-Q, and SJ-Q, broken down by baseline cache hit rate (0%–10%, 10%–20%, 20%–30%, 30%–40%, 40%–100%); two outlier bars reach 4.16 and 4.24]
Effect of Working Set Size
[Figure: normalized speedup on Titan X for Org + R, CF, SJ, CF-Q, and SJ-Q, broken down by working set size relative to the cache (1×–2×, 2×–4×, 4×–8×, 8×–∞); one bar reaches 3.02]
Graph Application Performance
[Figure: BMF and PR normalized speedup (with overhead) using SJ on Titan X, over Pokec, WebGoogle, Wikipedia*, WikiTalk, IMDB, RoadCentral, and RoadCal]
RoadCal is a small graph
Graph Application Performance cont.
[Figure: L2 cache hit rate (%) on Titan X, Original vs. SJ, for BMF (up to ~60%) and PR (up to ~35%), over Pokec, WebGoogle, Wikipedia*, WikiTalk, IMDB, RoadCentral, and RoadCal]
RoadCal is a small graph
Deep Learning Benchmark
• Deep Compression [Han+, ICLR'16]
  – Prunes AlexNet to remove low-weight elements in fully connected layers
  – Deep Compression provides us with sparse matrices

[Figure: normalized speedup on Titan X for AlexNet layers fc6, fc7, fc8, and Combined, using CF, SJ, CF-Q, and SJ-Q]
fc7 and fc8 have small vector sizes
Conclusion
• We proposed the first locality-aware Software Throttling framework for GPUs
• Our framework can increase data reuse by improving Temporal Locality
• We exploited the trade-off between cache performance and pipeline utilization