locality-aware software throttling for sparse matrix ... · –split-join (sj) –split-join queue...

55
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1 , Ari B. Hayes 1 , Chi Zhang 2 , Timothy Salmon 1 , Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh

Upload: others

Post on 26-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Yanhao Chen1, Ari B. Hayes1, Chi Zhang2, Timothy Salmon1, Eddy Z. Zhang1

1. Rutgers University2. University of Pittsburgh

Page 2: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Sparse Matrix• Sparse Linear Systems

- CG- GMRES- …

• Physics Based Simulations- CFD

• Deep Learning Optimizations- Pruned Neural Networks

• ……

Page 3: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Sparse Matrix Operation

y=A xSparse Matrix

AInput Vector

xOutput Vector

y

&' = ()*+,) { .'/ ⨀ 1/}

Binary Operator3 ∈ 1, … ,8 , 9 ∈ [1, … , ;]

Page 4: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Sparse Matrix Vector Multiplication (SpMV)

!"#$%" = sum⨀ = ∗

,- = ./0{2-,4 ∗ 54}

78,8 78,9 78,: 78,;

< < < <

7:,8 < 7:,: <

< 7;,9 < 7;,;

=8=9=:=;

>8>9>:>;

* =

A x y

,- = !"#$%"{2-4⨀54}

,: = 2:,8 ∗ 58 + 2:,: ∗ 5:@ ∈ 1, … ,D , E ∈ [1, … , G]

Page 5: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Single Source Shortest Path (SSSP)

!"#$%" = min⨀ = +

,- = ./0{23- + 43}

,- = !"#$%"{2-3⨀43}

S

2

1

3

4

5

S 1 2 3 4 5S12345

67 = 89:{;7 , 2=,7+;= , 2>,7+;> }

Adjacent Matrix A

y: distance vector of j th iteration

x ∶ (previous) distance vector of j-1 th iterationA ∈ 1, … ,E , F ∈ [1, … , H]

Page 6: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Problem with Sparsity on GPUs• Low data reuse is always a big problem

• e.g. SpMV

– The input vector and the output vector can be reused a lot

– They are usually too large to fit into GPU’s cache

– The sparsity of the matrix causes irregular accesses of the vectors

– This means low reuse of the data in the cache

!" = $%&{(",* ∗ ,*}

Page 7: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Existing Methods to Improve Data Reuse on GPUs

• Warp Scheduling Policy– Throttling concurrent threads– Limits the number of active warps [Rogers+, MICRO’12]– DYNCTA: controls the number of CTAs [Kayiran+,PACT’13]

• Computation and Data Layout Transformation– Reduce irregular memory accesses– Improve Memory Coalescing [Zhang+, ICS’10]

Need Hardware Modification!

Only focus on Spatial Data Reuse!

Page 8: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Ø Is the First Software Throttling implementation

Ø Is focused on Temporal Data Reuse

Ø Exploits the Trade-off between throttling performance and GPU throughput

Page 9: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Page 10: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Matrix A is bypassing the cache

Page 11: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Matrix A is bypassing the cache

Running at one time

Page 12: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

!" #" !$ !%!& #% #&

Matrix A is bypassing the cache

Running at one time

Page 13: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

< !% #" > < !% #& > < !& #" > < !& #& >

Tim

e

Throttling Phase 1

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Page 14: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 1

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!" #" !$ #$

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Page 15: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Page 16: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!% #" !& #&

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Page 17: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• An effective partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

Page 18: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• An effective partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

Page 19: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Graph Representation• Graph Edge Partition Model

– Places an emphasis on Data– Node → Data object– Edge → Interaction between data

1 2 3 4

1 !"," !",$ !",% !",&2 ' ' ' '

3 !%," ' !%,% '

4 ' !&,$ ' !&,&

("($(%(&

)")$)%)&

X =

A X YX1 X2 X3 X4

Y1 Y2 Y3 Y4

Page 20: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Why Edge Partition Model?1. Better load balancing

– PowerGraph [OSDI’12], Streaming Edge Partition [KDD’14], SPAC[SIGMETRICS’17]

• Balanced vertex partition is sometimes NOT equal to balanced workload

2. Quantifying the communication cost

3. Applies to a large class of parallel applications– N-body, CFD, Sparse Linear Algebra, Graph Analytics, …

20

Page 21: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Different Edge Partition ModelsLoad Balanced Partition Data Balanced Partition

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

# Nodes: 4 # Nodes: 4# Edges: 3 # Edges: 3 Cache-Fit Partition

Page 22: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Cache Capacity: 4

Page 23: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A Partition B

Cache Capacity: 4

Page 24: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A Partition B

Cache Capacity: 4

Page 25: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

Cache Capacity: 4

X2 X3 X4

Y2 Y3 Y4

Partition B Partition C

Page 26: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

Cache Capacity: 4

Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A

Page 27: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• A good partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

Page 28: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Page 29: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Page 30: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Cache-Fit (CF) Scheduling• Isolate the computation of different Cache-Fit Partitions• Run one Cache-Fit Partition at one time

TL: tuple list

N: # of tuples

TL’: new tuple list

Pi: # of tuples in TL[i]

Strict Barriers

CUDA Function

Page 31: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Low Pipeline Utilization

Kernel 1 -- Partition ACache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X1

Y1

X1

Y2

X2

Y1

4 Working Threads

X2

Y2

Page 32: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Low Throughput

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X2

Y3

X3

Y3

4 Working Threads

Kernel 2 -- Partition B

IDLE

Page 33: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Low Pipeline Utilization

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X4

Y3

X4

Y4

4 Working Threads

Kernel 3 -- Partition C

IDLE

low pipeline utilization

Page 34: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Page 35: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Cache-Fit Queue (CF-Q) Scheduling• Invoke a single kernel call but still

enable throttling

• Set up a FIFO queue • Each entry corresponds to a chunk

– A chunk is part of a cache-fit partition– Adjacent chunks are from the same

Cache-Fit Partition• Each warp fetches a chunk from the

queue and processes it

Cache-Fit Partition 1

Chunk 1

…Chunk 2

Chunk N

Que

ue

Cache-Fit Partition 2

Chunk 1

…Chunk 2

Chunk M

Page 36: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Cache-Fit Queue (CF-Q) Scheduling cont.

Cache-Fit Partition 1

Chunk 1

…Chunk 2

Chunk N

Que

ue

Cache-Fit Partition 2

Chunk 1

…Chunk 2

Chunk M

Tim

e

Warp 1 Warp 2

… …Chunk 1

Chunk 4

Chunk 2

Chunk 3

Chunk NChunk 1

Chunk 2Chunk 3

Relaxed Barrier

Chunk 3

Chunk 4

Page 37: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Page 38: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling• Dynamically merge Cache-Fit Partitions

• Perform an Online Profiling to decide which partitions should be merged

• Use the Tree Representation of the data balanced partition to help the Online Profiling

Page 39: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Page 40: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling

A.stime: 0.5A.btime: 0.5

stime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

All Edges

Partition A Partition B, C

Partition B Partition C

Page 41: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Split-Join (SJ) Schedulingstime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

BC.stime: 0.3BC.btime: 0.3

All.stime: 1.2All.btime: 0.8

BC.stime < B.btime + C.btime

All Edges

Partition A Partition B, C

Partition B Partition C

All.stime > A.btime + BC.btime

A.stime: 0.5A.btime: 0.5

Page 42: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Page 43: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Split-Join Queue (SJ-Q) Scheduling• Provide strict barriers between different merged partitions

• No barriers inside a merged partition of SJ– No guarantee of the execution order

• Set up one FIFO queue for each merged partition– Provide relaxed barriers between cache-fit partitions from the same

merged partition

Page 44: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Four Scheduling Methods Summary

Methods Pipeline Utilization Profiling Barriers Queue Code

ChangeCF Low No Strict No No

CF-Q High No Relaxed Yes YesSJ High Yes Strict No No

SJ-Q High Yes Strict / Relaxed Yes Yes

Page 45: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Software Throttling Performance• Experiment Settings

GPU Model Titan X GTX 745Architecture Pascal Maxwell

Core # 5376 576L2 Cache 3MB 2MB

CUDA Version CUDA 8.0 CUDA 8.0

CPU Intel XeonE5-2620

Intel Core i7-4790

Page 46: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Benchmarks• Sparse Linear Algebra Workloads

– Sparse Matrix Vector Multiplication (SPMV)

• Graph Processing Workloads– Bellman-Ford (BMF)– Page Rank (PR)

• Neural Network Optimization– Deep Compression: Pruned AlexNet

Page 47: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Sparse Matrix Vector Multiplication• Baseline: CUSP• Matrices: Florida Sparse Matrix Collection

– Focus on large matrices: working set cannot fit into L2 cache– 228 large matrices on GTX 745– 192 large matrices on Titan X

Page 48: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Overall SPMV Performance

2.018

0.81

1.21.41.61.8

2

Org + R CF SJ CF-Q SJ-Q

Norm

alize

d Sp

eedu

p

Average Speedup For SPMV

GTX 745Titan X

Page 49: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effect of Cache Hit Rate

0

0.5

1

1.5

2

2.5

3

Org + R CF SJ CF-Q SJ-Q

Nor

mal

ized

Spe

edup

40% - 100%30% - 40%20% - 30%10% - 20%0% - 10%

4.16 4.24Titan X

Page 50: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Effect of Working Set Size

0

0.5

1

1.5

2

2.5

3

Org + R CF SJ CF-Q SJ-Q

Nor

mal

ized

Spe

edup 1X - 2X (cache size) 2X - 4X

4X - 8X 8X ~ INF

3.02

Titan X

Page 51: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Graph Application Performance

0

0.5

1

1.5

2

2.5

3

3.5

Pokec WebGoogle Wikipedia* WikiTalk IMDB RoadCentral RoadCal

Nor

mal

ized

Spee

dup

BMF & PR Performance using SJ w/ overhead

BMFPR

RoadCal is a small Graph

������������ � �� �������

Titan X

Page 52: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Graph Application Performance cont.

0102030405060

Pokec

WebGoogle

Wiki

pedia*

Wiki

TalkIM

DB

RoadCentral

RoadCalL2 C

ache

Hit

Rate

(%)

BMF Cache Hit Rate

Original SJ

05

101520253035

Pokec

WebGoogle

Wiki

pedia*

Wiki

TalkIM

DB

RoadCentral

RoadCalL2 C

ache

Hit

Rate

(%)

PR Cache Hit Rate

Original SJ

Titan XTitan X

RoadCal is a small Graph

Page 53: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Deep Learning Benchmark• Deep Compression [Han+,ICLR’16]

– Prune AlexNet to remove low weight elements in fully connected layers

– Deep Compression provide us sparse matrices

0

0.5

1

1.5

2

fc6 fc7 fc8 Combined

Norm

alize

d Sp

eedu

p

Speedup for AlexNet using Deep Compression

CFSJCF-QSJ-Q

Titan X

fc7 and fc8 has small vector size

Page 54: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference

Conclusion• We proposed the first locality-aware Software

Throttling framework for GPUs

• Our framework can increase data reuse by improving Temporal Locality

• We exploited the Trade-off between cache performance and pipeline utilization

Page 55: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working

2018 USENIX Annual Technical Conference