APOGEE: Adaptive Prefetching on GPU for Energy Efficiency
Ankit Sethia1, Ganesh Dasika2, Mehrzad Samadi1, Scott Mahlke1
1 - University of Michigan, 2 - ARM R&D Austin
Introduction
• High throughput: 1 TeraFLOP
• High energy efficiency
• Programmability
Can efficiency be pushed further?
• High performance on mobile devices
• Lower cost in supercomputers
Background
[Figure: baseline GPU streaming multiprocessor: warp scheduler, register file banks, ALUs, SFUs, data cache, and memory controller connected to global memory.]
• Hide latency with fine-grained multithreading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead: scheduling, divergence
Motivation - I
Too many warps decrease efficiency
[Figure: normalized speedup and normalized increase in power for 2, 4, 8, 16, and 32 warps.]
Motivation - II
[Figure: normalized register file access rate and normalized accesses per register as the number of warps varies from 1 to 32.]
The hardware added to hide memory latency is under-utilized.
APOGEE: Overview
[Figure: the same streaming-multiprocessor diagram with a prefetcher added between the data cache and global memory, prefetching data from global memory into the cache.]
• Prefetch data from memory into the cache.
• Less latency to hide.
• Less multithreading needed.
• Less register context required.
Traditional CPU Prefetching
[Figure: warps W0-W31 access consecutive 32-element blocks (0-31, 32-63, ..., 4064-4095), so the addresses seen at a single load PC are interleaved across warps.]
• Stride prefetching: no consistent stride can be found (observed deltas such as 64, -32, 64).
• Next-line prefetching: the next line cannot be prefetched in time.
Neither traditional prefetching technique works.
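To make this concrete, here is a minimal C++ sketch (not the paper's hardware; StrideEntry, the address sequence, and the confidence threshold are illustrative assumptions) of a conventional PC-indexed stride detector observing warp-interleaved addresses like those in the figure: the delta at a single load PC keeps changing, so confidence never builds and no prefetch is issued.

#include <cstdint>
#include <cstdio>
#include <vector>

// One entry of a classic per-PC stride table.
struct StrideEntry {
    uint64_t last_addr  = 0;
    int64_t  stride     = 0;
    int      confidence = 0;   // prefetch only once confidence >= 2
};

int main() {
    StrideEntry e;  // entry for a single load PC
    // Addresses seen at that PC when warps are scheduled in an arbitrary
    // order (each warp touches a different 32-element block, as in the figure).
    std::vector<uint64_t> addrs = {32, 96, 64, 128, 224, 160};
    for (uint64_t a : addrs) {
        int64_t d = (int64_t)a - (int64_t)e.last_addr;
        if (d == e.stride) e.confidence++;
        else { e.stride = d; e.confidence = 0; }
        e.last_addr = a;
        std::printf("addr=%3llu delta=%4lld confidence=%d\n",
                    (unsigned long long)a, (long long)d, e.confidence);
    }
    // The deltas (32, 64, -32, 64, 96, -64) never repeat long enough for a
    // stride to be learned, which is why per-PC stride prefetching fails here.
    return 0;
}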
Fixed Offset Address Prefetching
[Figure: with fewer active warps, each warp's successive accesses to the array are separated by a constant offset, e.g., repeated offsets of 64 between Warp 0's (and Warp 1's) consecutive accesses.]
With fewer warps, each warp runs more iterations, and its accesses across iterations are separated by a fixed offset (stride × number of warps).
Timeliness in Prefetching
[Figure: timeline of a load relative to when its prefetch is sent to and received from memory, classifying the prefetch as timely, early, or slow; the interval is tracked with states 00, 01, and 10.]
• Increase the prefetch distance if the next load occurs in state 01 (the prefetch was sent but has not yet returned, i.e., it was slow).
• Decrease the prefetch distance if a correctly prefetched address is still a miss in state 10 (the data returned but arrived too early and was evicted).
[Figure: state machine over states 00, 01, and 10: sending a prefetch to memory moves 00 to 01, receiving the prefetch from memory moves 01 to 10, and a new load returns to 00. A load arriving in state 01 indicates a slow prefetch; a load arriving in state 10 indicates a timely one.]
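The adjustment rule can be summarized in a small C++ sketch. This is a software model under stated assumptions (the type names, the hit flag, and the starting distance are ours), not the RTL, but it follows the 00/01/10 states and the two rules above.

#include <cstdint>
#include <cstdio>

// Per-load-PC timeliness state, following the 00/01/10 encoding on the slide.
enum class PfState : uint8_t { Idle = 0b00, Sent = 0b01, Received = 0b10 };

struct TimelinessEntry {
    PfState state    = PfState::Idle;
    int     distance = 1;   // how many fixed-offset steps ahead to prefetch
};

void on_prefetch_sent(TimelinessEntry& e)     { e.state = PfState::Sent; }
void on_prefetch_received(TimelinessEntry& e) { e.state = PfState::Received; }

// Called when the demand load executes; hit tells whether the correctly
// predicted address was still present in the data cache.
void on_demand_load(TimelinessEntry& e, bool hit) {
    if (e.state == PfState::Sent) {
        e.distance++;                      // slow prefetch: look further ahead
    } else if (e.state == PfState::Received && !hit) {
        if (e.distance > 1) e.distance--;  // too early (evicted): pull back
    }
    e.state = PfState::Idle;               // start the next prefetch epoch
}

int main() {
    TimelinessEntry e;
    on_prefetch_sent(e);
    on_demand_load(e, /*hit=*/false);      // load arrived while in state 01
    std::printf("distance after a slow prefetch: %d\n", e.distance);  // prints 2
    return 0;
}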
FOA Prefetching
Prefetch Table entry (example): Load PC = 0x253ad, Address = 0xfcfeed, Offset = 8, Confidence = 2, Tid = 3, Distance = 2, PF Type = 0, PF State = 00.
[Figure: the current PC indexes the Prefetch Table; the miss address of a thread (e.g., thread index 4) is compared against the stored Address to learn and confirm the Offset, and a prefetch address formed from the Address, the Offset multiplied by the number of threads, and the prefetch Distance is pushed onto the Prefetch Queue with Prefetch Enable set.]
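The following is a simplified software model of the datapath above, as an illustration only: it tracks the delta between successive miss addresses at a load PC and skips the per-thread-index normalization visible in the figure; the table size, confidence threshold, and the name FoaPrefetcher are assumptions.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>

struct FoaEntry {
    uint64_t addr       = 0;   // last miss address seen at this load PC
    int64_t  offset     = 0;   // fixed offset being learned
    int      confidence = 0;
    int      distance   = 1;   // set by the timeliness logic
};

class FoaPrefetcher {
public:
    // On a data-cache miss at `pc`, return a prefetch address once the
    // offset has been confirmed.
    std::optional<uint64_t> on_miss(uint64_t pc, uint64_t miss_addr, int num_threads) {
        FoaEntry& e = table_[pc];
        int64_t d = (int64_t)miss_addr - (int64_t)e.addr;
        if (e.addr != 0 && d == e.offset) e.confidence++;
        else { e.offset = d; e.confidence = 0; }
        e.addr = miss_addr;
        if (e.confidence >= 2) {
            // The warp's next data lies offset * numThreads further on;
            // prefetch `distance` such steps ahead to stay timely.
            return miss_addr + (uint64_t)(e.offset * num_threads * e.distance);
        }
        return std::nullopt;
    }
private:
    std::unordered_map<uint64_t, FoaEntry> table_;  // 32 entries in the evaluation
};

int main() {
    FoaPrefetcher pf;
    for (uint64_t a : {0x1000ull, 0x1008ull, 0x1010ull, 0x1018ull}) {
        if (auto p = pf.on_miss(0x253ad, a, /*num_threads=*/32))
            std::printf("prefetch 0x%llx\n", (unsigned long long)*p);
    }
    return 0;
}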
Graphics Memory Accesses
Three major types of accesses:
• Fixed Offset Access (FOA): data accessed by adjacent threads in an array is separated by a fixed offset.
• Thread Invariant Address (TIA): the same address is accessed by all threads in a warp.
• Texture Access: addresses accessed during texturing operations.
[Figure: breakdown of memory accesses into FOA, TIA, Texture, and Others for ES, HS, RS, IT, ST, WT, WF, and the mean.]
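As an illustration only (plain C++ stand-ins rather than GPU code, with `tid` playing the role of a thread index; none of this is from the paper), the two prefetchable access classes look like:

#include <cstddef>
#include <cstdio>

// Fixed Offset Access (FOA): adjacent threads read addresses separated by a
// fixed offset (here sizeof(float)), so later addresses are predictable.
float foa_load(const float* in, std::size_t base, std::size_t tid) {
    return in[base + tid];
}

// Thread Invariant Address (TIA): every thread in the warp reads the same
// address (e.g., a uniform parameter), so one prefetch covers the warp.
float tia_load(const float* params, std::size_t k) {
    return params[k];
}

// Texture accesses, the third class, are generated by the texturing path and
// are not one of the two patterns targeted by the prefetchers in this talk.

int main() {
    float in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float s = 0;
    for (std::size_t tid = 0; tid < 4; ++tid) s += foa_load(in, 0, tid);  // FOA pattern
    std::printf("foa sum = %.1f, tia value = %.1f\n", s, tia_load(in, 2));
    return 0;
}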
Thread Invariant Addresses
[Figure: TIA prefetching across loop iterations. Each iteration issues loads at PC0-PC3, one of which is a thread-invariant (constant) load. Its prefetch is slow in iteration 1; the TIA Prefetch Table (Pf PC, Address, Slow bit, e.g., PC1, 0xabc) records it, and the prefetch is triggered from an earlier PC in later iterations, still slow in iteration 3 and timely by iteration 5. Prefetch addresses are pushed onto the Prefetch Queue.]
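A rough C++ sketch of the idea follows. The table fields mirror the figure, but the hoisting policy, the names, and the issue hook are assumptions, not the paper's exact mechanism.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// One TIA Prefetch Table entry, mirroring the fields in the figure.
struct TiaEntry {
    uint64_t    addr  = 0;      // thread-invariant address to prefetch
    bool        slow  = false;  // last prefetch returned after the load
    std::size_t hoist = 0;      // how many loads earlier the prefetch fires
};

class TiaPrefetcher {
public:
    // Record a thread-invariant load whose prefetch arrived late.
    void record_slow(uint64_t load_pc, uint64_t addr) {
        TiaEntry& e = table_[load_pc];
        e.addr = addr;
        e.slow = true;
    }
    // At the start of each iteration, re-issue the recorded prefetches; any
    // prefetch that was still slow is hoisted one load earlier in the loop.
    void next_iteration() {
        for (auto& [load_pc, e] : table_) {
            if (e.slow) { e.hoist++; e.slow = false; }
            issue_prefetch(load_pc - e.hoist, e.addr);   // schematic PC arithmetic
        }
    }
private:
    void issue_prefetch(uint64_t trigger_pc, uint64_t addr) {
        std::printf("prefetch 0x%llx when PC 0x%llx executes\n",
                    (unsigned long long)addr, (unsigned long long)trigger_pc);
    }
    std::unordered_map<uint64_t, TiaEntry> table_;
};

int main() {
    TiaPrefetcher pf;
    pf.record_slow(/*load_pc=*/0x20, /*addr=*/0xabc);   // slow in iteration 1
    pf.next_iteration();                                // hoisted in iteration 2
    return 0;
}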
Experimental Evaluation
• Benchmarks
o Mesa driver, SPEC Viewperf traces, MV5 GPGPU benchmarks
• Performance evaluation
o MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency
o Prefetcher: 32 entries; prefetch latency: 10 cycles
o D-cache: 64 KB, 8-way, 32 bytes per line
• Power evaluation
o Dynamic power from an analytical model (Hong et al., ISCA '10)
o Static power from published numbers and tools: FPU, SFU, caches, register file, fetch/decode/schedule
APOGEE Performance
[Figure: normalized performance of SIMT_32, STRIDE_4, MTA_4, and APOGEE_4 across the graphics benchmarks (HS, WT, WF, IT, ST, ES, RS), the GPGPU benchmarks (FFT, FT, HP, MS, SP), and the geometric mean.]
On average, APOGEE provides a 19% speedup.
APOGEE Performance - II
[Figure: normalized performance of SIMT_32, MTA_32, and APOGEE_4 across the same graphics and GPGPU benchmarks and the geometric mean.]
MTA with 32 warps is within 3% of APOGEE
Performance and Power
[Figure: normalized performance versus normalized power for SIMT, MTA, and APOGEE as the number of warps varies from 1 to 32.]
• 20% speedup over SIMT, with ~14K fewer registers.
• Around 51% improvement in performance/Watt.
Prefetcher Accuracy
[Figure: percentage of correct prefetch address predictions for each graphics and GPGPU benchmark and the mean.]
APOGEE predicts prefetch addresses with 93.5% accuracy.
D-Cache Miss Rate
[Figure: percentage reduction in D-cache miss rate for MTA_4, MTA_32, and APOGEE across the graphics and GPGPU benchmarks.]
Prefetching results in an 80% reduction in the D-cache miss rate.
Conclusion
• The use of high degrees of multithreading on GPUs is inefficient.
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
o Adapts to the timeliness of prefetching
o Prefetches for two different access patterns (FOA and TIA)
• Over 12 graphics and GPGPU benchmarks: 20% improvement in performance and 51% improvement in performance/Watt.
Questions?