APOGEE: Adaptive Prefetching on GPU for Energy Efficiency
Ankit Sethia1, Ganesh Dasika2, Mehrzad Samadi1, Scott Mahlke1
1 - University of Michigan, 2 - ARM R&D Austin
Introduction
• High throughput: 1 TeraFLOP
• High energy efficiency
• Programmability
Can efficiency be pushed further?
• High performance on mobile devices
• Lower cost in supercomputers
Background
[Figure: baseline GPU streaming multiprocessor: warp scheduler, register file banks, ALUs, SFUs, data cache, and memory controller connected to global memory.]
• Hide latency with fine-grained multithreading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead: scheduling, divergence
Motivation - I
Too many warps decrease efficiency
[Figure: normalized speedup and normalized increase in power for 2, 4, 8, 16, and 32 warps.]
Motivation - II
[Figure: normalized register file access rate and normalized accesses per register as the number of warps varies from 1 to 32.]
The hardware added to hide memory latency is under-utilized.
APOGEE: Overview
[Figure: the same streaming-multiprocessor diagram with a prefetcher added between the data cache and global memory, prefetching data from global memory into the cache.]
• Prefetch data from memory into the cache.
• Less latency to hide.
• Less multithreading needed.
• Less register context required.
Traditional CPU Prefetching
[Figure: warps W0-W31 access consecutive 32-element blocks (0-31, 32-63, ..., 4064-4095), so the addresses seen at a single load PC are interleaved across warps.]
• Stride prefetching: no consistent stride can be found (observed deltas such as 64, -32, 64).
• Next-line prefetching: the next line cannot be prefetched in time.
Neither traditional prefetching technique works.
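To make this concrete, here is a minimal C++ sketch (not the paper's hardware; StrideEntry, the address sequence, and the confidence threshold are illustrative assumptions) of a conventional PC-indexed stride detector observing warp-interleaved addresses like those in the figure: the delta at a single load PC keeps changing, so confidence never builds and no prefetch is issued.

#include <cstdint>
#include <cstdio>
#include <vector>

// One entry of a classic per-PC stride table.
struct StrideEntry {
    uint64_t last_addr  = 0;
    int64_t  stride     = 0;
    int      confidence = 0;   // prefetch only once confidence >= 2
};

int main() {
    StrideEntry e;  // entry for a single load PC
    // Addresses seen at that PC when warps are scheduled in an arbitrary
    // order (each warp touches a different 32-element block, as in the figure).
    std::vector<uint64_t> addrs = {32, 96, 64, 128, 224, 160};
    for (uint64_t a : addrs) {
        int64_t d = (int64_t)a - (int64_t)e.last_addr;
        if (d == e.stride) e.confidence++;
        else { e.stride = d; e.confidence = 0; }
        e.last_addr = a;
        std::printf("addr=%3llu delta=%4lld confidence=%d\n",
                    (unsigned long long)a, (long long)d, e.confidence);
    }
    // The deltas (32, 64, -32, 64, 96, -64) never repeat long enough for a
    // stride to be learned, which is why per-PC stride prefetching fails here.
    return 0;
}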
Fixed Offset Address Prefetching
[Figure: with fewer active warps, each warp's successive accesses to the array are separated by a constant offset, e.g., repeated offsets of 64 between Warp 0's (and Warp 1's) consecutive accesses.]
With fewer warps, each warp runs more iterations, and its accesses across iterations are separated by a fixed offset (stride × number of warps).
Timeliness in Prefetching
[Figure: timeline of a load relative to when its prefetch is sent to and received from memory, classifying the prefetch as timely, early, or slow; the interval is tracked with states 00, 01, and 10.]
• Increase the prefetch distance if the next load occurs in state 01 (the prefetch was sent but has not yet returned, i.e., it was slow).
• Decrease the prefetch distance if a correctly prefetched address is still a miss in state 10 (the data returned but arrived too early and was evicted).
[Figure: state machine over states 00, 01, and 10: sending a prefetch to memory moves 00 to 01, receiving the prefetch from memory moves 01 to 10, and a new load returns to 00. A load arriving in state 01 indicates a slow prefetch; a load arriving in state 10 indicates a timely one.]
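The adjustment rule can be summarized in a small C++ sketch. This is a software model under stated assumptions (the type names, the hit flag, and the starting distance are ours), not the RTL, but it follows the 00/01/10 states and the two rules above.

#include <cstdint>
#include <cstdio>

// Per-load-PC timeliness state, following the 00/01/10 encoding on the slide.
enum class PfState : uint8_t { Idle = 0b00, Sent = 0b01, Received = 0b10 };

struct TimelinessEntry {
    PfState state    = PfState::Idle;
    int     distance = 1;   // how many fixed-offset steps ahead to prefetch
};

void on_prefetch_sent(TimelinessEntry& e)     { e.state = PfState::Sent; }
void on_prefetch_received(TimelinessEntry& e) { e.state = PfState::Received; }

// Called when the demand load executes; hit tells whether the correctly
// predicted address was still present in the data cache.
void on_demand_load(TimelinessEntry& e, bool hit) {
    if (e.state == PfState::Sent) {
        e.distance++;                      // slow prefetch: look further ahead
    } else if (e.state == PfState::Received && !hit) {
        if (e.distance > 1) e.distance--;  // too early (evicted): pull back
    }
    e.state = PfState::Idle;               // start the next prefetch epoch
}

int main() {
    TimelinessEntry e;
    on_prefetch_sent(e);
    on_demand_load(e, /*hit=*/false);      // load arrived while in state 01
    std::printf("distance after a slow prefetch: %d\n", e.distance);  // prints 2
    return 0;
}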
FOA Prefetching
Prefetch Table entry (example): Load PC = 0x253ad, Address = 0xfcfeed, Offset = 8, Confidence = 2, Tid = 3, Distance = 2, PF Type = 0, PF State = 00.
[Figure: the current PC indexes the Prefetch Table; the miss address of a thread (e.g., thread index 4) is compared against the stored Address to learn and confirm the Offset, and a prefetch address formed from the Address, the Offset multiplied by the number of threads, and the prefetch Distance is pushed onto the Prefetch Queue with Prefetch Enable set.]
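The following is a simplified software model of the datapath above, as an illustration only: it tracks the delta between successive miss addresses at a load PC and skips the per-thread-index normalization visible in the figure; the table size, confidence threshold, and the name FoaPrefetcher are assumptions.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>

struct FoaEntry {
    uint64_t addr       = 0;   // last miss address seen at this load PC
    int64_t  offset     = 0;   // fixed offset being learned
    int      confidence = 0;
    int      distance   = 1;   // set by the timeliness logic
};

class FoaPrefetcher {
public:
    // On a data-cache miss at `pc`, return a prefetch address once the
    // offset has been confirmed.
    std::optional<uint64_t> on_miss(uint64_t pc, uint64_t miss_addr, int num_threads) {
        FoaEntry& e = table_[pc];
        int64_t d = (int64_t)miss_addr - (int64_t)e.addr;
        if (e.addr != 0 && d == e.offset) e.confidence++;
        else { e.offset = d; e.confidence = 0; }
        e.addr = miss_addr;
        if (e.confidence >= 2) {
            // The warp's next data lies offset * numThreads further on;
            // prefetch `distance` such steps ahead to stay timely.
            return miss_addr + (uint64_t)(e.offset * num_threads * e.distance);
        }
        return std::nullopt;
    }
private:
    std::unordered_map<uint64_t, FoaEntry> table_;  // 32 entries in the evaluation
};

int main() {
    FoaPrefetcher pf;
    for (uint64_t a : {0x1000ull, 0x1008ull, 0x1010ull, 0x1018ull}) {
        if (auto p = pf.on_miss(0x253ad, a, /*num_threads=*/32))
            std::printf("prefetch 0x%llx\n", (unsigned long long)*p);
    }
    return 0;
}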
Graphics Memory Accesses
Three major types of accesses:
• Fixed Offset Access (FOA): data accessed by adjacent threads in an array is separated by a fixed offset.
• Thread Invariant Address (TIA): the same address is accessed by all threads in a warp.
• Texture Access: addresses accessed during texturing operations.
[Figure: breakdown of memory accesses into FOA, TIA, Texture, and Others for ES, HS, RS, IT, ST, WT, WF, and the mean.]
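As an illustration only (plain C++ stand-ins rather than GPU code, with `tid` playing the role of a thread index; none of this is from the paper), the two prefetchable access classes look like:

#include <cstddef>
#include <cstdio>

// Fixed Offset Access (FOA): adjacent threads read addresses separated by a
// fixed offset (here sizeof(float)), so later addresses are predictable.
float foa_load(const float* in, std::size_t base, std::size_t tid) {
    return in[base + tid];
}

// Thread Invariant Address (TIA): every thread in the warp reads the same
// address (e.g., a uniform parameter), so one prefetch covers the warp.
float tia_load(const float* params, std::size_t k) {
    return params[k];
}

// Texture accesses, the third class, are generated by the texturing path and
// are not one of the two patterns targeted by the prefetchers in this talk.

int main() {
    float in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float s = 0;
    for (std::size_t tid = 0; tid < 4; ++tid) s += foa_load(in, 0, tid);  // FOA pattern
    std::printf("foa sum = %.1f, tia value = %.1f\n", s, tia_load(in, 2));
    return 0;
}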
Thread Invariant Addresses
[Figure: TIA prefetching across loop iterations. Each iteration issues loads at PC0-PC3, one of which is a thread-invariant (constant) load. Its prefetch is slow in iteration 1; the TIA Prefetch Table (Pf PC, Address, Slow bit, e.g., PC1, 0xabc) records it, and the prefetch is triggered from an earlier PC in later iterations, still slow in iteration 3 and timely by iteration 5. Prefetch addresses are pushed onto the Prefetch Queue.]
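A rough C++ sketch of the idea follows. The table fields mirror the figure, but the hoisting policy, the names, and the issue hook are assumptions, not the paper's exact mechanism.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// One TIA Prefetch Table entry, mirroring the fields in the figure.
struct TiaEntry {
    uint64_t    addr  = 0;      // thread-invariant address to prefetch
    bool        slow  = false;  // last prefetch returned after the load
    std::size_t hoist = 0;      // how many loads earlier the prefetch fires
};

class TiaPrefetcher {
public:
    // Record a thread-invariant load whose prefetch arrived late.
    void record_slow(uint64_t load_pc, uint64_t addr) {
        TiaEntry& e = table_[load_pc];
        e.addr = addr;
        e.slow = true;
    }
    // At the start of each iteration, re-issue the recorded prefetches; any
    // prefetch that was still slow is hoisted one load earlier in the loop.
    void next_iteration() {
        for (auto& [load_pc, e] : table_) {
            if (e.slow) { e.hoist++; e.slow = false; }
            issue_prefetch(load_pc - e.hoist, e.addr);   // schematic PC arithmetic
        }
    }
private:
    void issue_prefetch(uint64_t trigger_pc, uint64_t addr) {
        std::printf("prefetch 0x%llx when PC 0x%llx executes\n",
                    (unsigned long long)addr, (unsigned long long)trigger_pc);
    }
    std::unordered_map<uint64_t, TiaEntry> table_;
};

int main() {
    TiaPrefetcher pf;
    pf.record_slow(/*load_pc=*/0x20, /*addr=*/0xabc);   // slow in iteration 1
    pf.next_iteration();                                // hoisted in iteration 2
    return 0;
}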
Experimental Evaluation
• Benchmarks
o Mesa driver, SPEC Viewperf traces, MV5 GPGPU benchmarks
• Performance evaluation
o MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency
o Prefetcher: 32 entries; prefetch latency: 10 cycles
o D-cache: 64 KB, 8-way, 32 bytes per line
• Power evaluation
o Dynamic power from an analytical model (Hong et al., ISCA '10)
o Static power from published numbers and tools: FPU, SFU, caches, register file, fetch/decode/schedule
APOGEE Performance
[Figure: normalized performance of SIMT_32, STRIDE_4, MTA_4, and APOGEE_4 across the graphics benchmarks (HS, WT, WF, IT, ST, ES, RS), the GPGPU benchmarks (FFT, FT, HP, MS, SP), and the geometric mean.]
On average, APOGEE provides a 19% speedup.
APOGEE Performance - II
[Figure: normalized performance of SIMT_32, MTA_32, and APOGEE_4 across the same graphics and GPGPU benchmarks and the geometric mean.]
MTA with 32 warps is within 3% of APOGEE
Performance and Power
[Figure: normalized performance versus normalized power for SIMT, MTA, and APOGEE as the number of warps varies from 1 to 32.]
• 20% speedup over SIMT, with ~14K fewer registers.
• Around 51% improvement in performance/Watt.
Prefetcher Accuracy
[Figure: percentage of correct prefetch address predictions for each graphics and GPGPU benchmark and the mean.]
APOGEE predicts prefetch addresses with 93.5% accuracy.
D-Cache Miss Rate
[Figure: percentage reduction in D-cache miss rate for MTA_4, MTA_32, and APOGEE across the graphics and GPGPU benchmarks.]
Prefetching results in an 80% reduction in the D-cache miss rate.
Conclusion
• The use of high degrees of multithreading on GPUs is inefficient.
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
o Adapts to the timeliness of prefetching
o Prefetches for two different access patterns (FOA and TIA)
• Over 12 graphics and GPGPU benchmarks: 20% improvement in performance and 51% improvement in performance/Watt.
Questions?