APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Page 1: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


APOGEE: Adaptive Prefetching on GPU for Energy Efficiency

Ankit Sethia (1), Ganesh Dasika (2), Mehrzad Samadi (1), Scott Mahlke (1)

(1) University of Michigan   (2) ARM R&D Austin

Page 2: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Introduction

• High throughput – 1 TeraFLOP
• High energy efficiency
• Programmability

Can we push efficiency further?
• High performance on mobile
• Lower cost in supercomputers

Page 3: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Background

[Figure: baseline SIMT core with warp scheduler, register file banks, ALUs, SFUs, data cache, and a memory controller out to global memory]

• Hide memory latency with fine-grained multithreading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead: scheduling, divergence

Page 4: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Motivation - I

Too many warps decrease efficiency

[Chart: normalized speedup (left axis) and normalized increase in power (right axis) for 2, 4, 8, 16, and 32 warps]

Page 5: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Motivation - II

[Chart: normalized register file access rate and normalized accesses per register vs. number of warps (1 to 32)]

Hardware added to hide memory latency is under-utilized.

Page 6: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


APOGEE: Overview

[Figure: SIMT core with a prefetcher added alongside the data cache, bringing data in from global memory]

• Prefetch data from global memory into the cache.
• Less latency left to hide.
• Less multithreading needed.
• Smaller register context.

Page 7: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Traditional CPU Prefetching

[Figure: 32 warps W0–W31, each accessing a separate 32-element block of an array (W0: 0–31, W1: 32–63, …, W31: 4064–4095)]

• Stride prefetching: consecutive dynamic instances of a load come from different warps, so the observed deltas (e.g., 64, -32, 64) never settle into a stable stride.
• Next-line prefetching: the next line cannot be prefetched in time.

Neither traditional prefetching technique works.
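To make the stride-detection failure concrete, here is a minimal software sketch (not the hardware design; the interleaving and addresses are assumptions chosen to reproduce the 64, -32, 64 deltas above): a per-PC stride detector watching one load whose dynamic instances are interleaved across warps never sees a repeating delta.

```cpp
// Sketch: classic per-PC stride detection applied to a load whose dynamic
// instances are interleaved across warps (assumed order W0, W2, W1, W3, ...).
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int64_t> addrs = {0, 64, 32, 96, 64, 128};  // element indices touched
    int64_t prevAddr = addrs[0];
    int64_t prevDelta = 0;
    for (size_t i = 1; i < addrs.size(); ++i) {
        int64_t delta = addrs[i] - prevAddr;
        bool stable = (delta == prevDelta);   // stride is "found" only when the delta repeats
        std::printf("delta=%lld stable=%d\n", (long long)delta, (int)stable);
        prevDelta = delta;
        prevAddr = addrs[i];
    }
    // Deltas alternate 64, -32, 64, ... so 'stable' never becomes true.
    return 0;
}
```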

Page 8: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Fixed Offset Address Prefetching

[Figure: with fewer warps, each warp executes multiple loop iterations. For example, Warp 0 covers blocks 0–31, 64–95, 128–159, … and Warp 1 covers 32–63, 96–127, 160–191, …, so a warp's successive accesses are a constant 64 elements apart]

Fewer warps mean each warp runs more iterations, so its accesses are separated by a fixed offset (stride × numWarps).
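As a hypothetical worked example using the figure's numbers: adjacent blocks are 32 elements apart, so with 2 resident warps a warp's successive accesses are 32 × 2 = 64 elements apart, and with 4 warps they would be 32 × 4 = 128 elements apart; this constant per-warp offset is what the prefetcher can learn and extrapolate.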

Page 9: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Timeliness in Prefetching

[Figure: timeline of a load showing timely, early, and slow prefetches relative to when the load executes]

A 2-bit state per prefetch-table entry tracks timeliness:
• 00: idle, no prefetch outstanding
• 01: prefetch sent to memory
• 10: prefetch received from memory; a new load returns the entry to 00

• Increase the prefetch distance if the next load arrives while the entry is still in state 01 (the prefetch was slow).
• Decrease the prefetch distance if a correctly prefetched address in state 10 still misses (the data arrived too early and was evicted before use).
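A minimal sketch of this timeliness tracking, assuming one entry per load PC and a software model of what the hardware state machine does (the exact encodings and increments are assumptions):

```cpp
// Sketch: 2-bit timeliness state machine per prefetch-table entry.
#include <algorithm>

enum PrefetchState { IDLE = 0, SENT = 1, RECEIVED = 2 };   // states 00, 01, 10

struct TimelinessEntry {
    PrefetchState state = IDLE;
    int distance = 1;                      // how many offsets ahead to prefetch

    void onPrefetchSent()     { state = SENT; }       // 00 -> 01
    void onPrefetchReceived() { state = RECEIVED; }    // 01 -> 10

    // Called when the demand load that the prefetch targeted executes.
    void onLoad(bool loadMissed) {
        if (state == SENT) {
            // Prefetch still outstanding: it was slow, so prefetch further ahead.
            ++distance;
        } else if (state == RECEIVED && loadMissed) {
            // Data arrived but was gone before it was needed (too early): back off.
            distance = std::max(1, distance - 1);
        }
        state = IDLE;                      // a new load restarts the cycle (-> 00)
    }
};
```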

Page 10: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


FOA Prefetching

Prefetch table fields, with an example entry:
• Load PC: 0x253ad
• Address: 0xfcfeed
• Offset: 8
• Confidence: 2
• Tid: 3
• Distance: 2
• PF Type: 0
• PF State: 00

[Figure: FOA prefetcher datapath. The current PC indexes the prefetch table; the miss address of a thread is compared against the stored address and offset to update the confidence, and once the confidence exceeds its threshold a prefetch address (miss address plus offset × #threads) is pushed into the prefetch queue]
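A software sketch of the FOA prefetcher behavior described by the table and datapath above. The field names follow the slide; the confidence threshold, address units, and the per-thread-index normalization are assumptions:

```cpp
// Sketch: fixed-offset-address (FOA) prefetching, one table entry per load PC.
#include <cstdint>
#include <queue>
#include <unordered_map>

struct FoaEntry {
    bool     valid = false;
    uint64_t lastAddr = 0;    // Address field
    int64_t  offset = 0;      // Offset field (per-thread stride)
    int      confidence = 0;  // Confidence field
    int      lastTid = 0;     // Tid field
    int      distance = 1;    // Distance field (tuned by the timeliness state machine)
};

struct FoaPrefetcher {
    std::unordered_map<uint64_t, FoaEntry> table;  // indexed by load PC
    std::queue<uint64_t> prefetchQueue;
    int numThreads;                                 // threads cooperating on the array

    explicit FoaPrefetcher(int threads) : numThreads(threads) {}

    void onMiss(uint64_t pc, uint64_t missAddr, int tid) {
        FoaEntry &e = table[pc];
        if (e.valid && tid != e.lastTid) {
            // Per-thread offset implied by this miss relative to the recorded one.
            int64_t observed = (int64_t)(missAddr - e.lastAddr) / (tid - e.lastTid);
            if (observed == e.offset) ++e.confidence;        // same offset seen again
            else { e.offset = observed; e.confidence = 0; }  // retrain on a new offset
        }
        e.valid = true;
        e.lastAddr = missAddr;
        e.lastTid = tid;

        if (e.confidence >= 2) {  // assumed confidence threshold
            // Prefetch 'distance' full iterations ahead: offset * #threads per iteration.
            prefetchQueue.push(missAddr + (uint64_t)(e.offset * numThreads * e.distance));
        }
    }
};
```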

Page 11: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Graphics Memory Accesses

Three major types of accesses:
• Fixed Offset Access (FOA): data accessed by adjacent threads in an array is a fixed offset apart.
• Thread Invariant Address (TIA): the same address is accessed by all threads in a warp.
• Texture Access: addresses accessed during texturing operations.

[Chart: breakdown of memory accesses into FOA, TIA, Texture, and Others for the graphics benchmarks ES, HS, RS, IT, ST, WT, WF, and their mean]

Page 12: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Thread Invariant Addresses

[Figure: TIA prefetching across loop iterations. The prefetch for a thread-invariant (constant) load is initially slow; the TIA prefetch table (Pf PC, Address, Slow bit, e.g., PC1, 0xabc, slow bit set) records the load PC at which the prefetch is issued, and while the slow bit stays set the issue point is hoisted to an earlier load PC in subsequent iterations until the prefetch becomes timely]
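A minimal sketch of the TIA mechanism, assuming the candidate issue points are the program-order loads preceding the thread-invariant load and that the prefetch is hoisted one load earlier each time it is observed to be slow (the single-entry table mirrors the slide's Pf PC / Address / Slow bit fields):

```cpp
// Sketch: thread-invariant-address (TIA) prefetching with a one-entry table.
#include <cstdint>
#include <vector>

struct TiaEntry {
    uint64_t prefetchPc = 0;  // Pf PC: load PC at which the prefetch is issued
    uint64_t address = 0;     // Address: thread-invariant address to prefetch
    bool     slow = false;    // Slow bit: prefetch did not return in time
};

struct TiaPrefetcher {
    std::vector<uint64_t> loadPcsInOrder;  // loads before the constant load (PC0, PC1, PC2, ...)
    TiaEntry entry;

    // If this load PC is the chosen issue point, emit a prefetch for the
    // recorded thread-invariant address.
    bool shouldPrefetch(uint64_t pc, uint64_t &outAddr) const {
        if (pc == entry.prefetchPc) { outAddr = entry.address; return true; }
        return false;
    }

    // Called once per iteration after observing whether the prefetch for the
    // constant load returned before the load needed it.
    void updateTimeliness(bool wasTimely) {
        entry.slow = !wasTimely;
        if (!entry.slow) return;
        // Still slow: hoist the issue point to the previous load in program order.
        for (size_t i = 1; i < loadPcsInOrder.size(); ++i) {
            if (loadPcsInOrder[i] == entry.prefetchPc) {
                entry.prefetchPc = loadPcsInOrder[i - 1];
                return;
            }
        }
    }
};
```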

Page 13: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Experimental Evaluation

Benchmarks
• Mesa driver, SPEC-Viewperf traces, MV5 GPGPU benchmarks

Performance evaluation
• MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency
• Prefetcher: 32 entries
• Prefetch latency: 10 cycles
• D-cache: 64 kB, 8-way, 32 bytes per line

Power evaluation
• Dynamic power from an analytical model (Hong et al., ISCA 2010)
• Static power from published numbers and tools: FPU, SFU, caches, register file, fetch/decode/schedule

Page 14: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


APOGEE Performance

[Chart: normalized performance of SIMT_32, STRIDE_4, MTA_4, and APOGEE_4 across the graphics benchmarks (HS, WT, WF, IT, ST, ES, RS), the GPGPU benchmarks (FFT, FT, HP, MS, SP), and the geometric mean]

On average, APOGEE achieves a 19% speedup.

Page 15: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


APOGEE Performance - II

[Chart: normalized performance of SIMT_32, MTA_32, and APOGEE_4 across the graphics and GPGPU benchmarks and the geometric mean]

MTA with 32 warps is within 3% of APOGEE

Page 16: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Performance and Power

[Chart: normalized performance vs. normalized power for SIMT, MTA, and APOGEE as the number of warps varies from 1 to 32]

• 20% speedup over SIMT, with 14K fewer registers.
• Around 51% improvement in performance/Watt.

Page 17: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Prefetcher Accuracy

[Chart: percentage of correct prefetch predictions across the graphics and GPGPU benchmarks and the geometric mean]

APOGEE has 93.5% accuracy in prediction

Page 18: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


D-Cache Miss Rate

[Chart: % reduction in D-cache miss rate for MTA_4, MTA_32, and APOGEE across the graphics and GPGPU benchmarks]

Prefetching results in an 80% reduction in the D-cache miss rate.

Page 19: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Conclusion

• Relying on a high degree of multithreading on GPUs is inefficient.
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
  o Adapts the timeliness of prefetching
  o Prefetches for two different access patterns (FOA and TIA)
• Across 12 graphics and GPGPU benchmarks: 20% improvement in performance and 51% improvement in performance/Watt.

Page 20: APOGEE: Adaptive Prefetching on GPU for Energy Efficiency


Questions?