Teaching Old Caches New Tricks: Predictor Virtualization
Andreas Moshovos, Univ. of Toronto
Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)

Posted on 24-Feb-2016

Page 1: Teaching Old Caches New Tricks: Predictor  Virtualization

Teaching Old Caches New Tricks: Predictor Virtualization

Andreas Moshovos, Univ. of Toronto

Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)

Page 2: Teaching Old Caches New Tricks: Predictor  Virtualization

Prediction: The Way Forward

CPU Predictors:
• Prefetching
• Branch Target and Direction
• Cache Replacement
• Cache Hit

• Application footprints grow
• Predictors need to scale to remain effective
• Ideally: fast, accurate predictions
• Can't have this with conventional technology

Prediction has proven useful, in many forms. Which to choose?

Page 3: Teaching Old Caches New Tricks: Predictor  Virtualization

The Problem with Conventional Predictors

Predictor Virtualization: approximate large, accurate, fast predictors

[Diagram: a predictor trades off hardware cost, accuracy, and latency]

• What we have: small, fast, not-so-accurate
• What we want: small, fast, accurate

Page 4: Teaching Old Caches New Tricks: Predictor  Virtualization

Why Now?

Extra resources: CMPs with large caches

[Diagram: 4-core CMP; each CPU has an I$ and D$, sharing a 10-100 MB L2 cache backed by physical memory]

Page 5: Teaching Old Caches New Tricks: Predictor  Virtualization

Predictor Virtualization (PV)

Use the on-chip cache to store metadata. Reduce the cost of dedicated predictors.

[Diagram: the same 4-core CMP; predictor metadata lives in the shared L2 cache]

Page 6: Teaching Old Caches New Tricks: Predictor  Virtualization

Predictor Virtualization (PV)

Use the on-chip cache to store metadata. Implement otherwise impractical predictors.

[Diagram: the same 4-core CMP; predictor metadata lives in the shared L2 cache]

Page 7: Teaching Old Caches New Tricks: Predictor  Virtualization

Research Overview

• PV breaks the conventional predictor design trade-offs
– Lowers cost of adoption
– Facilitates implementation of otherwise impractical predictors
• Freeloads on existing resources
– Adaptive demand
• Key design challenge
– How to compensate for the longer latency to metadata
• PV in action
– Virtualized "Spatial Memory Streaming"
– Virtualized Branch Target Buffers

Page 8: Teaching Old Caches New Tricks: Predictor  Virtualization

Talk Roadmap

• PV Architecture
• PV in Action
– Virtualizing "Spatial Memory Streaming"
– Virtualizing Branch Target Buffers
• Conclusions

Page 9: Teaching Old Caches New Tricks: Predictor  Virtualization

PV Architecture

[Diagram: CPU with I$/D$ over the L2 cache and physical memory; an Optimization Engine requests predictions from a dedicated Predictor Table, which PV virtualizes into the memory hierarchy]

Page 10: Teaching Old Caches New Tricks: Predictor  Virtualization

PV Architecture

[Diagram: the Optimization Engine now requests predictions through a PVProxy with a small PVCache; the full PVTable lives in the memory hierarchy]

PV requires access to the L2, on the back side of the L1, which is not as performance critical.
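The PVProxy organization above can be sketched in software: a small on-chip PVCache serves the common case, and misses fall back to the metadata stored in the memory hierarchy. This is an illustrative model under stated assumptions (a dict standing in for the L2/memory backing store, an LRU PVCache, hit/miss counters), not the hardware design.

```python
# Illustrative model of the PV proxy: a tiny dedicated PVCache in front of
# predictor metadata that actually lives in the memory hierarchy. The
# backing-store dict and LRU policy are assumptions of this sketch.
from collections import OrderedDict

class PVProxy:
    def __init__(self, pvcache_entries=8, backing_store=None):
        self.pvcache = OrderedDict()          # entry index -> metadata (LRU order)
        self.capacity = pvcache_entries
        self.backing = backing_store or {}    # stands in for the L2 / memory
        self.hits = self.misses = 0

    def lookup(self, index):
        """Return predictor metadata for `index`, or None if unknown."""
        if index in self.pvcache:             # common case: PVCache hit
            self.pvcache.move_to_end(index)
            self.hits += 1
            return self.pvcache[index]
        self.misses += 1                      # infrequent: go to L2 / memory
        meta = self.backing.get(index)
        if meta is not None:
            self._fill(index, meta)
        return meta

    def update(self, index, metadata):
        """Install or refresh metadata; write-through to the backing store."""
        self._fill(index, metadata)
        self.backing[index] = metadata

    def _fill(self, index, meta):
        self.pvcache[index] = meta
        self.pvcache.move_to_end(index)
        if len(self.pvcache) > self.capacity:  # evict the LRU entry
            self.pvcache.popitem(last=False)
```

The hit/miss counters mirror the common/infrequent/rare case split discussed on the next slide.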

Page 11: Teaching Old Caches New Tricks: Predictor  Virtualization

PV Challenge: Prediction Latency

[Diagram: the same PVProxy/PVCache organization, annotated with access latencies]

• Common case: PVCache hit
• Infrequent: L2 access, latency 12-18 cycles
• Rare: physical memory access, latency 400 cycles

Key: how to pack metadata into L2 cache blocks to amortize costs.

Page 12: Teaching Old Caches New Tricks: Predictor  Virtualization

To Virtualize or Not To Virtualize

• Predictors redesigned with PV in mind
• Overcoming the latency challenge
– Metadata reuse
• Intrinsic: one entry used for multiple predictions
• Temporal: one entry reused in the near future
• Spatial: one miss amortized by several subsequent hits
– Metadata access pattern predictability
• Predictor metadata prefetching
• Looks similar to designing caches, BUT:
– Does not have to be correct all the time
– Time limit on usefulness

Page 13: Teaching Old Caches New Tricks: Predictor  Virtualization

PV in Action

• Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]
• Within 1% of the original performance
• Hardware cost from 60 KB down to < 1 KB
• Branch prediction: virtualize branch target buffers
• Increase the perceived BTB capacity
• Up to 12.75% IPC improvement with 8% hardware overhead

Page 14: Teaching Old Caches New Tricks: Predictor  Virtualization

Spatial Memory Streaming [ISCA06]

[Diagram: memory regions and their spatial access patterns, e.g. 1100001010001..., recorded in a Pattern History Table]

[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
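The spatial patterns above can be modeled as bit vectors: each bit marks one cache block within a memory region, and on a trigger access the stored pattern expands into prefetch addresses. A minimal sketch, assuming illustrative region and block sizes (the real SMS parameters are in the [ISCA06] paper):

```python
# Decoding an SMS-style spatial pattern: each bit of the pattern marks one
# cache block within a region; on a trigger access, set bits become
# prefetch addresses. Sizes below are assumptions for illustration.
BLOCK_SIZE = 64          # bytes per cache block (assumed)
REGION_BLOCKS = 32       # blocks per spatial region -> 2 KB regions (assumed)

def region_base(addr):
    """Align an address down to the start of its spatial region."""
    return addr - (addr % (BLOCK_SIZE * REGION_BLOCKS))

def record_pattern(addrs):
    """Build the spatial bit pattern from one generation of accesses."""
    base = region_base(addrs[0])
    pattern = 0
    for a in addrs:
        pattern |= 1 << ((a - base) // BLOCK_SIZE)
    return pattern

def prefetch_addrs(trigger_addr, pattern):
    """Expand a stored pattern into block addresses to prefetch."""
    base = region_base(trigger_addr)
    return [base + i * BLOCK_SIZE
            for i in range(REGION_BLOCKS) if (pattern >> i) & 1]
```

For example, accesses to blocks 0 and 2 of a region yield the pattern 0b101, and any later trigger access in that region prefetches those two blocks.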

Page 15: Teaching Old Caches New Tricks: Predictor  Virtualization

Spatial Memory Streaming (SMS)

[Diagram: a Detector (~1 KB) observes the data access stream and records patterns; a Predictor (~60 KB) holds the patterns and, on a trigger access, issues prefetches. PV virtualizes the ~60 KB predictor table.]

Page 16: Teaching Old Caches New Tricks: Predictor  Virtualization

Virtualizing SMS

[Diagram: the virtual table holds 1K sets x 11 ways of (tag, pattern) entries; each L2 cache line packs the 11 ways of one set, with a few bytes unused. The PVCache keeps 8 sets x 11 ways on chip.]

Region-level prefetching is naturally tolerant of longer prediction latencies; simply pack predictor entries spatially.
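The packing above can be sketched as an address mapping: one 64-byte L2 line holds all 11 ways of one virtual-table set, so a single L2 fetch brings in the whole set. The entry size, base address, and index hash below are assumptions of this sketch, not the paper's exact layout:

```python
# Packing the virtualized SMS pattern table into L2 cache lines: each line
# holds every way of one virtual-table set. Entry size, metadata base
# address, and the index hash are illustrative assumptions.
LINE_BYTES  = 64
WAYS        = 11           # (tag, pattern) entries per set, per the slide
ENTRY_BYTES = 5            # assumed: 11 * 5 = 55 bytes used, 9 bytes unused
VTABLE_SETS = 1024
VTABLE_BASE = 0x40000000   # assumed reserved region for predictor metadata

def set_index(region_addr):
    """Virtual-table set for a region address (simple modulo hash, assumed)."""
    return (region_addr >> 11) % VTABLE_SETS   # 2 KB regions assumed

def l2_line_addr(set_idx):
    """Address of the L2 line holding every way of the given set."""
    return VTABLE_BASE + set_idx * LINE_BYTES

def entry_offset(way):
    """Byte offset of one way's (tag, pattern) entry within the line."""
    return way * ENTRY_BYTES
```

Because consecutive sets map to consecutive lines, one PVCache miss fetches all 11 candidate entries at once, which is the spatial-reuse amortization the latency discussion calls for.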

Page 17: Teaching Old Caches New Tricks: Predictor  Virtualization

Experimental Methodology

• SimFlex: full-system, cycle-accurate simulator
• Baseline processor configuration
– 4-core CMP, OoO
– L1D/L1I: 64 KB, 4-way set-associative
– UL2: 8 MB, 16-way set-associative
• Commercial workloads
– Web servers: Apache and Zeus
– TPC-C: DB2 and Oracle
– TPC-H: several queries
– Developed by the Impetus group at CMU (Anastasia Ailamaki & Babak Falsafi, PIs)

Page 18: Teaching Old Caches New Tricks: Predictor  Virtualization

SMS Performance Potential

[Chart: percentage of L1 read misses (covered / uncovered / overpredictions) for Apache, Oracle, and Qry 17, across pattern-table sizes from infinite down to 8 sets x 11 ways]

The conventional predictor degrades with limited storage.

Page 19: Teaching Old Caches New Tricks: Predictor  Virtualization

Virtualized SMS

[Chart: speedup of the virtualized prefetcher vs. the original; higher is better]

Hardware cost: original prefetcher ~60 KB; virtualized prefetcher < 1 KB.

Page 20: Teaching Old Caches New Tricks: Predictor  Virtualization

Impact of Virtualization on L2 Requests

[Chart: percentage increase in L2 requests, between 0 and 45%, for PV-8 and PV-16 on Apache, Oracle, and Qry 17]

Page 21: Teaching Old Caches New Tricks: Predictor  Virtualization

Impact of Virtualization on Off-Chip Bandwidth

[Chart: off-chip bandwidth increase, under 5%, split into L2 misses and L2 write-backs, for PV-8 and PV-16 on Apache, Oracle, and Qry 17]

Page 22: Teaching Old Caches New Tricks: Predictor  Virtualization

PV in Action

• Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]
• Same performance
• Hardware cost from 60 KB down to < 1 KB
• Branch prediction: virtualize branch target buffers
• Increase the perceived BTB capacity
• Up to 12.75% IPC improvement with 8% hardware overhead

Page 23: Teaching Old Caches New Tricks: Predictor  Virtualization

The Need for Larger BTBs

[Chart: branch MPKI vs. number of BTB entries; lower is better]

Commercial applications benefit from large BTBs.

Page 24: Teaching Old Caches New Tricks: Predictor  Virtualization

Virtualizing BTBs: Phantom-BTB

[Diagram: the PC indexes a small, fast dedicated BTB backed by a large, slow virtual table in the L2 cache]

• Latency challenge: BTB lookups do not tolerate longer prediction latencies
• Solution: predictor metadata prefetching
• The virtual table is decoupled from the BTB
• A virtual table entry is a temporal group

Page 25: Teaching Old Caches New Tricks: Predictor  Virtualization

Facilitating Metadata Prefetching

• Intuition: programs mostly follow similar paths

[Diagram: a detection path and a similar subsequent path through the code]

Page 26: Teaching Old Caches New Tricks: Predictor  Virtualization

Temporal Groups

Past misses are a good indicator of future misses; the dedicated predictor acts as a filter.
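The filtering idea above can be sketched as follows: only branches that miss in the dedicated BTB are appended to the current temporal group, and a full group is installed in the virtual table keyed by the miss that opened it. The group size follows the later methodology slide (6 entries); the dict-based virtual table and the exact keying are assumptions of this sketch:

```python
# Sketch of temporal-group generation in Phantom-BTB: the dedicated BTB
# filters the branch stream; only its misses enter the current temporal
# group. A full group is installed in the virtual table (in the L2),
# keyed by the miss that opened it. Details beyond the 6-entry group
# size are illustrative assumptions.
GROUP_SIZE = 6

class TemporalGroupGenerator:
    def __init__(self):
        self.virtual_table = {}   # trigger PC -> list of (pc, target)
        self.group = []
        self.trigger_pc = None

    def on_btb_miss(self, pc, target):
        """Called only on BTB misses; BTB hits cause no PBTB activity."""
        if self.trigger_pc is None:
            self.trigger_pc = pc               # first miss opens the group
        self.group.append((pc, target))
        if len(self.group) == GROUP_SIZE:      # full group: install in L2
            self.virtual_table[self.trigger_pc] = self.group
            self.group, self.trigger_pc = [], None
```

Because hits never reach this path, the generator's activity scales with the miss rate, matching the "pay-as-you-go" claim later in the talk.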

Page 27: Teaching Old Caches New Tricks: Predictor  Virtualization

Fetch Trigger

A preceding miss triggers the temporal group fetch; the group is not a precise region around the miss.

Page 28: Teaching Old Caches New Tricks: Predictor  Virtualization

Temporal Group Prefetching

Page 29: Teaching Old Caches New Tricks: Predictor  Virtualization

Temporal Group Prefetching (cont.)

Page 30: Teaching Old Caches New Tricks: Predictor  Virtualization

Phantom-BTB Architecture

[Diagram: the PC feeds the BTB; a Temporal Group Generator and a Prefetch Engine sit between the BTB and the L2 cache]

• Temporal Group Generator: generates and installs temporal groups in the L2 cache
• Prefetch Engine: prefetches temporal groups

Page 31: Teaching Old Caches New Tricks: Predictor  Virtualization

Temporal Group Generation

[Diagram: the branch stream probes the BTB; misses flow into the Temporal Group Generator, which installs groups in the L2 cache]

BTB misses generate temporal groups; BTB hits do not generate any PBTB activity.

Page 32: Teaching Old Caches New Tricks: Predictor  Virtualization

Branch Metadata Prefetching

[Diagram: a BTB miss sends the PC to the Prefetch Engine, which fetches the matching temporal group from the virtual table in the L2 into a Prefetch Buffer; prefetch buffer hits feed predictions back alongside the BTB]

BTB misses trigger metadata prefetches; lookups proceed in parallel in the BTB and the prefetch buffer.
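The lookup path above can be sketched as: probe the dedicated BTB and the prefetch buffer (modeled sequentially here, in parallel in hardware), and on a miss in both, prefetch the temporal group associated with the missing PC from the virtual table. The dict-based tables and the promotion of prefetch-buffer hits into the BTB are assumptions of this sketch:

```python
# Sketch of the Phantom-BTB lookup path. The virtual_table maps a trigger
# PC to its temporal group of (pc, target) pairs, as produced by the
# temporal group generator. Table structures are illustrative assumptions.
class PhantomBTB:
    def __init__(self, virtual_table):
        self.btb = {}                 # dedicated BTB: pc -> predicted target
        self.prefetch_buffer = {}     # recently prefetched branch metadata
        self.virtual_table = virtual_table

    def predict(self, pc):
        """Return a predicted target, or None on a miss in both structures."""
        if pc in self.btb:                     # dedicated BTB hit
            return self.btb[pc]
        if pc in self.prefetch_buffer:         # prefetch buffer hit
            target = self.prefetch_buffer[pc]
            self.btb[pc] = target              # promote into the BTB (assumed)
            return target
        self._prefetch(pc)                     # miss: fetch the temporal group
        return None

    def _prefetch(self, pc):
        """Pull the temporal group triggered by this miss into the buffer."""
        for miss_pc, target in self.virtual_table.get(pc, []):
            self.prefetch_buffer[miss_pc] = target
```

The first miss on a path pays the prefetch; the following branches of the same temporal group then hit in the prefetch buffer, which is how PBTB hides the L2 latency.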

Page 33: Teaching Old Caches New Tricks: Predictor  Virtualization

Phantom-BTB Advantages

• "Pay-as-you-go" approach
– Practical design
– Increases the perceived BTB capacity
– Dynamic allocation of resources
• Branch metadata allocated on demand
– On-the-fly adaptation to application demands
• Branch metadata generation and retrieval performed only on BTB misses
– Only if the application sees misses
• Metadata survives in the L2 as long as there is sufficient capacity and demand

Page 34: Teaching Old Caches New Tricks: Predictor  Virtualization

Experimental Methodology

• Flexus cycle-accurate, full-system simulator
• Uniprocessor, OoO
– 1K-entry conventional BTB
– 64 KB 2-way ICache/DCache
– 4 MB 16-way L2 cache
• Phantom-BTB
– 64-entry prefetch buffer
– 6-entry temporal groups
– 4K-entry virtual table
• Commercial workloads

Page 35: Teaching Old Caches New Tricks: Predictor  Virtualization

PBTB vs. Conventional BTBs

[Chart: speedup over the baseline BTB; higher is better]

Performance within 1% of a 4K-entry BTB with 3.6x less storage.

Page 36: Teaching Old Caches New Tricks: Predictor  Virtualization

Phantom-BTB with Larger Dedicated BTBs

[Chart: speedup with larger dedicated BTBs; higher is better]

PBTB remains effective with larger dedicated BTBs.

Page 37: Teaching Old Caches New Tricks: Predictor  Virtualization

Increase in L2 MPKI

[Chart: L2 MPKI; lower is better]

Marginal increase in L2 misses.

Page 38: Teaching Old Caches New Tricks: Predictor  Virtualization

Increase in L2 Accesses

[Chart: L2 accesses per kilo-instruction; lower is better]

PBTB follows application demand for BTB capacity.

Page 39: Teaching Old Caches New Tricks: Predictor  Virtualization

Summary

• Predictor metadata stored in the memory hierarchy
– Benefits
• Reduces dedicated predictor resources
• Emulates large predictor tables for increased predictor accuracy
– Why now? Large on-chip caches, CMPs, and the need for large predictors
– Predictor virtualization advantages
• Predictor adaptation
• Metadata sharing
• Moving forward
– Virtualize other predictors
– Expose the predictor interface to the software level