Meeting Midway: Improving CMP Performance with Memory-Side Prefetching Praveen Yedlapalli , Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam The Pennsylvania State University


Post on 22-Feb-2016

Page 1: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam

The Pennsylvania State University

Page 2: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Summary

• In modern multi-core systems, an increasing number of cores share common resources
– “Memory Wall”

• Application/core contention → Interference

• Average 10% improvement in application performance

Proposal: A novel memory-side prefetching scheme

Mitigates interference while exploiting row buffer locality

Page 3: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Page 4: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Network On-Chip based CMP

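[Figure: network-on-chip based CMP. Labels: request message, response message, memory controllers MC0 to MC3, and a node containing a core (C), L1 and L2 caches, and R.]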

Page 5: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory Controller

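[Figure: memory controller (MC) between the CPU and DRAM, with per-bank request queues (Bank 0: F21, G12, C41, B5, H22, B4; Bank 1) and DRAM rows A and B. Row buffer conflict: precharge row A, activate row B. Row buffer hit: a request (B4) to the already open row B.]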
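The hit and conflict cases on this slide can be illustrated with a minimal sketch. It assumes a single bank with an open-page policy; the `Bank` class and the "empty" outcome are illustrative names, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bank:
    """One DRAM bank with a single row buffer (open-page policy assumed)."""
    open_row: Optional[str] = None

    def access(self, row: str) -> str:
        """Classify an access and update the row buffer."""
        if self.open_row == row:
            return "hit"            # data is already in the row buffer
        if self.open_row is None:
            outcome = "empty"       # only an activate is needed
        else:
            outcome = "conflict"    # precharge the old row, then activate
        self.open_row = row
        return outcome


# Row A is open; two requests to row B follow, then one back to row A.
bank = Bank(open_row="A")
outcomes = [bank.access(row) for row in ["B", "B", "A"]]
print(outcomes)
```

The first request to row B pays for a precharge and an activate, the second is served from the row buffer, and returning to row A is a conflict again.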

Page 6: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Page 7: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Impact of Interference

[Figure: row buffer hit rate (0 to 100) for bzip2, GemsFDTD, lbm, libquantum, mcf, milc, sphinx3, and zeusmp, comparing each application run individually against Mix-8.]

How to handle this negative interference?

Page 8: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Latency Breakdown of L2 Miss

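[Figure: three pie charts breaking L2 miss latency into on-chip, off-chip queueing, and off-chip access components. Values as extracted, per chart: High MPKI: 18%, 60%, 22%; Moderate MPKI: 35%, 19%, 46%; Low MPKI: 43%, 8%, 49%.]

Off-chip latency is the majority part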

Page 9: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Observations

• Memory requests from multiple cores interleave at the memory controllers
– Row buffer locality of individual apps is lost

• Off-chip latency accounts for the majority of the memory access latency

• On-chip network and caches are critical
– Cannot afford to pollute them
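The first observation can be illustrated with a toy model. It assumes a single bank that serves requests strictly in arrival order (a real scheduler may reorder them), and the two traces are hypothetical.

```python
def row_buffer_hit_rate(rows):
    """Fraction of accesses that find their row already open (single bank)."""
    open_row, hits = None, 0
    for row in rows:
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(rows)


# Hypothetical traces: each app streams through its own row.
app0 = ["A"] * 8
app1 = ["B"] * 8

# Round-robin interleaving, as if both apps reach the same bank.
mixed = [row for pair in zip(app0, app1) for row in pair]

alone = row_buffer_hit_rate(app0)
shared = row_buffer_hit_rate(mixed)
print(alone, shared)
```

Run alone, the streaming app hits the row buffer on every access after the first; once the two streams alternate, every access closes the other app's row and the hit rate drops to zero in this model.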

Page 10: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

What about Cache Prefetching?

• Not effective for large CMPs

• Agnostic to memory state
– Gap between caches and memory (62% latency increase)

• On-chip resource pollution
– Both caches and network (22% network latency increase)

• Difficulty of stream detection in S-NUCA
– Each L2 bank caters to only a portion of the address space
– Each L2 bank gets requests from multiple L1s

• Our memory-side prefetching scheme can work along with core-side prefetching

Page 11: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Page 12: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory-Side Prefetching

• Objective 1
– Reduce off-chip access latency

• Objective 2
– Without increasing on-chip resource contention

Page 13: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory-Side Prefetching

• What to Prefetch?
• When to Prefetch?
• Where to Prefetch?

Page 14: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

What to Prefetch?

• Prefetch from an open row
– Minimizes overhead

• Looked at the line access patterns within a row

Page 15: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

What to Prefetch?

[Figure: milc. Percentage of accesses (y-axis) to each cache line within a row (x-axis, labeled Line 0 through Line 60).]

Page 16: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

What to Prefetch?

[Figure: libquantum. Percentage of accesses (y-axis, 0 to 100) to each cache line within a row (x-axis, labeled Line 0 through Line 56).]

[Figure: omnetpp. Percentage of accesses (y-axis, 0 to 20) to each cache line within a row (x-axis, labeled Line 0 through Line 60).]

In general, next-line locality is good
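As a rough illustration of what next-line locality means, the sketch below measures it on a trace of line indices within one row and picks the next-line prefetch candidate. The trace and the 64-lines-per-row constant are assumptions made for this example, not values stated on the slides.

```python
LINES_PER_ROW = 64  # assumption: 64 cache lines per DRAM row


def next_line_locality(lines):
    """Fraction of consecutive accesses within a row that go to line + 1."""
    pairs = list(zip(lines, lines[1:]))
    if not pairs:
        return 0.0
    return sum(1 for cur, nxt in pairs if nxt == cur + 1) / len(pairs)


def next_line_candidate(line):
    """Line to prefetch after a demand access, or None at the row boundary."""
    nxt = line + 1
    return nxt if nxt < LINES_PER_ROW else None


# Hypothetical trace of line indices touched while one row stays open.
trace = [0, 1, 2, 3, 16, 17, 18, 40]
locality = next_line_locality(trace)
print(round(locality, 3), next_line_candidate(18), next_line_candidate(63))
```

Five of the seven transitions in this trace go to the adjacent line, so a next-line prefetcher would have covered most of them without leaving the open row.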

Page 17: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

When to Prefetch?

Policy                                   Critical Path   Locality   # of Prefetches
Prefetch at RBC (row buffer conflict)    Yes             No         High
Prefetch at RBH (row buffer hit)         No              Yes        Low
Prefetch at Row ACT (row activation)     No              No         High
Prefetch at Idle                         No              Yes        High

[Figure: histogram of idle periods. x-axis: idle period length in cycles (1 to 33+); y-axis: number of idle periods (0 to 1000000); one bar exceeds the axis and is labeled 5618579.]

Page 18: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Where to Prefetch?

• Should be stored on-chip

• Prefetch buffers in the memory controllers
– To avoid on-chip resource pollution

• Organization
– Per-core
– Shared

Page 19: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory-Side Prefetching Optimizations

• Applications vary in memory behavior

• Prefetch Throttling
– Feedback

• Precharge on Prefetch
– Less likely to get a request

• Avert Costly Prefetches
– Waiting demand requests
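One way to picture feedback-based throttling together with the "avert costly prefetches" rule is sketched below. The slides do not give the mechanism, so the epoch structure, the accuracy thresholds, and the `PrefetchThrottle` name are all assumptions chosen for illustration.

```python
class PrefetchThrottle:
    """Adjusts prefetch degree from per-epoch accuracy (hypothetical policy)."""

    def __init__(self, max_degree=4, low=0.25, high=0.75):
        self.max_degree = max_degree
        self.low, self.high = low, high   # assumed accuracy thresholds
        self.degree = 1
        self.issued = 0
        self.used = 0

    def record(self, issued, used):
        """Count prefetches issued and prefetches later hit by a demand."""
        self.issued += issued
        self.used += used

    def can_prefetch(self, waiting_demands):
        """Skip prefetching while demand requests are waiting."""
        return self.degree > 0 and waiting_demands == 0

    def end_epoch(self):
        """Update the degree from this epoch's accuracy, then reset counters."""
        if self.issued == 0:
            # Nothing to measure; re-enable a minimal degree to probe again.
            self.degree = max(self.degree, 1)
        else:
            accuracy = self.used / self.issued
            if accuracy >= self.high:
                self.degree = min(self.degree + 1, self.max_degree)
            elif accuracy < self.low:
                self.degree = max(self.degree - 1, 0)
        self.issued = self.used = 0
        return self.degree


throttle = PrefetchThrottle()
degrees = []
for issued, used in [(10, 9), (10, 1), (10, 0), (0, 0)]:
    throttle.record(issued, used)
    degrees.append(throttle.end_epoch())
print(degrees)
```

An accurate epoch raises the degree, inaccurate epochs lower it until prefetching stops, and an epoch with no prefetches re-enables a single prefetch so accuracy can be measured again.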

Page 20: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory-Side Prefetching: Example

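[Figure: memory-side prefetching example. The memory controller (MC) sits between the CPU and DRAM with a Bank 0 request queue (F21, G12, C41, C26, H22, A10), a Bank 1 queue, and DRAM rows A and B. Per-core prefetch buffers: Core 0: A11, A12, A13, A14; Core 1: C32, C33, C34, C36; Core 2: R12, R13, R14, R15; Core 3: F20, F21, F22, F23. Request A10 is a row buffer hit and triggers a prefetch from row A (A11).]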

Page 21: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Memory-Side Prefetching: Comparison

                             Cache Prefetcher       Existing Memory Prefetchers   Our Memory-side
                             [Lui et al. ILP ‘11]   [Lin HPCA ‘01]                Prefetcher
Memory State Aware           No                     Yes                           Yes
On-chip resource pollution   Yes                    Yes                           No
Accuracy                     Yes                    No                            Yes

Page 22: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Implementation

• Prefetch Buffer Implementation
– Organized as n per-core prefetch buffers
– 256 KB per Memory Controller (<3% compared to LLC)
– < 1% Area and Power overhead

• Prefetch Request Timing
– Requests are generated internally by the memory controller along with a read row buffer hit request
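The request timing described here can be sketched as a toy memory controller that issues a next-line prefetch whenever a demand read hits the open row and keeps the result in a per-core buffer. This is a simplified model, not the authors' implementation: it assumes one bank, a prefetch degree of one, FIFO replacement, and the illustrative constants `LINES_PER_ROW` and `BUFFER_ENTRIES`.

```python
from collections import OrderedDict

LINES_PER_ROW = 64   # assumption: cache lines per DRAM row
BUFFER_ENTRIES = 4   # assumption: tiny per-core buffer for illustration


class MemoryController:
    """Single-bank controller with per-core prefetch buffers (toy model)."""

    def __init__(self, num_cores):
        self.open_row = None
        # One small FIFO prefetch buffer per core, keyed by (row, line).
        self.buffers = [OrderedDict() for _ in range(num_cores)]

    def read(self, core, row, line):
        """Serve a demand read; return where the data came from."""
        buf = self.buffers[core]
        if (row, line) in buf:
            del buf[(row, line)]
            return "prefetch_buffer"            # no DRAM access needed
        if self.open_row == row:
            self._prefetch(core, row, line + 1)  # piggyback on the hit
            return "row_hit"
        self.open_row = row                      # precharge + activate
        return "row_miss"

    def _prefetch(self, core, row, line):
        if line >= LINES_PER_ROW:
            return                               # stay inside the open row
        buf = self.buffers[core]
        if len(buf) >= BUFFER_ENTRIES:
            buf.popitem(last=False)              # evict the oldest entry
        buf[(row, line)] = True


mc = MemoryController(num_cores=4)
results = [mc.read(0, "A", line) for line in (9, 10, 11, 12)]
print(results)
```

In this run the read of line 10 hits the open row and brings line 11 into core 0's buffer, so the following read of line 11 is answered at the memory controller without touching the caches or the DRAM bank.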

Page 23: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Page 24: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Evaluation Platform

• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32KB L1I; 32KB L1D; 1MB L2 per core
• Memory: 16GB DDR3-1600 with 4 Memory Channels

• GEMS simulator with GARNET

Page 25: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Evaluation Methodology

• Benchmarks:
– Multi-programmed: SPEC 2006 (WL1 to WL5)
– Multi-threaded: SPECOMP 2001 (WL6 & WL7)

• Metrics:
– Harmonic IPC
– Off-chip and On-chip Latencies
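The slides do not define "Harmonic IPC". Assuming it refers to the harmonic mean of the per-application IPC values in a workload, it could be computed as in the sketch below; the IPC values are made up for the example.

```python
def harmonic_mean_ipc(ipcs):
    """Harmonic mean of per-application IPC values (assumed definition)."""
    if not ipcs or any(ipc <= 0 for ipc in ipcs):
        raise ValueError("IPC values must be positive and non-empty")
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)


# Hypothetical per-application IPCs in one workload mix.
hm = harmonic_mean_ipc([1.0, 0.5, 0.25])
print(round(hm, 4))
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the slowest applications, so a scheme cannot score well by speeding up a few applications while starving the rest.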

Page 26: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

IPC

[Figure: IPC improvement (y-axis, -10 to 20) for WL1 to WL7 and AVG under CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; one bar exceeds the axis and is labeled 33.2; an annotation marks 10%.]

Page 27: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Latency

[Figure: latency in cycles (y-axis, 0 to 600) for WL1 to WL7 and AVG under No Pref, CSP, MSP, IDLE-PUSH, and CSP+MSP; an annotation marks -48.5%.]

Page 28: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Latency

[Figure: latency in cycles (y-axis, 0 to 600) for WL1 to WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; an annotation marks -48.5%.]

Page 29: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

L2 Hitrate

[Figure: L2 hit rate (y-axis, 0 to 100) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Page 30: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Row Buffer Hitrate

[Figure: row buffer hit rate (y-axis, 0 to 80) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Page 31: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Page 32: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Conclusion

• Proposed a new memory-side prefetcher
– Opportunistic
– Instantaneous knowledge of memory state

• Prefetching Midway
– Doesn’t pollute on-chip resources

• Reduces the off-chip latency by 48.5% and improves performance by 6.2% on average

• Our technique can be combined with core-side prefetching to amplify the benefits

Page 33: Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Thank You

• Questions?