Meeting Midway: Improving CMP Performance with Memory-Side Prefetching

Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam

The Pennsylvania State University

Summary

• In modern multi-core systems, an increasing number of cores share common resources
  – “Memory Wall”

• Application/core contention leads to interference

Proposal: A novel memory-side prefetching scheme

• Mitigates interference while exploiting row buffer locality

• Average 10% improvement in application performance

Outline

• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Network On-Chip based CMP

[Figure: NoC-based CMP. Labels: Request Message, Response Message; memory controllers MC0, MC1, MC2 and MC3; tile components L1, L2, C and R.]

Memory Controller

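[Figure: memory controller (MC) between the CPU and DRAM (Bank 0, Bank 1). Request queue: F21, G12, C41, B5, H22, B4.]

• Row Buffer Conflict: row A is open and a request targets row B
  – Precharge row A
  – Activate row B

• Row Buffer Hit: the request targets row B, which is already open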
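The hit and conflict cases on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the `Bank` class, the `classify` helper and the single-bank, open-page model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bank:
    """Hypothetical single DRAM bank with an open-page policy."""
    open_row: Optional[str] = None  # row currently held in the row buffer

    def access(self, row: str) -> str:
        if self.open_row == row:
            return "hit"  # data is served straight from the row buffer
        # A different row is open (conflict) or no row is open yet (closed).
        outcome = "closed" if self.open_row is None else "conflict"
        self.open_row = row  # precharge the old row if any, then activate `row`
        return outcome


def classify(requests: List[Tuple[str, int]]) -> List[str]:
    """Classify each (row, line) request in arrival order."""
    bank = Bank()
    return [bank.access(row) for row, _line in requests]


# Row A is opened first; B4 then conflicts with A, and B5 hits the open row B.
outcomes = classify([("A", 1), ("B", 4), ("B", 5)])
print(outcomes)
```

The conflict case pays for a precharge and an activate before the column access, which is why preserving row buffer hits matters for off-chip latency.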


Impact of Interference

[Figure: row buffer hit rate (0 to 100) for bzip2, GemsFDTD, lbm, libquantum, mcf, milc, sphinx3 and zeusmp. Series: Individual, Mix-8.]

How to handle this negative interference?

Latency Breakdown of L2 Miss

[Figure: three pie charts of L2 miss latency, one per MPKI class. Legend: On-chip, Off-chip Queueing, Off-chip Access.]

High MPKI: 18%, 60%, 22%

Moderate MPKI: 35%, 19%, 46%

Low MPKI: 43%, 8%, 49%

Off-chip latency is the majority part

Observations

• Memory requests from multiple cores interleave at the memory controllers
  – Row buffer locality of individual apps is lost

• Off-chip latency is the majority part in a memory access

• On-chip network and caches are critical
  – Cannot afford to pollute them
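The first observation can be illustrated with a toy model. The sketch below assumes a single bank with an open-page policy and a round-robin arrival order at the memory controller; `row_hit_rate` and `interleave` are hypothetical helpers, and the streams are made up rather than taken from the evaluated workloads.

```python
from itertools import chain, zip_longest


def row_hit_rate(rows):
    """Fraction of accesses that find their row already open (one bank, open-page)."""
    open_row, hits = None, 0
    for row in rows:
        if row == open_row:
            hits += 1
        open_row = row  # the accessed row stays open for the next request
    return hits / len(rows)


def interleave(*streams):
    """Round-robin merge of per-core request streams (assumed arrival order)."""
    merged = chain.from_iterable(zip_longest(*streams))
    return [row for row in merged if row is not None]


core0 = ["A"] * 8  # core 0 streams through row A
core1 = ["B"] * 8  # core 1 streams through row B

alone = row_hit_rate(core0)                     # 7 of 8 accesses hit
mixed = row_hit_rate(interleave(core0, core1))  # A, B, A, B, ... never hits
print(alone, mixed)
```

Each stream has high locality on its own, yet the merged stream alternates rows on every request, so the locality is lost at the bank.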

What about Cache Prefetching?

• Not effective for large CMPs

• Agnostic to memory state
  – Gap between caches and memory (62% latency increase)

• On-chip resource pollution
  – Both caches and network (22% network latency increase)

• Difficulty of stream detection in S-NUCA
  – Each L2 bank caters to only a portion of the address space
  – Each L2 bank gets requests from multiple L1s

• Our memory-side prefetching scheme can work along with core-side prefetching


Memory-Side Prefetching

• Objective 1
  – Reduce off-chip access latency

• Objective 2
  – Without increasing on-chip resource contention

Memory-Side Prefetching

• What to Prefetch?
• When to Prefetch?
• Where to Prefetch?

What to Prefetch?

• Prefetch from an open row
  – Minimizes overhead

• Looked at the line access patterns within a row

What to Prefetch?

[Figure: milc. % of accesses (0 to 45) by line position within a row; axes labelled Line 0 to Line 60 and Line 0 to Line 55.]

What to Prefetch?

[Figure: libquantum. % of accesses (0 to 100) by line position within a row; axes labelled Line 0 to Line 56 and Line 0 to Line 48.]

[Figure: omnetpp. % of accesses (0 to 20) by line position within a row; axes labelled Line 0 to Line 56 and Line 0 to Line 60.]

In general, next-line locality is good
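One way to quantify next-line locality from an access trace is sketched below. The metric definition (an access counts when it targets the line right after the previous access to the same row) and the `next_line_fraction` helper are assumptions for illustration; the slides do not state how the plots were computed.

```python
def next_line_fraction(trace):
    """trace: list of (row, line) accesses in arrival order.

    Returns the fraction of repeat accesses to a row that target the line
    immediately after the previously accessed line of that row.
    """
    last_line = {}  # row -> line index of the most recent access to that row
    repeats = 0     # accesses to a row that has been seen before
    follows = 0     # repeats that land on last_line + 1
    for row, line in trace:
        if row in last_line:
            repeats += 1
            if line == last_line[row] + 1:
                follows += 1
        last_line[row] = line
    return follows / repeats if repeats else 0.0


# Row 7 is walked sequentially (0, 1, 2, 3); row 3 jumps from line 10 to 12.
trace = [(7, 0), (7, 1), (3, 10), (7, 2), (3, 12), (7, 3)]
fraction = next_line_fraction(trace)
print(fraction)
```

A high fraction suggests that prefetching the next line from the open row is likely to be useful.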

When to Prefetch?

                      Critical Path   Locality   # of Prefetches
Prefetch at RBC       Yes             No         High
Prefetch at RBH       No              Yes        Low
Prefetch at Row ACT   No              No         High
Prefetch at Idle      No              Yes        High

[Figure: histogram of idle periods by length in cycles (1 to 33+); y-axis ticks 0, 500000, 1000000; one bar annotated 5618579.]
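An idle-period histogram of this kind can be derived from a per-cycle busy/idle timeline. The sketch below is an assumption about how such data might be collected: `idle_period_histogram` is a hypothetical helper, and the `cap` value of 33 simply mirrors the "33+" bucket on the axis.

```python
from collections import Counter


def idle_period_histogram(busy, cap=33):
    """busy: iterable of booleans, one per cycle (True = bank or bus busy).

    Returns a Counter mapping idle-period length to the number of periods of
    that length, with lengths of `cap` or more folded into the `cap` bucket.
    """
    hist = Counter()
    run = 0  # length of the idle period currently being measured
    for is_busy in busy:
        if is_busy:
            if run:
                hist[min(run, cap)] += 1
            run = 0
        else:
            run += 1
    if run:  # the timeline ended while idle
        hist[min(run, cap)] += 1
    return hist


# Idle periods of 2, 1 and 40 cycles; the last one lands in the 33+ bucket.
timeline = [True, False, False, True, False, True] + [False] * 40
hist = idle_period_histogram(timeline)
print(sorted(hist.items()))
```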

Where to Prefetch?

• Should be stored on-chip

• Prefetch buffers in the memory controllers
  – To avoid on-chip resource pollution

• Organization
  – Per-core
  – Shared
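The per-core organization can be sketched as one small buffer per core inside the memory controller. The class below is illustrative only: the name `PerCorePrefetchBuffers`, the oldest-first eviction and the consume-on-hit behaviour are assumptions, since the slides do not specify replacement or lookup policy.

```python
from collections import OrderedDict


class PerCorePrefetchBuffers:
    """Hypothetical per-core prefetch buffers held in a memory controller."""

    def __init__(self, num_cores, lines_per_core):
        self.capacity = lines_per_core
        self.buffers = [OrderedDict() for _ in range(num_cores)]

    def insert(self, core, addr):
        """Store a prefetched line for `core`, evicting the oldest if full."""
        buf = self.buffers[core]
        buf[addr] = True
        buf.move_to_end(addr)
        if len(buf) > self.capacity:
            buf.popitem(last=False)  # assumed policy: drop the oldest entry

    def lookup(self, core, addr):
        """Return True and consume the line if a demand request hits."""
        return self.buffers[core].pop(addr, None) is not None


bufs = PerCorePrefetchBuffers(num_cores=4, lines_per_core=2)
for addr in ("A11", "A12", "A13"):
    bufs.insert(0, addr)  # the third insert evicts A11 from core 0's buffer
print(list(bufs.buffers[0]))
```

A shared organization would replace the list of buffers with a single structure, trading isolation between cores for better use of the total capacity.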

Memory-Side Prefetching Optimizations

• Applications vary in memory behavior

• Prefetch Throttling
  – Feedback

• Precharge on Prefetch
  – Less likely to get a request

• Avert Costly Prefetches
  – Waiting demand requests
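Feedback-driven throttling can take many forms; the sketch below shows one generic variant based on prefetch accuracy. Everything specific in it is an assumption rather than the authors' mechanism: the `PrefetchThrottle` class, the interval-based update, the 0.75 and 0.40 thresholds and the degree range of 1 to 4.

```python
class PrefetchThrottle:
    """Hypothetical accuracy-based throttle for a prefetcher's degree."""

    def __init__(self, degree=2, min_degree=1, max_degree=4, high=0.75, low=0.40):
        self.degree = degree          # lines prefetched per trigger
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.high = high              # accuracy at or above this raises the degree
        self.low = low                # accuracy below this lowers the degree
        self.issued = 0
        self.useful = 0

    def record_issue(self, n=1):
        self.issued += n

    def record_useful(self, n=1):
        self.useful += n

    def end_interval(self):
        """Adjust the degree from this interval's accuracy, then reset counters."""
        if self.issued:
            accuracy = self.useful / self.issued
            if accuracy >= self.high:
                self.degree = min(self.degree + 1, self.max_degree)
            elif accuracy < self.low:
                self.degree = max(self.degree - 1, self.min_degree)
        self.issued = self.useful = 0
        return self.degree


throttle = PrefetchThrottle()
throttle.record_issue(10)
throttle.record_useful(9)          # 90% of prefetches were used
after_good = throttle.end_interval()
throttle.record_issue(10)
throttle.record_useful(1)          # only 10% were used
after_bad = throttle.end_interval()
print(after_good, after_bad)
```

Because applications vary in memory behavior, a per-core instance of such a throttle would let an application with poor locality back off without penalizing the others.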

Memory-Side Prefetching: Example

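[Figure: memory controller (MC) between the CPU and DRAM (Bank 0, Bank 1). Request queue: F21, G12, C41, C26, H22, A10. Per-core prefetch buffers: Core 0: A11, A12, A13, A14; Core 1: C32, C33, C34, C36; Core 2: R12, R13, R14, R15; Core 3: F20, F21, F22, F23. Request A10 is a row buffer hit; A11 is then prefetched from row A.]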

Memory-Side Prefetching: Comparison

                             Cache Prefetcher       Existing Memory Prefetchers   Our Memory-side
                             [Lui et al. ILP ‘11]   [Lin HPCA ‘01]                Prefetcher
Memory State Aware           No                     Yes                           Yes
On-chip resource pollution   Yes                    Yes                           No
Accuracy                     Yes                    No                            Yes

Implementation

• Prefetch Buffer Implementation
  – Organized as n per-core prefetch buffers
  – 256 KB per Memory Controller (<3% compared to LLC)
  – < 1% Area and Power overhead

• Prefetch Request Timing
  – Requests are generated internally by the memory controller along with a read row buffer hit request
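The request timing described above can be sketched as a controller that issues a next-line prefetch whenever a read hits the open row. This is a simplified model under stated assumptions: the `MemoryController` class, the single bank, the unbounded per-core sets and `LINES_PER_ROW = 64` are all illustrative choices, not details confirmed by the slides.

```python
LINES_PER_ROW = 64  # assumed number of cache lines per DRAM row


class MemoryController:
    """Hypothetical controller with one bank and per-core prefetch buffers."""

    def __init__(self, num_cores):
        self.open_row = None
        # Unbounded sets keep the sketch short; a real buffer has fixed capacity.
        self.prefetched = [set() for _ in range(num_cores)]

    def read(self, core, row, line):
        """Serve a demand read and report where the data came from."""
        if (row, line) in self.prefetched[core]:
            self.prefetched[core].discard((row, line))
            return "prefetch-buffer"  # no DRAM access needed
        if row == self.open_row:
            # Read row buffer hit: also fetch the next line of the open row.
            if line + 1 < LINES_PER_ROW:
                self.prefetched[core].add((row, line + 1))
            return "row-hit"
        self.open_row = row  # precharge the old row, activate the new one
        return "row-miss"


mc = MemoryController(num_cores=4)
results = [
    mc.read(0, "A", 9),   # opens row A
    mc.read(0, "A", 10),  # row hit, A11 is prefetched for core 0
    mc.read(1, "B", 3),   # core 1 closes row A by opening row B
    mc.read(0, "A", 11),  # served from the buffer although row A is closed
]
print(results)
```

The last request shows the intended benefit: interference from core 1 closed row A, yet core 0's next line is served without reopening it.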


Evaluation Platform

• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32KB L1I; 32KB L1D; 1MB L2 per core
• Memory: 16GB DDR3-1600 with 4 Memory Channels

• GEMS simulator with GARNET

Evaluation Methodology

• Benchmarks:
  – Multi-programmed: SPEC 2006 (WL1 to WL5)
  – Multi-threaded: SPECOMP 2001 (WL6 & WL7)

• Metrics:
  – Harmonic IPC
  – Off-chip and On-chip Latencies
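One common reading of "Harmonic IPC" is the harmonic mean of the per-application IPC values in a workload; whether the authors also normalize each value to its standalone IPC is not stated here, so the sketch below is an assumption. The `harmonic_mean` helper and the sample values are illustrative.

```python
def harmonic_mean(values):
    """Harmonic mean of positive values, e.g. per-application IPCs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty list of positive values")
    return len(values) / sum(1.0 / v for v in values)


# Made-up per-application IPCs for a three-application mix.
ipcs = [1.0, 0.5, 0.25]
h_ipc = harmonic_mean(ipcs)  # 3 / (1 + 2 + 4)
print(round(h_ipc, 4))
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the slowest application, so it penalizes configurations that speed up some applications at the expense of others.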

IPC

[Figure: IPC improvement (axis -10 to 20) for WL1 to WL7 and AVG. Series: CSP, MSP, MSP-PUSH, IDLE-PUSH, CSP+MSP. Annotations: 33.2, 10%.]

Latency

[Figure: latency in cycles (0 to 600) for WL1 to WL7 and AVG. Series: No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, CSP+MSP. Annotation: -48.5%.]

L2 Hitrate


[Figure: L2 hit rate (axis 20 to 100). Series: No Pref, CSP, MSP, CSP+MSP.]

Row Buffer Hitrate


[Figure: row buffer hit rate (axis 10 to 80). Series: No Pref, CSP, MSP, CSP+MSP.]


Conclusion

• Proposed a new memory-side prefetcher
  – Opportunistic
  – Instantaneous knowledge of memory state

• Prefetching Midway
  – Doesn’t pollute on-chip resources

• Reduces the off-chip latency by 48.5% and improves performance by 6.2% on average

• Our technique can be combined with core-side prefetching to amplify the benefits

Thank You

• Questions?
