Meeting Midway: Improving CMP Performance with Memory-Side Prefetching
Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam
The Pennsylvania State University
Summary
• In modern multi-core systems, an increasing number of cores share common resources, running into the "Memory Wall"
• Contention among applications/cores causes interference
• Average 10% improvement in application performance
• Proposal: A novel memory-side prefetching scheme
  – Mitigates interference while exploiting row buffer locality
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Network On-Chip based CMP
[Figure: A network-on-chip based CMP. Each tile contains a core (C) with private L1 caches, a shared L2 bank, and a router (R); four memory controllers (MC0 to MC3) sit at the chip boundary. Request messages travel from the cores over the mesh to the memory controllers, and response messages travel back.]
Memory Controller
[Figure: The memory controller (MC) sits between the CPU and DRAM, with a request queue (F21, G12, C41, B5, H22, B4) feeding two banks. When a request for row B reaches a bank whose row buffer holds row A, it causes a row buffer conflict: row A must be precharged and row B activated. A later request to the now-open row B is a row buffer hit, served directly from the row buffer.]
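To make the hit/conflict asymmetry concrete, here is a minimal C++ sketch of an open-row bank (not from the talk; the timing constants are illustrative DDR3-like values):

```cpp
#include <cstdint>
#include <iostream>

struct Bank {
    int64_t open_row = -1;            // -1: no row in the row buffer
    static constexpr int tCAS = 11;   // column access (hit cost)
    static constexpr int tRP  = 11;   // precharge the old row
    static constexpr int tRCD = 11;   // activate the new row

    int access(int64_t row) {
        if (row == open_row) return tCAS;        // row buffer hit
        int lat = (open_row == -1) ? 0 : tRP;    // close the old row, if any
        open_row = row;                          // activate the new row
        return lat + tRCD + tCAS;                // miss or conflict path
    }
};

int main() {
    Bank b;
    std::cout << "first access to A (row miss):      " << b.access('A') << " cycles\n";
    std::cout << "second access to A (row hit):      " << b.access('A') << " cycles\n";
    std::cout << "access to B (row buffer conflict): " << b.access('B') << " cycles\n";
}
```

In this sketch a conflict costs three times a hit, which is why losing row buffer locality is so expensive.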
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Impact of Interference
[Figure: Row buffer hit rate (%) for bzip2, GemsFDTD, lbm, libquantum, mcf, milc, sphinx3, and zeusmp, run individually versus in an 8-application mix (Mix-8). Hit rates drop under Mix-8 as interleaved requests destroy each application's row buffer locality.]
How to handle this negative interference?
Latency Breakdown of L2 Miss
                On-chip   Off-chip Queueing   Off-chip Access
High MPKI         18%            60%                22%
Moderate MPKI     35%            19%                46%
Low MPKI          43%             8%                49%

Off-chip latency (queueing + access) is the majority of the L2 miss latency in every case.
Observations
• Memory requests from multiple cores interleave at the memory controllers
  – Row buffer locality of individual applications is lost
• Off-chip latency is the dominant component of a memory access
• The on-chip network and caches are critical resources
  – We cannot afford to pollute them
What about Cache Prefetching?
• Not effective for large CMPs
• Agnostic to memory state
  – Gap between caches and memory (62% latency increase)
• Pollutes on-chip resources
  – Both caches and network (22% network latency increase)
• Stream detection is difficult in S-NUCA
  – Each L2 bank caters to only a portion of the address space
  – Each L2 bank receives requests from multiple L1s
• Our memory-side prefetching scheme can work alongside core-side prefetching
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Memory-Side Prefetching
• Objective 1: Reduce off-chip access latency
• Objective 2: Do so without increasing on-chip resource contention
Memory-Side Prefetching
Three design questions:
• What to prefetch?
• When to prefetch?
• Where to prefetch?
What to Prefetch?
• Prefetch from an open row
  – Minimizes overhead
• Looked at the line access patterns within a row
[Figure: Distribution of accesses (% of accesses) over the 64 lines of a DRAM row for milc, libquantum, and omnetpp.]
In general, next-line locality is good
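As an illustration of how a controller could turn this observation into prefetch candidates, here is a small C++ sketch (the function name and the row geometry of 4 KB rows with 64 B lines are our assumptions):

```cpp
#include <vector>

constexpr int kLinesPerRow = 64;   // assumed: 4 KB rows, 64 B cache lines

// Given the line index of a demand hit within the open row, return the
// next-line candidates that stay inside the same row, so every prefetch
// is a cheap column access with no extra activate.
std::vector<int> nextLineCandidates(int line, int degree) {
    std::vector<int> out;
    for (int i = 1; i <= degree && line + i < kLinesPerRow; ++i)
        out.push_back(line + i);
    return out;
}
```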
When to Prefetch?

Trigger                On Critical Path?   Exploits Locality?   # of Prefetches
Prefetch at RBC        Yes                 No                   High
Prefetch at RBH        No                  Yes                  Low
Prefetch at Row ACT    No                  No                   High
Prefetch at Idle       No                  Yes                  High

Prefetching at a row buffer hit (RBH) is the sweet spot: it stays off the critical path, exploits confirmed locality, and issues few prefetches.
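A minimal sketch of that trigger decision (the names are illustrative; the idle-push flag anticipates the IDLE-PUSH variant evaluated later):

```cpp
// Possible prefetch trigger events seen at the memory controller.
enum class Event { RowBufferConflict, RowBufferHit, RowActivate, BankIdle };

// Issue prefetches only off the critical path and where locality is
// already confirmed: on a row buffer hit (and optionally when idle).
bool shouldPrefetch(Event e, bool idlePushEnabled = false) {
    if (e == Event::RowBufferHit) return true;
    return idlePushEnabled && e == Event::BankIdle;
}
```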
[Figure: Histogram of bank idle-period lengths in cycles (bins 1 through 33+, y-axis up to 1,000,000 periods); one bin is annotated with 5,618,579 occurrences.]
Where to Prefetch?
• Prefetched data should be stored on-chip
• Prefetch buffers in the memory controllers
  – Avoids on-chip resource pollution (caches and network)
• Organization
  – Per-core (sketched below)
  – Shared
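A minimal sketch of a per-core prefetch buffer at the memory controller (the FIFO replacement policy and the interface are our assumptions, not details from the talk):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Demand reads from core i first probe buffer i; a hit is served from
// the controller without touching DRAM or polluting caches/network.
struct PrefetchBuffer {
    size_t capacity;                 // capacity in cache lines
    std::deque<uint64_t> fifo;       // insertion order, for replacement
    std::unordered_set<uint64_t> lines;

    explicit PrefetchBuffer(size_t cap) : capacity(cap) {}

    void insert(uint64_t lineAddr) {
        if (lines.count(lineAddr)) return;
        if (fifo.size() == capacity) {        // evict the oldest line
            lines.erase(fifo.front());
            fifo.pop_front();
        }
        fifo.push_back(lineAddr);
        lines.insert(lineAddr);
    }
    bool contains(uint64_t lineAddr) const { return lines.count(lineAddr) > 0; }
};
```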
Memory-Side Prefetching Optimizations
• Applications vary in memory behavior
• Prefetch Throttling
  – Feedback-driven: scale aggressiveness with observed prefetch usefulness (sketched below)
• Precharge on Prefetch
  – Once a row has been prefetched from, it is less likely to receive another request, so precharge it early
• Avert Costly Prefetches
  – Do not let prefetches delay waiting demand requests
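A sketch of feedback-driven throttling, assuming per-epoch accuracy counters (the thresholds and degree bounds are illustrative, not the talk's values):

```cpp
// Periodically compare useful (consumed) vs. issued prefetches and
// scale the prefetch degree up or down accordingly.
struct Throttle {
    unsigned issued = 0, useful = 0;
    int degree = 4;                       // lines prefetched per trigger

    void onIssue() { ++issued; }          // a prefetch left the controller
    void onUse()   { ++useful; }          // a demand hit the prefetch buffer

    void epochEnd() {                     // called every N memory cycles
        if (issued == 0) return;
        double acc = double(useful) / issued;
        if (acc < 0.25 && degree > 0) --degree;       // wasteful: back off
        else if (acc > 0.75 && degree < 8) ++degree;  // accurate: ramp up
        issued = useful = 0;
    }
};
```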
Memory-Side Prefetching: Example
[Figure: Four cores send request streams to the memory controller: Core 1 (C32, C33, C34, C36), Core 2 (R12, R13, R14, R15), and Core 3 (F20, F21, F22, F23). Core 0's demand request A10 hits the open row A in Bank 0 (a row buffer hit), so the controller prefetches the following lines A11, A12, A13, and A14 from the open row into Core 0's prefetch buffer. When Core 0 later requests A11, it is served from the prefetch buffer instead of DRAM.]
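The same example as a tiny self-contained trace (the addresses and buffer representation are hypothetical):

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<uint64_t> core0Buffer;   // Core 0's prefetch buffer
    const uint64_t rowA = 0xA0000;              // hypothetical row base address

    // Demand read A10 hits the open row A -> prefetch the next four lines.
    for (uint64_t line = 11; line <= 14; ++line)
        core0Buffer.insert(rowA + line);

    // The later demand for A11 is served on-chip, not from DRAM.
    std::cout << (core0Buffer.count(rowA + 11) ? "buffer hit" : "miss") << "\n";
}
```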
Memory-Side Prefetching: Comparison

                             Cache Prefetcher       Existing Memory              Our Memory-Side
                             [Lui et al. ILP '11]   Prefetchers [Lin HPCA '01]   Prefetcher
Memory state aware           No                     Yes                          Yes
On-chip resource pollution   Yes                    Yes                          No
Accurate                     Yes                    No                           Yes
Implementation
• Prefetch Buffer Implementation
  – Organized as n per-core prefetch buffers (sizing sketched below)
  – 256 KB per memory controller (<3% of LLC capacity)
  – <1% area and power overhead
• Prefetch Request Timing
  – Prefetch requests are generated internally by the memory controller, alongside a demand read that hits the row buffer
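Back-of-the-envelope sizing under the stated configuration (32 cores, 64 B cache lines; the even split of each controller's 256 KB across per-core buffers is our assumption):

```cpp
// Even-split sizing of one controller's buffer capacity (assumption).
constexpr int kCores        = 32;
constexpr int kLineBytes    = 64;
constexpr int kBufPerMC     = 256 * 1024;                 // 256 KB per MC
constexpr int kPerCoreBytes = kBufPerMC / kCores;         // 8 KB per core
constexpr int kPerCoreLines = kPerCoreBytes / kLineBytes; // 128 lines
static_assert(kPerCoreLines == 128, "each core gets 128 buffered lines");
```

Across the four controllers, that is 1 MB of buffer capacity in total.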
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Evaluation Platform
• Cores: 32, at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32 KB L1I; 32 KB L1D; 1 MB L2 per core
• Memory: 16 GB DDR3-1600 with 4 memory channels
• GEMS simulator with GARNET
Evaluation Methodology
• Benchmarks:
  – Multi-programmed: SPEC CPU2006 (WL1 to WL5)
  – Multi-threaded: SPEC OMP2001 (WL6 and WL7)
• Metrics:
  – Harmonic mean IPC (see below)
  – Off-chip and on-chip latencies
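For reference, harmonic mean IPC over n cores is n / Σ(1/IPC_i); a minimal helper (the function name is ours):

```cpp
#include <vector>

// Harmonic mean IPC across cores: n / sum(1/IPC_i). Unlike the
// arithmetic mean, it is dragged down by any single slow core, which
// makes it a fairness-aware throughput metric.
double harmonicIPC(const std::vector<double>& ipc) {
    double denom = 0.0;
    for (double v : ipc) denom += 1.0 / v;
    return static_cast<double>(ipc.size()) / denom;
}
```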
IPC
[Figure: IPC improvement (%) for WL1 to WL7 and the average (AVG) under CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; annotations mark a best case of 33.2% and a 10% average improvement.]
Latency
[Figure: Memory access latency in cycles for WL1 to WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; the annotation marks a 48.5% reduction in off-chip latency.]
L2 Hit Rate
[Figure: L2 hit rate (%) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]
Row Buffer Hit Rate
[Figure: Row buffer hit rate (%) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Conclusion
• Proposed a new memory-side prefetcher
  – Opportunistic
  – Has instantaneous knowledge of memory state
• Prefetching midway
  – Does not pollute on-chip resources
• Reduces off-chip latency by 48.5% and improves performance by 6.2% on average
• Our technique can be combined with core-side prefetching to amplify the benefits
Thank You
• Questions?