apr 9, 2012
DESCRIPTION
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms. Apr 9, 2012 Heechul Yun + , Gang Yao + , Rodolfo Pellizzoni * , Marco Caccamo + , Lui Sha + + University of Illinois at Urbana-Champaign * University of Waterloo. - PowerPoint PPT PresentationTRANSCRIPT
MemGuard: Memory Bandwidth ReservationSystem for Efficient Performance Isolation in
Multi-core Platforms
Apr 9, 2012Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*,
Marco Caccamo+, Lui Sha+
+University of Illinois at Urbana-Champaign*University of Waterloo
Real-Time Applications
2
• Resource intensive real-time applications– Multimedia processing(*), real-time data analytic(**), object tracking
• Requirements– Need more performance and cost less Commercial Off-The Shelf (COTS) – Performance guarantee (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
Modern System-on-Chip (SoC)
• More cores– Freescale P4080 has 8 cores
• More sharing – Shared memory hierarchy (LLC, MC, DRAM)– Shared I/O channels
3
More performanceLess cost
But, isolation?
Problem: Shared Memory Hierarchy
• Shared hardware resources• OS has little control
Core1 Core2 Core3 Core4
DRAM
Part 1 Part 2 Part 3 Part 4
4
Memory Controller (MC)
Shared Last Level Cache (LLC) Space contention
Access contention
5
Memory Performance Isolation
• Q. How to guarantee worst-case performance?– Need to guarantee memory performance
Part 1 Part 2 Part 3 Part 4
Core1 Core2 Core3 Core4
DRAM
Memory Controller
LLC LLC LLC LLC
Inter-Core Memory Interference
• Significant slowdown (Up to 2x on 2 cores)• Slowdown is not proportional to memory bandwidth usage
6
Core
Shared Memory
foregroundX-axis
Intel Core2
L2 L2
437.leslie3d 462.libquantum 410.bwaves 471.omnetpp1.0
1.2
1.4
1.6
1.8
2.0
2.2
Slowdown ratio due to interference
(1.6GB/s) (1.5GB/s) (1.5GB/s) (1.4GB/s)
Core
background470.lbm
Runti
me
slow
dow
n
Core
Bank 4
Background: DRAM Chip
Row 1Row 2Row 3Row 4Row 5
Bank 1
Row Buffer
Bank 2Bank 3
activate
precharge
Read/write • State dependent access latency– Row miss: 19 cycles, Row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
READ (Bank 1, Row 3, Col 7)
Col7
Background: Memory Controller(MC)
• Request queue– Buffer read/write requests from CPU cores– Unpredictable queuing delay due to reordering
8
Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
Background: MC Queue Re-ordering
• Improve row hit ratio and throughput• Unpredictable queuing delay
9
Core1: READ Row 1, Col 1Core2: READ Row 2, Col 1Core1: READ Row 1, Col 2
Core1: READ Row 1, Col 1Core1: READ Row 1, Col 2Core2: READ Row 2, Col 1
DRAM DRAM
Initial Queue Reordered Queue
2 Row Switch 1 Row Switch
Challenges for Real-Time Systems• Memory controller(MC) level queuing delay– Main source of interference– Unpredictable (re-ordering)
• DRAM state dependent latency– DRAM row open/close state
• State of Art– Predictable DRAM controller h/w: [Akesson’07] [Paolieri’09]
[Reineke’11] not in COTS
10
Core1
Core2
Core3
Core4
DRAM
Memory ControllerMemory Controller
DRAM
Predictable Memory Controller
Our Approach
• OS level approach– Works on commodity h/w
– Guarantees performance of each core
– Maximizes memory performance if possible
11
Core1
Core2
Core3
Core4
DRAM
Memory ControllerMemory Controller
DRAM
OS control mechanism
12
Operating System
MemGuard
Core1 Core2 Core3 Core4PMC PMC PMC PMC
DRAM DIMM
MemGuard
Multicore ProcessorMemory Controller
• Memory bandwidth reservation and reclaiming
BWRegulator
BWRegulator
BWRegulator
BWRegulator0.9GB/s 0.1GB/s 0.1GB/s 0.1GB/s
Reclaim Manager
13
Memory Bandwidth Reservation• Idea
– OS monitor and enforce each core’s memory bandwidth usage
1ms 2ms0Dequeue tasks
Enqueue tasks
Dequeue tasks
Budget
Coreactivity
21
computation memory fetch
Memory Bandwidth Reservation
• Key Insight– B/W regulators control memory request rates – (Cores)request rate (DRAM) service rate (Memory
controller) minimal queuing delay
• System-wide reservation rule– up to the guaranteed bandwidth rmin
• m: #of cores• Bi: Core i’s b/w reservation
14
Guaranteed Bandwidth: rmin • Worst-case DRAM performance (service rate)
– All memory requests go to the same bank (no bank-level parallelism) and cause row miss
• Example (PC6400-DDR2*)– Peak B/W: 6.4GB/s
• 64bytes I/O = 10ns, hide command latency by interleaving– Calculated guaranteed B/W: 1.3GB/s
• PRE + ACT + RD + I/O (8x8bytes) = 47.5ns– Measured guaranteed B/W: 1.2GB/s
• Performance Isolation– Sum of memory b/w reservation guaranteed b/w
15(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
Memory Access Pattern
• Memory access patterns vary over time• Static reservation is inefficient
16
Time(ms)
Memory requests
Time(ms)
Memory requests
Memory Bandwidth Reclaiming
• Key objective– Redistribute excess bandwidth to demanding
cores– Improve memory b/w utilization
• Predictive bandwidth donation and reclaiming– Donate unneeded budget predictively– Reclaim on-demand basis
17
Reclaim Example• Time 0
– Initial budget 3 for both cores
• Time 3,4– Decrement budgets
• Time 10 (period 1)– Predictive donation (total 4)
• Time 12,15– Core 1: reclaim
• Time 16– Core 0: reclaim
• Time 17– Core 1: no remaining budget; dequeue tasks
• Time 20 (period 2)– Core 0 donates 2– Core 1 do not donate
18
Best-effort Bandwidth Sharing
• Key objective– Utilize best-effort bandwidth whenever possible
• Best-effort bandwidth– After all cores use their budgets (i.e., delivering guaranteed
bandwidth), before the next period begins
• Sharing policy– Maximize throughput– Broadcast all cores to continue to execute
19
Evaluation Platform
• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM– Prefetchers were turned off for evaluation– Power PC based P4080 (8core) and ARM based Exynos4412(4core) also has been ported
• Modified Linux kernel 3.6.0 + MemGuard kernel module– https://github.com/heechul/memguard/wiki/MemGuard
• Used the entire 29 benchmarks from SPEC2006 and synthetic benchmarks
20
Core 0
I D
L2 Cache
Intel Core2Quad
Core 1
I D
Core 2
I D
L2 Cache
Core 3
I D
Memory Controller
DRAM
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.libquantum(foreground)
memory hogs(background)
run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard
(reservation only)Memguard
(reclaim+share)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Nor
mal
ized
IPC
Foreground (462.libquantum)
C2
53% performance reduction
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.Libquantum(foreground)
run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard
(reservation only)Memguard
(reclaim+share)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Nor
mal
ized
IPC
Foreground (462.libquantum)
1GB/s
memory hogs(background)
C2
.2GB/s
Reservation provides performance isolation
Guaranteed performance
Evaluation Results
C0
Shared Memory
C2
Intel Core2
L2 L2
462.Libquantum(foreground)
run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard
(reservation only)Memguard
(reclaim+share)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Nor
mal
ized
IPC
Foreground (462.libquantum)
1GB/s
memory hogs(background)
C2
.2GB/s
Reclaiming and Sharing maximize performance
Guaranteed performance
SPEC2006 Results
24
470.lbm
437.leslie
3d
462.libquantum
410.bwaves
471.omnetpp
459.GemsFD
TD
482.sphinx3
429.mcf
450.soplex
433.milc
434.zeusm
p
483.xalancb
mk
436.cactu
sADM
403.gcc
456.hmmer
473.astar
401.bzip2
400.perlbench
447.dealII
454.calcu
lix
464.h264ref
445.gobmk
458.sjeng
435.gromacs
481.wrf
444.namd
465.tonto
416.gamess
453.povray
geomean0.0
0.5
1.0
1.5
2.0
2.5
3.0
foreground (X-axis) @1.0GB/s
Nor
mal
ized
IPC
Normalized to guaranteed performance
• Guarantee(soft) performance of foreground (x-axis)– W.r.t. 1.0GB/s memory b/w reservation
C0
Shared Memory
C2
Intel Core2
L2 L2
X-axis(foreground)
1GB/s
470.lbm(background)
C2.2GB/s
SPEC2006 Results
• Improve overall throughput– background: 368%, foreground(X-axis): 6%
25
470.lbm
437.leslie
3d
462.libquantum
410.bwaves
471.omnetpp
459.GemsFD
TD
482.sphinx3
429.mcf
450.soplex
433.milc
434.zeusm
p
483.xalancb
mk
436.cactu
sADM
403.gcc
456.hmmer
473.astar
401.bzip2
400.perlbench
447.dealII
454.calcu
lix
464.h264ref
445.gobmk
458.sjeng
435.gromacs
481.wrf
444.namd
465.tonto
416.gamess
453.povray
geomean0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
foreground (X-axis) @1.0GB/s background (470.lbm) @0.2GB/s
Nor
mal
ized
IPC Normalized to guaranteed performance
C0
Shared Memory
C2
Intel Core2
L2 L2
X-axis(foreground)
1GB/s
470.lbm(background)
C2.2GB/s
Conclusion
• Inter-Core Memory Interference – Big challenge for multi-core based real-time systems– Sources: queuing delay in MC, state dependent latency in
DRAM
• MemGuard– OS mechanism providing efficient per-core memory
performance isolation on COTS H/W– Memory bandwidth reservation and reclaiming support– https://github.com/heechul/memguard/wiki/MemGuard
26
Thank you.
27
Effect of Reclaim
• IPC improvement of background ([email protected]/s) is 3.8x• IPC reduction of foreground ([email protected]/s) is 3%
28
Reclaim Underrun Error
29
Effect of Spare Sharing
• IPC of background ([email protected]/s) improves 40%• IPC of foreground ([email protected]/s) also improves 9%
30
Isolation and Throughput Effect of rmin
• 4 core configuration
31
Isolation Effect of Reservation
• Sum b/w reservation < rmin (1.2GB/s) Isolation– 1.0GB/s(X-axis) + 0.2GB/s(lbm) = rmin
32
Isolation
Core 0: 1.0 GB/s for X-axis
Core 2: 0.2 – 2.0 GB/s for lbm
Solo [email protected]/s
Effect of MemGuard
• Soft real-time application on each core. • Provides differentiated memory bandwidth
– weight for each core=1:2:4:8 for the guaranteed b/w, spare bandwidth sharing is enabled
33
Hard/Soft Reservation on MemGuard
• Hard reservation (w/o reclaiming)– Can guarantee memory bandwidth Bi regardless of other cores
at each period– Wasted if not used
• Soft reservation (w/ reclaiming)– Misprediction can caused missed b/w guarantee at each period– Error rate is small---less than 5%.
• Selectively applicable on per-core basis
34