apr 9, 2012

MemGuard: Memory Bandwidth ReservationSystem for Efficient Performance Isolation in

Multi-core Platforms

Apr 9, 2012Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*,

Marco Caccamo+, Lui Sha+

+University of Illinois at Urbana-Champaign*University of Waterloo

Real-Time Applications

2

• Resource intensive real-time applications– Multimedia processing(*), real-time data analytic(**), object tracking

• Requirements– Need more performance and cost less Commercial Off-The Shelf (COTS) – Performance guarantee (i.e., temporal predictability and isolation)

(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012

Modern System-on-Chip (SoC)

• More cores– Freescale P4080 has 8 cores

• More sharing – Shared memory hierarchy (LLC, MC, DRAM)– Shared I/O channels

3

More performanceLess cost

But, isolation?

Problem: Shared Memory Hierarchy

• Shared hardware resources• OS has little control

Core1 Core2 Core3 Core4

DRAM

Part 1 Part 2 Part 3 Part 4

4

Memory Controller (MC)

Shared Last Level Cache (LLC) Space contention

Access contention

5

Memory Performance Isolation

• Q. How to guarantee worst-case performance?– Need to guarantee memory performance

Part 1 Part 2 Part 3 Part 4

Core1 Core2 Core3 Core4

DRAM

Memory Controller

LLC LLC LLC LLC

Inter-Core Memory Interference

• Significant slowdown (Up to 2x on 2 cores)• Slowdown is not proportional to memory bandwidth usage

6

Core

Shared Memory

foregroundX-axis

Intel Core2

L2 L2

437.leslie3d 462.libquantum 410.bwaves 471.omnetpp1.0

1.2

1.4

1.6

1.8

2.0

2.2

Slowdown ratio due to interference

(1.6GB/s) (1.5GB/s) (1.5GB/s) (1.4GB/s)

Core

background470.lbm

Runti

me

slow

dow

n

Core

Bank 4

Background: DRAM Chip

Row 1Row 2Row 3Row 4Row 5

Bank 1

Row Buffer

Bank 2Bank 3

activate

precharge

Read/write • State dependent access latency– Row miss: 19 cycles, Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

READ (Bank 1, Row 3, Col 7)

Col7

Background: Memory Controller(MC)

• Request queue– Buffer read/write requests from CPU cores– Unpredictable queuing delay due to reordering

8

Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.

Background: MC Queue Re-ordering

• Improve row hit ratio and throughput• Unpredictable queuing delay

9

Core1: READ Row 1, Col 1Core2: READ Row 2, Col 1Core1: READ Row 1, Col 2

Core1: READ Row 1, Col 1Core1: READ Row 1, Col 2Core2: READ Row 2, Col 1

DRAM DRAM

Initial Queue Reordered Queue

2 Row Switch 1 Row Switch

Challenges for Real-Time Systems• Memory controller(MC) level queuing delay– Main source of interference– Unpredictable (re-ordering)

• DRAM state dependent latency– DRAM row open/close state

• State of Art– Predictable DRAM controller h/w: [Akesson’07] [Paolieri’09]

[Reineke’11] not in COTS

10

Core1

Core2

Core3

Core4

DRAM

Memory ControllerMemory Controller

DRAM

Predictable Memory Controller

Our Approach

• OS level approach– Works on commodity h/w

– Guarantees performance of each core

– Maximizes memory performance if possible

11

Core1

Core2

Core3

Core4

DRAM

Memory ControllerMemory Controller

DRAM

OS control mechanism

12

Operating System

MemGuard

Core1 Core2 Core3 Core4PMC PMC PMC PMC

DRAM DIMM

MemGuard

Multicore ProcessorMemory Controller

• Memory bandwidth reservation and reclaiming

BWRegulator

BWRegulator

BWRegulator

BWRegulator0.9GB/s 0.1GB/s 0.1GB/s 0.1GB/s

Reclaim Manager

13

Memory Bandwidth Reservation• Idea

– OS monitor and enforce each core’s memory bandwidth usage

1ms 2ms0Dequeue tasks

Enqueue tasks

Dequeue tasks

Budget

Coreactivity

21

computation memory fetch

Memory Bandwidth Reservation

• Key Insight– B/W regulators control memory request rates – (Cores)request rate (DRAM) service rate (Memory

controller) minimal queuing delay

• System-wide reservation rule– up to the guaranteed bandwidth rmin

• m: #of cores• Bi: Core i’s b/w reservation

14

Guaranteed Bandwidth: rmin • Worst-case DRAM performance (service rate)

– All memory requests go to the same bank (no bank-level parallelism) and cause row miss

• Example (PC6400-DDR2*)– Peak B/W: 6.4GB/s

• 64bytes I/O = 10ns, hide command latency by interleaving– Calculated guaranteed B/W: 1.3GB/s

• PRE + ACT + RD + I/O (8x8bytes) = 47.5ns– Measured guaranteed B/W: 1.2GB/s

• Performance Isolation– Sum of memory b/w reservation guaranteed b/w

15(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

Memory Access Pattern

• Memory access patterns vary over time• Static reservation is inefficient

16

Time(ms)

Memory requests

Time(ms)

Memory requests

Memory Bandwidth Reclaiming

• Key objective– Redistribute excess bandwidth to demanding

cores– Improve memory b/w utilization

• Predictive bandwidth donation and reclaiming– Donate unneeded budget predictively– Reclaim on-demand basis

17

Reclaim Example• Time 0

– Initial budget 3 for both cores

• Time 3,4– Decrement budgets

• Time 10 (period 1)– Predictive donation (total 4)

• Time 12,15– Core 1: reclaim

• Time 16– Core 0: reclaim

• Time 17– Core 1: no remaining budget; dequeue tasks

• Time 20 (period 2)– Core 0 donates 2– Core 1 do not donate

18

Best-effort Bandwidth Sharing

• Key objective– Utilize best-effort bandwidth whenever possible

• Best-effort bandwidth– After all cores use their budgets (i.e., delivering guaranteed

bandwidth), before the next period begins

• Sharing policy– Maximize throughput– Broadcast all cores to continue to execute

19

Evaluation Platform

• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM– Prefetchers were turned off for evaluation– Power PC based P4080 (8core) and ARM based Exynos4412(4core) also has been ported

• Modified Linux kernel 3.6.0 + MemGuard kernel module– https://github.com/heechul/memguard/wiki/MemGuard

• Used the entire 29 benchmarks from SPEC2006 and synthetic benchmarks

20

Core 0

I D

L2 Cache

Intel Core2Quad

Core 1

I D

Core 2

I D

L2 Cache

Core 3

I D

Memory Controller

DRAM

https://github.com/heechul/memguard/wiki/MemGuard



Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.libquantum(foreground)

memory hogs(background)

run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard

(reservation only)Memguard

(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC

Foreground (462.libquantum)

C2

53% performance reduction

Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.Libquantum(foreground)



(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC


1GB/s


C2

.2GB/s

Reservation provides performance isolation

Guaranteed performance

Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.Libquantum(foreground)



(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC


1GB/s


C2

.2GB/s

Reclaiming and Sharing maximize performance

Guaranteed performance

SPEC2006 Results

24

470.lbm

437.leslie

3d

462.libquantum

410.bwaves

471.omnetpp

459.GemsFD

TD

482.sphinx3

429.mcf

450.soplex

433.milc

434.zeusm

p

483.xalancb

mk

436.cactu

sADM

403.gcc

456.hmmer

473.astar

401.bzip2

400.perlbench

447.dealII

454.calcu

lix

464.h264ref

445.gobmk

458.sjeng

435.gromacs

481.wrf

444.namd

465.tonto

416.gamess

453.povray

geomean0.0

0.5

1.0

1.5

2.0

2.5

3.0

foreground (X-axis) @1.0GB/s

Nor

mal

ized

IPC

Normalized to guaranteed performance

• Guarantee(soft) performance of foreground (x-axis)– W.r.t. 1.0GB/s memory b/w reservation

C0

Shared Memory

C2

Intel Core2

L2 L2

X-axis(foreground)

1GB/s

470.lbm(background)

C2.2GB/s

SPEC2006 Results

• Improve overall throughput– background: 368%, foreground(X-axis): 6%

25

470.lbm

437.leslie

3d

462.libquantum

410.bwaves

471.omnetpp

459.GemsFD

TD

482.sphinx3

429.mcf

450.soplex

433.milc

434.zeusm

p

483.xalancb

mk

436.cactu

sADM

403.gcc

456.hmmer

473.astar

401.bzip2

400.perlbench

447.dealII

454.calcu

lix

464.h264ref

445.gobmk

458.sjeng

435.gromacs

481.wrf

444.namd

465.tonto

416.gamess

453.povray

geomean0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

foreground (X-axis) @1.0GB/s background (470.lbm) @0.2GB/s

Nor

mal

ized

IPC Normalized to guaranteed performance

C0

Shared Memory

C2

Intel Core2

L2 L2

X-axis(foreground)

1GB/s

470.lbm(background)

C2.2GB/s

Conclusion

• Inter-Core Memory Interference – Big challenge for multi-core based real-time systems– Sources: queuing delay in MC, state dependent latency in

DRAM

• MemGuard– OS mechanism providing efficient per-core memory

performance isolation on COTS H/W– Memory bandwidth reservation and reclaiming support– https://github.com/heechul/memguard/wiki/MemGuard

26



Thank you.

27

Effect of Reclaim

• IPC improvement of background ([email protected]/s) is 3.8x• IPC reduction of foreground ([email protected]/s) is 3%

28

mailto:[email protected]/s


Reclaim Underrun Error

29

Effect of Spare Sharing

• IPC of background ([email protected]/s) improves 40%• IPC of foreground ([email protected]/s) also improves 9%

30



Isolation and Throughput Effect of rmin

• 4 core configuration

31

Isolation Effect of Reservation

• Sum b/w reservation < rmin (1.2GB/s) Isolation– 1.0GB/s(X-axis) + 0.2GB/s(lbm) = rmin

32

Isolation

Core 0: 1.0 GB/s for X-axis

Core 2: 0.2 – 2.0 GB/s for lbm

Solo [email protected]/s

Effect of MemGuard

• Soft real-time application on each core. • Provides differentiated memory bandwidth

– weight for each core=1:2:4:8 for the guaranteed b/w, spare bandwidth sharing is enabled

33

Hard/Soft Reservation on MemGuard

• Hard reservation (w/o reclaiming)– Can guarantee memory bandwidth Bi regardless of other cores

at each period– Wasted if not used

• Soft reservation (w/ reclaiming)– Misprediction can caused missed b/w guarantee at each period– Error rate is small---less than 5%.

• Selectively applicable on per-core basis

34

apr 9, 2012

Documents