apr 9, 2012

34
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms Apr 9, 2012 Heechul Yun + , Gang Yao + , Rodolfo Pellizzoni * , Marco Caccamo + , Lui Sha + + University of Illinois at Urbana-Champaign * University of Waterloo

Upload: cosima

Post on 23-Feb-2016

43 views

Category:

Documents


0 download

DESCRIPTION

MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms. Apr 9, 2012 Heechul Yun + , Gang Yao + , Rodolfo Pellizzoni * , Marco Caccamo + , Lui Sha + + University of Illinois at Urbana-Champaign * University of Waterloo. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Apr 9, 2012

MemGuard: Memory Bandwidth ReservationSystem for Efficient Performance Isolation in

Multi-core Platforms

Apr 9, 2012Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*,

Marco Caccamo+, Lui Sha+

+University of Illinois at Urbana-Champaign*University of Waterloo

Page 2: Apr 9, 2012

Real-Time Applications

2

• Resource intensive real-time applications– Multimedia processing(*), real-time data analytic(**), object tracking

• Requirements– Need more performance and cost less Commercial Off-The Shelf (COTS) – Performance guarantee (i.e., temporal predictability and isolation)

(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012

Page 3: Apr 9, 2012

Modern System-on-Chip (SoC)

• More cores– Freescale P4080 has 8 cores

• More sharing – Shared memory hierarchy (LLC, MC, DRAM)– Shared I/O channels

3

More performanceLess cost

But, isolation?

Page 4: Apr 9, 2012

Problem: Shared Memory Hierarchy

• Shared hardware resources• OS has little control

Core1 Core2 Core3 Core4

DRAM

Part 1 Part 2 Part 3 Part 4

4

Memory Controller (MC)

Shared Last Level Cache (LLC) Space contention

Access contention

Page 5: Apr 9, 2012

5

Memory Performance Isolation

• Q. How to guarantee worst-case performance?– Need to guarantee memory performance

Part 1 Part 2 Part 3 Part 4

Core1 Core2 Core3 Core4

DRAM

Memory Controller

LLC LLC LLC LLC

Page 6: Apr 9, 2012

Inter-Core Memory Interference

• Significant slowdown (Up to 2x on 2 cores)• Slowdown is not proportional to memory bandwidth usage

6

Core

Shared Memory

foregroundX-axis

Intel Core2

L2 L2

437.leslie3d 462.libquantum 410.bwaves 471.omnetpp1.0

1.2

1.4

1.6

1.8

2.0

2.2

Slowdown ratio due to interference

(1.6GB/s) (1.5GB/s) (1.5GB/s) (1.4GB/s)

Core

background470.lbm

Runti

me

slow

dow

n

Core

Page 7: Apr 9, 2012

Bank 4

Background: DRAM Chip

Row 1Row 2Row 3Row 4Row 5

Bank 1

Row Buffer

Bank 2Bank 3

activate

precharge

Read/write • State dependent access latency– Row miss: 19 cycles, Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

READ (Bank 1, Row 3, Col 7)

Col7

Page 8: Apr 9, 2012

Background: Memory Controller(MC)

• Request queue– Buffer read/write requests from CPU cores– Unpredictable queuing delay due to reordering

8

Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.

Page 9: Apr 9, 2012

Background: MC Queue Re-ordering

• Improve row hit ratio and throughput• Unpredictable queuing delay

9

Core1: READ Row 1, Col 1Core2: READ Row 2, Col 1Core1: READ Row 1, Col 2

Core1: READ Row 1, Col 1Core1: READ Row 1, Col 2Core2: READ Row 2, Col 1

DRAM DRAM

Initial Queue Reordered Queue

2 Row Switch 1 Row Switch

Page 10: Apr 9, 2012

Challenges for Real-Time Systems• Memory controller(MC) level queuing delay– Main source of interference– Unpredictable (re-ordering)

• DRAM state dependent latency– DRAM row open/close state

• State of Art– Predictable DRAM controller h/w: [Akesson’07] [Paolieri’09]

[Reineke’11] not in COTS

10

Core1

Core2

Core3

Core4

DRAM

Memory ControllerMemory Controller

DRAM

Predictable Memory Controller

Page 11: Apr 9, 2012

Our Approach

• OS level approach– Works on commodity h/w

– Guarantees performance of each core

– Maximizes memory performance if possible

11

Core1

Core2

Core3

Core4

DRAM

Memory ControllerMemory Controller

DRAM

OS control mechanism

Page 12: Apr 9, 2012

12

Operating System

MemGuard

Core1 Core2 Core3 Core4PMC PMC PMC PMC

DRAM DIMM

MemGuard

Multicore ProcessorMemory Controller

• Memory bandwidth reservation and reclaiming

BWRegulator

BWRegulator

BWRegulator

BWRegulator0.9GB/s 0.1GB/s 0.1GB/s 0.1GB/s

Reclaim Manager

Page 13: Apr 9, 2012

13

Memory Bandwidth Reservation• Idea

– OS monitor and enforce each core’s memory bandwidth usage

1ms 2ms0Dequeue tasks

Enqueue tasks

Dequeue tasks

Budget

Coreactivity

21

computation memory fetch

Page 14: Apr 9, 2012

Memory Bandwidth Reservation

• Key Insight– B/W regulators control memory request rates – (Cores)request rate (DRAM) service rate (Memory

controller) minimal queuing delay

• System-wide reservation rule– up to the guaranteed bandwidth rmin

• m: #of cores• Bi: Core i’s b/w reservation

14

Page 15: Apr 9, 2012

Guaranteed Bandwidth: rmin • Worst-case DRAM performance (service rate)

– All memory requests go to the same bank (no bank-level parallelism) and cause row miss

• Example (PC6400-DDR2*)– Peak B/W: 6.4GB/s

• 64bytes I/O = 10ns, hide command latency by interleaving– Calculated guaranteed B/W: 1.3GB/s

• PRE + ACT + RD + I/O (8x8bytes) = 47.5ns– Measured guaranteed B/W: 1.2GB/s

• Performance Isolation– Sum of memory b/w reservation guaranteed b/w

15(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

Page 16: Apr 9, 2012

Memory Access Pattern

• Memory access patterns vary over time• Static reservation is inefficient

16

Time(ms)

Memory requests

Time(ms)

Memory requests

Page 17: Apr 9, 2012

Memory Bandwidth Reclaiming

• Key objective– Redistribute excess bandwidth to demanding

cores– Improve memory b/w utilization

• Predictive bandwidth donation and reclaiming– Donate unneeded budget predictively– Reclaim on-demand basis

17

Page 18: Apr 9, 2012

Reclaim Example• Time 0

– Initial budget 3 for both cores

• Time 3,4– Decrement budgets

• Time 10 (period 1)– Predictive donation (total 4)

• Time 12,15– Core 1: reclaim

• Time 16– Core 0: reclaim

• Time 17– Core 1: no remaining budget; dequeue tasks

• Time 20 (period 2)– Core 0 donates 2– Core 1 do not donate

18

Page 19: Apr 9, 2012

Best-effort Bandwidth Sharing

• Key objective– Utilize best-effort bandwidth whenever possible

• Best-effort bandwidth– After all cores use their budgets (i.e., delivering guaranteed

bandwidth), before the next period begins

• Sharing policy– Maximize throughput– Broadcast all cores to continue to execute

19

Page 20: Apr 9, 2012

Evaluation Platform

• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM– Prefetchers were turned off for evaluation– Power PC based P4080 (8core) and ARM based Exynos4412(4core) also has been ported

• Modified Linux kernel 3.6.0 + MemGuard kernel module– https://github.com/heechul/memguard/wiki/MemGuard

• Used the entire 29 benchmarks from SPEC2006 and synthetic benchmarks

20

Core 0

I D

L2 Cache

Intel Core2Quad

Core 1

I D

Core 2

I D

L2 Cache

Core 3

I D

Memory Controller

DRAM

Page 21: Apr 9, 2012

Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.libquantum(foreground)

memory hogs(background)

run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard

(reservation only)Memguard

(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC

Foreground (462.libquantum)

C2

53% performance reduction

Page 22: Apr 9, 2012

Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.Libquantum(foreground)

run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard

(reservation only)Memguard

(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC

Foreground (462.libquantum)

1GB/s

memory hogs(background)

C2

.2GB/s

Reservation provides performance isolation

Guaranteed performance

Page 23: Apr 9, 2012

Evaluation Results

C0

Shared Memory

C2

Intel Core2

L2 L2

462.Libquantum(foreground)

run-alone co-run run-alone co-run run-alone co-runw/o Memguard Memguard

(reservation only)Memguard

(reclaim+share)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

IPC

Foreground (462.libquantum)

1GB/s

memory hogs(background)

C2

.2GB/s

Reclaiming and Sharing maximize performance

Guaranteed performance

Page 24: Apr 9, 2012

SPEC2006 Results

24

470.lbm

437.leslie

3d

462.libquantum

410.bwaves

471.omnetpp

459.GemsFD

TD

482.sphinx3

429.mcf

450.soplex

433.milc

434.zeusm

p

483.xalancb

mk

436.cactu

sADM

403.gcc

456.hmmer

473.astar

401.bzip2

400.perlbench

447.dealII

454.calcu

lix

464.h264ref

445.gobmk

458.sjeng

435.gromacs

481.wrf

444.namd

465.tonto

416.gamess

453.povray

geomean0.0

0.5

1.0

1.5

2.0

2.5

3.0

foreground (X-axis) @1.0GB/s

Nor

mal

ized

IPC

Normalized to guaranteed performance

• Guarantee(soft) performance of foreground (x-axis)– W.r.t. 1.0GB/s memory b/w reservation

C0

Shared Memory

C2

Intel Core2

L2 L2

X-axis(foreground)

1GB/s

470.lbm(background)

C2.2GB/s

Page 25: Apr 9, 2012

SPEC2006 Results

• Improve overall throughput– background: 368%, foreground(X-axis): 6%

25

470.lbm

437.leslie

3d

462.libquantum

410.bwaves

471.omnetpp

459.GemsFD

TD

482.sphinx3

429.mcf

450.soplex

433.milc

434.zeusm

p

483.xalancb

mk

436.cactu

sADM

403.gcc

456.hmmer

473.astar

401.bzip2

400.perlbench

447.dealII

454.calcu

lix

464.h264ref

445.gobmk

458.sjeng

435.gromacs

481.wrf

444.namd

465.tonto

416.gamess

453.povray

geomean0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

foreground (X-axis) @1.0GB/s background (470.lbm) @0.2GB/s

Nor

mal

ized

IPC Normalized to guaranteed performance

C0

Shared Memory

C2

Intel Core2

L2 L2

X-axis(foreground)

1GB/s

470.lbm(background)

C2.2GB/s

Page 26: Apr 9, 2012

Conclusion

• Inter-Core Memory Interference – Big challenge for multi-core based real-time systems– Sources: queuing delay in MC, state dependent latency in

DRAM

• MemGuard– OS mechanism providing efficient per-core memory

performance isolation on COTS H/W– Memory bandwidth reservation and reclaiming support– https://github.com/heechul/memguard/wiki/MemGuard

26

Page 27: Apr 9, 2012

Thank you.

27

Page 28: Apr 9, 2012

Effect of Reclaim

• IPC improvement of background ([email protected]/s) is 3.8x• IPC reduction of foreground ([email protected]/s) is 3%

28

Page 29: Apr 9, 2012

Reclaim Underrun Error

29

Page 30: Apr 9, 2012

Effect of Spare Sharing

• IPC of background ([email protected]/s) improves 40%• IPC of foreground ([email protected]/s) also improves 9%

30

Page 31: Apr 9, 2012

Isolation and Throughput Effect of rmin

• 4 core configuration

31

Page 32: Apr 9, 2012

Isolation Effect of Reservation

• Sum b/w reservation < rmin (1.2GB/s) Isolation– 1.0GB/s(X-axis) + 0.2GB/s(lbm) = rmin

32

Isolation

Core 0: 1.0 GB/s for X-axis

Core 2: 0.2 – 2.0 GB/s for lbm

Solo [email protected]/s

Page 33: Apr 9, 2012

Effect of MemGuard

• Soft real-time application on each core. • Provides differentiated memory bandwidth

– weight for each core=1:2:4:8 for the guaranteed b/w, spare bandwidth sharing is enabled

33

Page 34: Apr 9, 2012

Hard/Soft Reservation on MemGuard

• Hard reservation (w/o reclaiming)– Can guarantee memory bandwidth Bi regardless of other cores

at each period– Wasted if not used

• Soft reservation (w/ reclaiming)– Misprediction can caused missed b/w guarantee at each period– Error rate is small---less than 5%.

• Selectively applicable on per-core basis

34