
Page 1:

Jiang Lin1, Qingda Lu2, Xiaoning Ding2, Zhao Zhang1, Xiaodong Zhang2, and P. Sadayappan2

Gaining Insights into Multi-Core Cache Partitioning:

Bridging the Gap between Simulation and Real Systems

1 Department of ECE

Iowa State University

2 Department of CSE

The Ohio State University

Page 2:

Shared Caches Can Be a Critical Bottleneck in Multi-Core Processors

L2/L3 caches are shared by multiple cores:

Intel Xeon 51xx (2 cores per L2), AMD Barcelona (4 cores per L3), Sun T2 (8 cores per L2), ...

Effective cache partitioning is critical to addressing the bottleneck caused by conflicting accesses in shared caches.

Several hardware cache partitioning methods have been proposed with different optimization objectives:

Performance: [HPCA'02], [HPCA'04], [Micro'06]
Fairness: [PACT'04], [ICS'07], [SIGMETRICS'07]
QoS: [ICS'04], [ISCA'07]

[Diagram: multiple cores sharing one L2/L3 cache]

Page 3:

Limitations of Simulation-Based Studies

Excessive simulation time: whole programs cannot be evaluated; it would take several weeks or months to complete a single SPEC CPU2006 benchmark. As the number of cores continues to increase, simulation capability becomes even more limited.

Absence of long-term OS activities: interactions between the processor and the OS affect performance significantly.

Proneness to simulation inaccuracy: bugs in the simulator, and the impossibility of modeling many dynamics and details of the real system.

Page 4:

Our Approach to Address the Issues

Design and implement OS-based cache partitioning: embed the cache partitioning mechanism in the OS by enhancing the page coloring technique, supporting both static and dynamic cache partitioning.

Evaluate cache partitioning policies on commodity processors: an execution- and measurement-based approach, running applications to completion and measuring performance with hardware counters.

Page 5:

Four Questions to Answer

Can we confirm the conclusions made by the simulation-based studies?

Can we provide new insights and findings that simulation is not able to provide?

Can we make a case for our OS-based approach as an effective option to evaluate multicore cache partitioning designs?

What are the advantages and disadvantages of OS-based cache partitioning?

Page 6:

Outline

Introduction
Design and implementation of OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results
Conclusion

Page 7:

OS-Based Cache Partitioning Mechanisms

Static cache partitioning: predetermines the amount of cache blocks allocated to each program at the beginning of its execution. Page coloring enhancement: divides the shared cache into multiple regions and partitions the cache regions through OS page address mapping.

Dynamic cache partitioning: adjusts cache quotas among processes dynamically. Page re-coloring: dynamically changes a process's cache usage through OS page address re-mapping.

Page 8:

Page Coloring

[Diagram: address translation maps a virtual address (virtual page number, page offset) to a physical address (physical page number, page offset); the physically indexed cache splits the physical address into cache tag, set index, and block offset. The page color bits are the bits shared by the physical page number and the set index, and the OS controls them through address mapping.]

• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
• The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits.
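To make the mapping concrete, here is a minimal sketch (not from the talk) of how a page's color can be derived from a physical address. The 4MB, 16-way L2 matches the platform described later; the 4KB page size is an assumption, and the grouping of the resulting hardware colors into the 16 colors used in the experiments is ours.

/* Sketch: derive a page's color for a physically indexed, set-associative
 * cache. Parameters are illustrative: a 4 MB, 16-way L2 with 4 KB pages
 * gives (4 MB / 16) / 4 KB = 64 hardware page colors; the talk manages
 * the L2 as 16 colors, presumably by grouping hardware colors. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE    (4ull << 20)   /* 4 MB shared L2 (from the talk) */
#define ASSOCIATIVITY 16ull          /* 16-way (from the talk)         */
#define PAGE_SIZE     (4ull << 10)   /* 4 KB pages (assumed)           */

#define WAY_SIZE   (CACHE_SIZE / ASSOCIATIVITY)  /* bytes covered by the set index */
#define NUM_COLORS (WAY_SIZE / PAGE_SIZE)        /* pages per way = page colors    */

/* The color is the part of the set index that falls inside the physical
 * page number, i.e. the low bits of the physical page number. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("hardware page colors: %llu\n", (unsigned long long)NUM_COLORS);
    printf("color(0x12345000) = %u\n", page_color(0x12345000ull));
    return 0;
}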

Page 9:

Enhancement for Static Cache Partitioning

[Diagram: physical pages are grouped into page bins according to their page color (1, 2, 3, 4, ..., i, i+1, i+2, ...); through OS address mapping, Process 1 and Process 2 are each given disjoint subsets of the bins (colors), so their pages land in disjoint regions of the physically indexed cache.]

Shared cache is partitioned between two processes through address mapping.

Cost: Main memory space needs to be partitioned too (co-partitioning).
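A minimal user-level sketch of this idea, with data structures and names that are our own rather than the authors' kernel code: each process owns a subset of colors, and its pages are served only from the free-page bins of those colors, which is also why main memory ends up co-partitioned.

/* Sketch of color-constrained page allocation for static cache
 * partitioning. Illustrative only; not the authors' kernel implementation. */
#include <stddef.h>

#define NUM_COLORS 16                 /* the talk divides the L2 into 16 colors */

struct page { struct page *next; };   /* minimal physical-page descriptor */

/* One free list ("bin") per page color, assumed to be populated at boot. */
static struct page *free_bins[NUM_COLORS];

/* A static partition gives a process a fixed, contiguous range of colors. */
struct partition { int first_color; int num_colors; };

/* Allocate a page for the process, rotating over the colors it owns so
 * pages are spread evenly across them. Returns NULL when the process's
 * share of memory is exhausted -- the cost of co-partitioning memory. */
static struct page *alloc_colored_page(struct partition *p)
{
    static int rotor = 0;
    for (int i = 0; i < p->num_colors; i++) {
        int color = p->first_color + (rotor + i) % p->num_colors;
        if (free_bins[color] != NULL) {
            struct page *pg = free_bins[color];
            free_bins[color] = pg->next;
            rotor = (rotor + i + 1) % p->num_colors;
            return pg;
        }
    }
    return NULL;
}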

Page 10:

Dynamic Cache Partitioning

Why? Programs have dynamic behavior, and most proposed schemes are dynamic.

How? Page re-coloring.

How to handle the overhead? Measure it with performance counters and remove it from the results (to emulate hardware schemes).

Page 11:

Dynamic Cache Partitioning through Page Re-Coloring

[Diagram: a per-process page-links table with one linked list per color (0, 1, 2, 3, ..., N-1); a pointer marks the colors currently allocated to the process.]

Page re-coloring:
Allocate a page in the new color
Copy the memory contents
Free the old page

Pages of a process are organized into linked lists by their colors.

Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot points.
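The three re-coloring steps can be sketched as follows. This is a self-contained simulation under our own assumptions (a toy page table and per-color free lists), not the kernel implementation.

/* Sketch of page re-coloring: allocate a page in the new color, copy the
 * contents, update the mapping, free the old page. */
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE  4096
#define NUM_COLORS 16

struct page { struct page *next; int color; unsigned char data[PAGE_SIZE]; };

static struct page *free_bins[NUM_COLORS];   /* free pages, one list per color */

static struct page *alloc_page_of_color(int color)
{
    struct page *pg = free_bins[color];
    if (pg) free_bins[color] = pg->next;
    return pg;
}

static void free_page(struct page *pg)
{
    pg->next = free_bins[pg->color];
    free_bins[pg->color] = pg;
}

/* Toy "page table": virtual page number -> physical page, for one process. */
static struct page *page_table[1024];

/* Re-color one virtual page: the three steps named on the slide. */
static int recolor_page(int vpn, int new_color)
{
    struct page *old_pg = page_table[vpn];
    struct page *new_pg = alloc_page_of_color(new_color);   /* 1. allocate */
    if (new_pg == NULL)
        return -1;
    memcpy(new_pg->data, old_pg->data, PAGE_SIZE);           /* 2. copy    */
    page_table[vpn] = new_pg;                                 /* re-map     */
    free_page(old_pg);                                        /* 3. free    */
    return 0;
}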

Page 12:

Control the Page Migration Overhead

Control the frequency of page migration: frequent enough to capture application phase changes, but not so often that it introduces a large page migration overhead.

Lazy migration: avoid unnecessary page migration. Observation: not all pages are accessed between two of their migrations. Optimization: do not migrate a page until it is accessed.
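One plausible way to realize lazy migration, sketched with assumed helper names (the real mechanism lives in the OS fault-handling path): when a page's target color changes, only mark it and unmap it, and perform the copy on the next access, so pages that are never touched are never migrated.

/* Sketch of lazy page migration (our illustration): instead of copying a
 * page as soon as its color changes, mark it stale; the copy happens only
 * when the page is actually accessed again. */
#include <stdbool.h>

#define NUM_VPAGES 1024

struct vpage_state {
    int  target_color;    /* color the page should move to  */
    bool needs_recolor;   /* set when the partition changes */
};

static struct vpage_state vstate[NUM_VPAGES];

/* Assumed helpers: recolor_page as in the sketch above; unmap forces a
 * fault on the next access to this virtual page. */
int  recolor_page(int vpn, int new_color);
void unmap_virtual_page(int vpn);

/* Called when the partitioning policy changes a process's colors. */
static void schedule_recolor(int vpn, int new_color)
{
    vstate[vpn].target_color  = new_color;
    vstate[vpn].needs_recolor = true;
    unmap_virtual_page(vpn);           /* no copy yet: lazy */
}

/* Called from the page-fault path on the first access after the change. */
static void on_page_access(int vpn)
{
    if (vstate[vpn].needs_recolor) {
        recolor_page(vpn, vstate[vpn].target_color);  /* migrate now */
        vstate[vpn].needs_recolor = false;
    }
}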

Page 13:

Lazy Page Migration

[Diagram: per-process page-links lists, one per color (0 through N-1); pages in colors no longer allocated to the process are highlighted with the callout "Avoid unnecessary page migration for these pages!"]

After the optimization: on average, 2% page migration overhead (up to 7%).

Page 14:

Outline

Introduction
Design and implementation of OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results
Conclusion

Page 15:

Experimental Environment

Dell PowerEdge 1950: two-way SMP with Intel dual-core Xeon 5160 processors; shared 4MB L2 cache, 16-way; 8GB Fully Buffered DIMM memory.

Red Hat Enterprise Linux 4.0 with the 2.6.20.3 kernel; performance counter tool from HP (pfmon); the L2 cache is divided into 16 colors.

Page 16:

Benchmark Classification

Is it sensitive to L2 cache capacity?
Red group: IPC(1MB L2) / IPC(4MB L2) < 80%. Giving red benchmarks more cache yields a big performance gain.
Yellow group: 80% <= IPC(1MB L2) / IPC(4MB L2) < 95%. Giving yellow benchmarks more cache yields a moderate performance gain.

Otherwise: does it access the L2 cache extensively?
Green group: >= 14 accesses per 1K cycles. Give it a small cache.
Black group: < 14 accesses per 1K cycles. Cache insensitive.

29 benchmarks from SPEC CPU2006: 6 red, 9 yellow, 6 green, and 8 black.
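Restating the classification rule compactly; the thresholds come from the slide, while the struct and function names are ours:

/* Sketch: classify a benchmark by L2 capacity sensitivity and L2 access
 * intensity, using the thresholds from the slide. */
enum group { RED, YELLOW, GREEN, BLACK };

struct bench_profile {
    double ipc_1mb;                 /* IPC with a 1 MB L2 */
    double ipc_4mb;                 /* IPC with a 4 MB L2 */
    double l2_accesses_per_kcycle;  /* L2 accesses per 1K cycles */
};

static enum group classify(const struct bench_profile *b)
{
    double ratio = b->ipc_1mb / b->ipc_4mb;
    if (ratio < 0.80)
        return RED;      /* highly capacity-sensitive: big gain from more cache */
    if (ratio < 0.95)
        return YELLOW;   /* moderately capacity-sensitive */
    if (b->l2_accesses_per_kcycle >= 14.0)
        return GREEN;    /* insensitive but cache-hungry: give it a small cache */
    return BLACK;        /* cache-insensitive */
}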

Page 17:

Workload Construction

[Table: pairing the 6 red, 9 yellow, and 6 green benchmarks into two-core workloads: RR (3 pairs), RY (6 pairs), RG (6 pairs), YY (3 pairs), YG (6 pairs), and GG (3 pairs)]

27 workloads: representative benchmark combinations

Page 18:

Outline

Introduction
OS-based cache partitioning mechanisms
Evaluation environment and workload construction
Cache partitioning policies and their results
  Performance
  Fairness
Conclusion

Page 19:

Performance – Metrics

Divide metrics into evaluation metrics and policy metrics [PACT'06].

Evaluation metrics: the optimization objectives; not always available at run time.

Policy metrics: used to drive dynamic partitioning policies and available at run time, e.g., sum of IPCs, combined cache miss rate, and combined cache misses.
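One common way to write the listed policy metrics for a two-core workload (our notation; the slides give no formulas):

% Policy metrics for a two-core workload (our notation, not from the slides):
\begin{align*}
\text{Sum of IPCs} &= \mathrm{IPC}_1 + \mathrm{IPC}_2 \\
\text{Combined miss rate} &= \frac{\mathrm{Misses}_1 + \mathrm{Misses}_2}{\mathrm{Accesses}_1 + \mathrm{Accesses}_2} \\
\text{Combined misses} &= \mathrm{Misses}_1 + \mathrm{Misses}_2
\end{align*}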

Page 20:

Static Partitioning

Total number of cache colors: 16. Give at least two colors to each program, and make sure that each program gets 1GB of memory to avoid swapping (because of co-partitioning).

Try all possible partitionings for all workloads: (2:14), (3:13), (4:12), ..., (8:8), ..., (13:3), (14:2). Get the value of each evaluation metric, and compare the performance of all partitionings with the performance of the shared cache.

Page 21:

Performance – Optimal Static Partitioning

[Chart: performance gain of optimal static partitioning (1.00 to 1.25) over the shared cache for RR, RY, RG, YY, YG, and GG workloads, under four evaluation metrics: throughput, average weighted speedup, normalized SMT speedup, and fair speedup]

Confirms that cache partitioning has a significant performance impact.

Different evaluation metrics show different performance gains.

RG-type workloads have the largest performance gains (up to 47%).

The other workload types also show performance gains (2% to 10%).

Page 22:

A New Finding

Workload RG1: 401.bzip2 (Red) + 410.bwaves (Green)

Intuitively, giving more cache space to 401.bzip2 (Red) should increase the performance of 401.bzip2 significantly and decrease the performance of 410.bwaves (Green) slightly.

However, we observe that the performance of both benchmarks increases.

Page 23:

Insight into Our Finding

[Charts: memory bandwidth utilization (GB/s, roughly 2.70 to 3.05) and average memory access latency (ns, roughly 140 to 156), measured across partitionings from 2:14 through 14:2]

Page 24:

Insight into Our Finding

We have the same observation in workloads RG4, RG5, and YG5.

This was not observed in simulation: the simulators did not model the main memory subsystem in detail and assumed a fixed memory access latency.

This shows the advantage of our execution- and measurement-based study.

Page 25:

Performance - Dynamic Partition Policy

Init: partition the cache as (8:8).
Repeat until the workload finishes:
  Run the current partitioning (P0:P1) for one epoch.
  Try one epoch for each of the two neighboring partitionings: (P0-1 : P1+1) and (P0+1 : P1-1).
  Choose the partitioning with the best policy-metric measurement as the next partitioning.

A simple greedy policy, emulating the policy of [HPCA'02].
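A sketch of this greedy loop; run_epoch and workload_finished are assumed helpers standing in for the hardware-counter measurements and the page re-coloring described earlier, and higher metric values are taken to be better.

/* Sketch of the epoch-based greedy partitioning policy from the slide.
 * run_epoch() is an assumed helper that applies a (p0 : 16-p0) partition
 * for one epoch and returns the policy metric (higher is better here,
 * e.g. the negated combined miss rate). */
#define TOTAL_COLORS 16
#define MIN_COLORS    2   /* each program gets at least two colors */

double run_epoch(int p0);            /* assumed measurement helper */
int    workload_finished(void);      /* assumed completion check   */

static void greedy_partitioning(void)
{
    int p0 = TOTAL_COLORS / 2;                    /* init: (8:8) */
    while (!workload_finished()) {
        double best = run_epoch(p0);              /* current partitioning   */
        int    next = p0;

        if (p0 - 1 >= MIN_COLORS) {               /* neighbor (P0-1 : P1+1) */
            double m = run_epoch(p0 - 1);
            if (m > best) { best = m; next = p0 - 1; }
        }
        if (p0 + 1 <= TOTAL_COLORS - MIN_COLORS) { /* neighbor (P0+1 : P1-1) */
            double m = run_epoch(p0 + 1);
            if (m > best) { best = m; next = p0 + 1; }
        }
        p0 = next;                                /* move to the best of the three */
    }
}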

Page 26:

Performance – Static & Dynamic

Using combined miss rate as the policy metric:

For RG-type and some RY-type workloads, static partitioning outperforms dynamic partitioning.

For RR-type and the other RY-type workloads, dynamic partitioning outperforms static partitioning.

Page 27:

Fairness – Metrics and Policy [PACT'04]

Metrics:
Evaluation metric: FM0, the difference in slowdowns (smaller is better).
Policy metrics: FM1 through FM5.

Policy: repartitioning and rollback.
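For a two-program workload, the evaluation metric can be written as below; this is our reading of the [PACT'04] definition rather than a formula shown on the slide.

% FM0 for two co-scheduled programs (our notation, following [PACT'04]):
% X_i is the slowdown of program i when sharing the cache.
\[
X_i = \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}}, \qquad
\mathrm{FM0} = \left| X_1 - X_2 \right|
\]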

Page 28:

Fairness - Result

Dynamic partitioning can achieve better fairness if we use FM0 as both the evaluation metric and the policy metric.

None of the policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable with static partitioning, even though strong correlation between them and FM0 was reported in the simulation-based study [PACT'04].

In our experiments, none of the policy metrics has a consistently strong correlation with FM0. The setups differ:

Our study: SPEC CPU2006 (ref input), trillions of instructions completed, 4MB L2 cache.
[PACT'04]: SPEC CPU2000 (test input), less than one billion instructions, 512KB L2 cache.

Page 29:

Conclusion

Confirmed some conclusions made by simulation studies.

Provided new insights and findings:
Giving cache space from one program to another can increase the performance of both.
There is poor correlation between the evaluation and policy metrics for fairness.

Made a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning designs.

Advantages of OS-based cache partitioning: works on commodity processors, enabling an execution- and measurement-based study.

Disadvantages of OS-based cache partitioning: co-partitioning (which may underutilize memory) and page migration overhead.

Page 30:

Ongoing Work

Reduce migration overhead on commodity processors

Cache partitioning at the compiler level: partition the cache at the object level.

Hybrid cache partitioning method: remove the cost of co-partitioning and avoid page migration overhead.

Page 31:

Jiang Lin1, Qingda Lu2, Xiaoning Ding2, Zhao Zhang1, Xiaodong Zhang2, and P. Sadayappan2

Gaining Insights into Multi-Core Cache Partitioning:

Bridging the Gap between Simulation and Real Systems

1 Iowa State University

2 The Ohio State University

Thanks!

Page 32:

Backup Slides

Page 33:

Fairness - Correlation between Evaluation Metrics and Policy Metrics (Reported by [PACT’04])

[Chart: correlations Corr(M1,M0) through Corr(M5,M0), ranging from -1 to 1, for apsi+equake, gzip+apsi, swim+gzip, tree+mcf, and the average over 18 workloads]

Strong correlation was reported in simulation study – [PACT’04]

Page 34:

Fairness - Correlation between Evaluation Metrics and Policy Metrics (Our result)

None of the policy metrics has a consistently strong correlation with FM0. The setups differ:

Our study: SPEC CPU2006 (ref input), trillions of instructions completed, 4MB L2 cache.
[PACT'04]: SPEC CPU2000 (test input), less than one billion instructions, 512KB L2 cache.

[Chart: correlations of FM1, FM3, FM4, and FM5 with FM0, ranging from -1 to 1, for workloads YY1-YY3, YG1-YG6, and GG1-GG3]