
Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors

Matt DeVuyst

Rakesh Kumar

Dean Tullsen

IPDPS: DeVuyst, Kumar, Tullsen

Some Definitions

Balanced schedule: A schedule of threads to contexts such that the number of threads per core is equal

Unbalanced schedule: A schedule of threads to contexts such that the number of threads per core is not equal
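The two definitions can be checked mechanically. The following is a minimal sketch, assuming a schedule is represented as a mapping from thread id to core index; the function names and the example schedules are illustrative and do not come from the slides.

```python
from collections import Counter


def threads_per_core(schedule, num_cores):
    """Count how many threads the schedule places on each core."""
    counts = Counter(schedule.values())
    return [counts.get(core, 0) for core in range(num_cores)]


def is_balanced(schedule, num_cores):
    """A schedule is balanced when every core runs the same number of threads."""
    return len(set(threads_per_core(schedule, num_cores))) == 1


# Hypothetical examples: thread id -> core index (cores numbered from 0).
balanced = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
unbalanced = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2}

print(is_balanced(balanced, 3))    # True
print(is_balanced(unbalanced, 3))  # False
```

Under this strict reading, seven threads on three cores can never be balanced, since seven is not divisible by three.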

[Figure: two example schedules on Cores 1-3. The first places Threads 1-6 on the cores (balanced); the second places Threads 1-7 (unbalanced).]


Why a CMP of SMT cores?

Chip makers are manufacturing more Chip Multiprocessors (CMP) with Simultaneous Multithreading (SMT)
  Power5
  Niagara

Very little work has been done on thread scheduling for such an architecture

Scheduling on this architecture is challenging


Application Diversity

Different applications have different needs

One way to effectively cope with application diversity is hardware heterogeneity [Kumar03]


Hardware Heterogeneity

[Figure: threads mapped onto heterogeneous cores]


Application Diversity

Different applications have different needs

One way to effectively cope with application diversity is hardware heterogeneity

Another way to deal with application diversity is soft heterogeneity


Soft Heterogeneity

[Figure: threads mapped onto SMT cores]


Scheduling Complexity

Given a 4-core CMP, with 4 contexts per core, and 12 threads

There are 15,400 balanced schedules

There are 644,875 unbalanced schedules

[Figure: cores and their contexts]
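The balanced count can be reproduced with a short calculation. This is a sketch under one assumption that the slide does not state explicitly: cores are treated as interchangeable, as are contexts within a core, so a schedule is just a split of the threads into equal-sized unordered groups. The function name is illustrative.

```python
from math import factorial


def count_balanced_schedules(num_threads, num_cores):
    """Ways to split the threads into num_cores unordered groups of equal size."""
    if num_threads % num_cores != 0:
        return 0
    per_core = num_threads // num_cores
    ordered_groups = factorial(num_threads) // factorial(per_core) ** num_cores
    # Divide by num_cores! because the cores are assumed interchangeable.
    return ordered_groups // factorial(num_cores)


print(count_balanced_schedules(12, 4))  # 15400
```

The unbalanced figure depends on counting conventions that the slide does not spell out, so it is not derived here.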


Our Goals

Find good scheduling policies

System-level scheduling
  → Granularity is an OS time-slice

Optimize for both power and performance
  Performance
  Power
  Energy
  Energy Delay Product (EDP) = Energy * Delay
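As a worked example of the metric, the sketch below takes EDP in its usual sense, energy multiplied by execution time (delay), and expresses a policy's EDP savings relative to a baseline. The function names and all numbers are made up for illustration; they are not measurements from this work.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_savings(baseline_edp, policy_edp):
    """Fractional EDP reduction of a policy relative to the baseline."""
    return (baseline_edp - policy_edp) / baseline_edp


# Made-up numbers: a policy that uses 5% less energy and runs 5% faster.
baseline = edp(100.0, 10.0)
candidate = edp(95.0, 9.5)
print(f"{edp_savings(baseline, candidate):.2%}")
```

Because the two factors multiply, a modest gain in both energy and delay compounds into a larger EDP saving than either gain alone.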


Outline

Architecture
Methodology
Scheduling Policies
Conclusions


Architecture

4 SMT cores

4 contexts per core

Shared L2, L3

Cores can be power-gated

[Figure: four SMT cores, each with four contexts (Ctx) and shared L1s, on top of the L2 and L3 caches]


Methodology

Benchmarks
  12 SPEC 2k benchmarks
  TLP varied over 4, 6, 8, 12, and 16 threads
  8 benchmark sets for each level of TLP
  Each benchmark is given fair coverage

Dynamic scheduling policies seeded with the best static schedule

A variant of SMTSIM and a CMP-aware version of Wattch


Outline

Architecture
Methodology
Scheduling Policies
  Naïve balanced scheduling policy
  Sampling-based policies
  Electron policies
Conclusions


Naïve Balanced Scheduling Policy

Main idea
  Spreading threads evenly across cores results in good resource utilization

How it works
  Each thread is assigned to a context such that the resulting schedule is balanced.
  The schedule is changed randomly over time.

This was our baseline for comparison
  Easy to implement
  Most common
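The baseline can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: threads are shuffled and dealt round-robin to the cores, and calling the function again at a later time slice yields a different random balanced schedule. The function name and the list-per-core representation are assumptions.

```python
import random


def naive_balanced_schedule(threads, num_cores, rng):
    """Deal shuffled threads round-robin so per-core counts differ by at most one."""
    shuffled = list(threads)
    rng.shuffle(shuffled)
    schedule = {core: [] for core in range(num_cores)}
    for position, thread in enumerate(shuffled):
        schedule[position % num_cores].append(thread)
    return schedule


rng = random.Random(0)  # fixed seed only to make the example repeatable
first = naive_balanced_schedule(range(12), 4, rng)
later = naive_balanced_schedule(range(12), 4, rng)  # re-drawn at a later time slice
print(first)
print(later)
```

With at most 16 threads on 4 cores, round-robin dealing never puts more than 4 threads on a core, so the 4-context limit is respected without an explicit check.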


What We Learn From Static Schedules

[Chart: EDP savings (y-axis, 0% to 20%) versus number of threads (x-axis: 4, 6, 8, 12) for Static Ideal and Static Balanced. Baseline is the naïve balanced dynamic policy.]



Sampling-based Policies

Main idea
  Try different schedules to find an effective one
  Oblivious to underlying hardware

How they work
  Two alternating phases
    Sampling phase: different schedules are sampled
    Steady phase: best schedule from sampling phase is used
  Steady phase is much longer than sampling phase
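The alternation of the two phases can be sketched as a simple loop. This is a toy model under stated assumptions: `make_samples` and `measure_edp` are hypothetical callbacks standing in for "generate the schedules to try" and "run a schedule briefly and read back its EDP", and the EDP values in the example are made up.

```python
def sampling_policy(make_samples, measure_edp, rounds, steady_slices):
    """Alternate a short sampling phase with a longer steady phase.

    Returns the schedule used in each steady-phase time slice.
    """
    timeline = []
    for _ in range(rounds):
        # Sampling phase: try each candidate and keep the one with the lowest EDP.
        best = min(make_samples(), key=measure_edp)
        # Steady phase: run the winner for many time slices.
        timeline.extend([best] * steady_slices)
    return timeline


# Toy stand-ins: a schedule is a tuple of thread counts per core,
# and the "measured" EDP values are invented for the example.
samples = [(3, 3, 3, 3), (4, 4, 4, 0), (4, 3, 3, 2)]
toy_edp = {(3, 3, 3, 3): 1.00, (4, 4, 4, 0): 0.93, (4, 3, 3, 2): 0.97}.get

timeline = sampling_policy(lambda: samples, toy_edp, rounds=2, steady_slices=5)
print(timeline[0], len(timeline))
```

Note that the loop never inspects the hardware; it only compares measured outcomes, which is what makes this family of policies oblivious to the underlying machine.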


Outline


Naïve balanced scheduling policy
Sampling-based policies
  Symbiosis policies [Snavely02]
  “Prefer Last” policies
Electron policies
Conclusions


Symbiosis Policy

Main idea
  Some threads run well together, others do not

How it works
  Sampling phase: random schedules created, performance sampled.
  Steady phase: the schedule in which threads achieve the most symbiosis is run
  Two versions:
    Balanced: only balanced schedules considered
    Unbalanced
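The steady-phase selection can be illustrated as follows. The slides do not say how symbiosis is quantified; this sketch assumes weighted speedup (each thread's IPC in the sampled schedule divided by its IPC when running alone, summed over threads), which is one common choice. The function names, the `balanced_only` flag, and every number are illustrative assumptions.

```python
def weighted_speedup(ipc_in_schedule, ipc_alone):
    """Assumed symbiosis metric: sum of per-thread IPC relative to running alone."""
    return sum(ipc_in_schedule[t] / ipc_alone[t] for t in ipc_in_schedule)


def is_balanced(schedule):
    """True when every core in the schedule runs the same number of threads."""
    return len({len(core) for core in schedule}) == 1


def pick_most_symbiotic(samples, ipc_alone, balanced_only=False):
    """samples: (schedule, measured per-thread IPC) pairs from the sampling phase."""
    eligible = [s for s in samples if not balanced_only or is_balanced(s[0])]
    return max(eligible, key=lambda s: weighted_speedup(s[1], ipc_alone))[0]


# Made-up measurements for four threads on two cores.
ipc_alone = {"a": 2.0, "b": 1.0, "c": 2.0, "d": 1.0}
samples = [
    ([["a", "b"], ["c", "d"]], {"a": 1.0, "b": 0.5, "c": 1.0, "d": 0.5}),
    ([["a", "c"], ["b", "d"]], {"a": 1.2, "b": 0.7, "c": 1.2, "d": 0.7}),
    ([["a", "b", "c"], ["d"]], {"a": 1.0, "b": 0.7, "c": 1.0, "d": 1.0}),
]
print(pick_most_symbiotic(samples, ipc_alone))
print(pick_most_symbiotic(samples, ipc_alone, balanced_only=True))
```

The only difference between the two versions in this sketch is the filter applied to the sampled schedules before the best one is chosen.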


Symbiosis Policy

[Chart: EDP savings (y-axis, -2% to 10%) versus number of threads (x-axis: 4, 6, 8, 12) for Symbiosis and Balanced Symbiosis. Baseline is naïve balanced.]



“Prefer Last” Policies

Main idea
  The current schedule has merit
  A similar schedule might be a little better

How they work
  Create multiple permutations of the current schedule
  Create a few random samples to avoid getting stuck in local minima
  Sample schedules and pick the best
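Candidate generation for such a policy might look like the sketch below. The two perturbations used here (moving one thread to another core, swapping two threads between cores) are assumptions about what a "permutation of the current schedule" could be; the slides do not define them. All function names and default counts are hypothetical.

```python
import random


def move_one(schedule, contexts_per_core, rng):
    """Return a copy with one thread moved to another core that has a free context."""
    new = [list(core) for core in schedule]
    sources = [i for i, core in enumerate(new) if core]
    if not sources:
        return new
    src = rng.choice(sources)
    dests = [i for i, core in enumerate(new)
             if i != src and len(core) < contexts_per_core]
    if dests:
        thread = new[src].pop(rng.randrange(len(new[src])))
        new[rng.choice(dests)].append(thread)
    return new


def swap_two(schedule, rng):
    """Return a copy with two threads on different cores exchanged."""
    new = [list(core) for core in schedule]
    occupied = [i for i, core in enumerate(new) if core]
    if len(occupied) >= 2:
        a, b = rng.sample(occupied, 2)
        i, j = rng.randrange(len(new[a])), rng.randrange(len(new[b]))
        new[a][i], new[b][j] = new[b][j], new[a][i]
    return new


def random_schedule(threads, num_cores, contexts_per_core, rng):
    """Place each thread on a random core that still has a free context."""
    new = [[] for _ in range(num_cores)]
    for thread in threads:
        open_cores = [core for core in new if len(core) < contexts_per_core]
        rng.choice(open_cores).append(thread)
    return new


def prefer_last_candidates(current, contexts_per_core, rng,
                           num_perturbed=4, num_random=2):
    """Current schedule, a few small perturbations of it, and a few random ones."""
    threads = [t for core in current for t in core]
    candidates = [current]
    for k in range(num_perturbed):
        candidates.append(swap_two(current, rng) if k % 2 == 0
                          else move_one(current, contexts_per_core, rng))
    for _ in range(num_random):
        candidates.append(
            random_schedule(threads, len(current), contexts_per_core, rng))
    return candidates


current = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
for candidate in prefer_last_candidates(current, 4, random.Random(0)):
    print(candidate)
```

Keeping the current schedule in the candidate list means the policy never does worse than staying put, at least as far as the sampled measurements go.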


Sampling Based Policies

[Chart: EDP savings (y-axis, -2% to 10%) versus number of threads (x-axis: 4, 6, 8, 12) for Prefer Last - Numbers, Prefer Last - Swap, and Prefer Last - Move.]


Sampling Based Policies

[Chart: EDP savings (y-axis, -2% to 10%) versus number of threads (x-axis: 4, 6, 8, 12) for Symbiosis, Balanced Symbiosis, and Prefer Last - Move.]


Issues With Sampling Based Policies

Non-scalable
  Search space grows → number of samples grows

Overhead of sampling
  Some schedules result in improvement
  …but most just make things worse



Electron Policies

Main idea
  One core attracts a thread
  Another core repels a thread.

How it works (EDP)
  Highest EDP core identified
  Lowest EDP core identified
  A thread running on the low EDP core is moved to the high EDP core
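A single step of this policy can be sketched as below, following the three bullets literally. Several details are assumptions because the slides do not give them: per-core EDP values are supplied by the caller, the thread to move is chosen arbitrarily (the last one listed), and no move is made if the receiving core has no free context. The function name is hypothetical.

```python
def electron_step(schedule, core_edp, contexts_per_core):
    """Move one thread from the lowest-EDP core to the highest-EDP core.

    schedule: dict core -> list of threads; core_edp: dict core -> measured EDP.
    Returns a new schedule and leaves the input untouched.
    """
    new = {core: list(threads) for core, threads in schedule.items()}
    high = max(new, key=lambda core: core_edp[core])
    low = min(new, key=lambda core: core_edp[core])
    # Skip the move when it is impossible or pointless.
    if high == low or not new[low] or len(new[high]) >= contexts_per_core:
        return new
    new[high].append(new[low].pop())  # arbitrary choice: the last thread listed
    return new


# Made-up per-core EDP readings for an eight-thread schedule.
schedule = {0: ["t1", "t2", "t3"], 1: ["t4", "t5"], 2: ["t6", "t7"], 3: ["t8"]}
core_edp = {0: 1.0, 1: 4.0, 2: 2.0, 3: 3.0}
print(electron_step(schedule, core_edp, contexts_per_core=4))
```

Each step needs only two per-core comparisons and moves a single thread, in contrast to policies that must sample many whole schedules before deciding.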


Electron Policies

[Figure: threads t1-t8 placed on Cores 1-4, with the core with the highest EDP and the core with the lowest EDP marked.]


Electron Policy Results

[Chart: EDP savings (y-axis, 0% to 12%) versus number of threads (x-axis: 4, 6, 8, 12) for Balanced Symbiosis, Prefer Last - Move, and Electron.]



Conclusions

A good scheduling policy for a CMP of SMTs must consider unbalanced schedules to achieve the most efficiency.

“Prefer Last” policies yield more energy savings than symbiotic scheduling policies and the naïve balanced policy.

Electron policies have low overhead and are particularly effective when TLP is high.
