coordinated multiprocessor frequency and folding control · 2013-12-22 · fluidanimate 0.0 0.5 1.0...

Coordinated MultiprocessorFrequency and Folding Control

Motivation Evaluation: SPECpower Benchmark

Augusto Vega1, Alper Buyuktosunoglu1, Heather Hanson2, Pradip Bose1, Srinivasan Ramani21IBM T. J. Watson Research Center, 2IBM Systems & Technology Group

Performance and Throughput AwarenessMotivation Evaluation: SPECpower BenchmarkPerformance and Throughput Awareness• Modern multi-core systems incorporate support for

dynamic power management with multiple actuators Dynamic voltage and frequency scaling (DVFS) Core folding Per-core power gating (PCPG) PCPGDVFS

DVFS Control

Algorithm

PCPG Control

Algorithm Operate power management knobs dependingon if an application’s current execution phase issingle-thread performance or throughput bound

• SPECpower benchmark JVMs represent “warehouses”

that receive requests from clients Load levels: 100% .. 10%

PAMPA preserves throughput

1.00

1.25

ssj_

ops

No PM

PCPG-Only

• Algorithms that control these actuators evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power

We define three scenarios:A. Symptom: All turned-on cores are highly utilized

Diagnosis: Application likely bound by throughputAction: Turn cores on

B. Symptom: Some turned-on cores are highly utilizedDiagnosis: Application likely bound by single-thread

100%

off

100% 100% 100%

off off off

100% 100% 100% 0%

Throughput metric: SPECpoweroperations per second (ssj_ops)

PAMPA slightlyincreases system

power

0.00

0.25

0.50

0.75

100 90 80 70 60 50 40 30 20 10

ssj_

ops PCPG-Only

DVFS-Only

PCPG+DVFS

PAMPA

0.50

0.75

1.00

1.25

Syst

em P

ower No PM

PCPG-Only

DVFS-Only

Sometimes Coordination Matters

We argue in favor of a coordinated control of these actuators to avoid such potential conflicts in dynamic power management

Diagnosis: Application likely bound by single-threadperformance

Action: Increase frequency

C. Symptom: All turned-on cores are low utilized or idleDiagnosis: Enough “slack” to safely decrease frequency

or turn cores offAction: Decrease frequency or turn cores off

100%

20%

100% 100% 0%

0% off off

15%

0%

30% 20% 20%

25% 0% off

• For this benchmark: PCPG+DVFS operates properly PAMPA exhibits efficiency levels

close to PCPG+DVFS

power

0.00

0.25

0.50

100 90 80 70 60 50 40 30 20 10

Syst

em P

ower

DVFS-Only

PCPG+DVFS

PAMPA

1.01.52.02.53.03.54.04.5

Freq

uenc

y (G

Hz)

No PM

PCPG-Only

DVFS-Only

PCPG+DVFSSometimes Coordination Matters

The PAMPA AlgorithmThis? Or this?

What is more appropriate?

0% 25% 0% off

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORS

PAMPA aggressivelyactuates DVFS and PCPG

0.00.51.0

100 90 80 70 60 50 40 30 20 10

Freq

uenc

y (G

Hz)

PAMPA

02468

10121416

On

Core

s

No PM

PCPG-Only

DVFS-Only

PCPG+DVFS

PAMPA

Evaluation: PARSEC Benchmarks

Pedals

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORSALGORITHM

Only enabled during stable • PARSEC benchmarks

PAMPA preserves performance

0

100 90 80 70 60 50 40 30 20 10Load Level (%)

Decoupled Power Management

Pedals

Coordinated Control Decoupled Control

during stable execution phases

• PARSEC benchmarks Multi-threaded CPU-intensive Sensitive to

frequency scaling

• For these benchmarks: PCPG+DVFS degrades

performance

0.0

0.5

1.0

1.5

2.0

Exec

utio

n Ti

me

Fluidanimate

0.0

0.5

1.0

1.5

2.0Streamcluster

2.0 2.0Decoupled Power Management• Power control techniques have evolved in a decoupled manner

Less complexity Different timescale granularities

• Their independent actuation can lead to conflicting decisionsthat jeopardize system power-performance efficiency

PCPG+DVFS degradesperformance excessively

PAMPA preservesperformance Prevents frequency

downscaling

0.0

0.5

1.0

1.5

Syst

em P

ower

0.0

0.5

1.0

1.5

0.5

1.0

1.5

2.0

EPI

0.5

1.0

1.5

2.0

that jeopardize system power-performance efficiency

Goal: keep core utilization as high as possible

PAMPA properlyactuates DVFS

and PCPG

0.0

0.5

0.0

0.5

0.00.51.01.52.02.53.03.54.04.5

Freq

uenc

y (G

Hz)

0.00.51.01.52.02.53.03.54.04.5

121416

121416

Methodology

and PCPG

02468

1012

8 16 32 64 Average

On

Core

s

Software Threads

02468

1012

8 16 32 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

MethodologyEvaluation: HPC Applications

• IBM POWER7+ system with AIX O/S• Two 8-core processors (16 cores total) • Lulesh and CoMD

Lulesh:Livermore UnstructuredLagrangian Explicit Shock

PAMPA preserves performance

CoMDLulesh

A robust coordination protocol is necessary in

Frequency is downscaled to risky levels

POWER7+ POWER7+

POWER Hypervisor

OS (AIX)Application Application

PAMPAPCPGPCPG

PerformanceCounters

PerformanceCounters

PAMPA operates at O/S levelEstimates core-level

utilization at run-timeActuates core folding

and PCPG at O/S levelConnects to the EnergyScale

Lagrangian Explicit ShockHydrodynamics

CoMD: classical moleculardynamics algorithm

Message PassingInterface (MPI)

CPU-intensive Sensitive to

0.0

0.5

1.0

1.5

0.5

1.0

1.5

0.0

0.5

1.0

1.5

Exec

utio

n Ti

me

0.5

1.0

1.5

Syst

em P

ower

When the Decoupled Approach Does Not Work

A robust coordination protocol is necessary inorchestrating the power management functions

P7+ System

EnergyScaleMicrocontroller

DVFSDVFS

Connects to the EnergyScaleµcontroller to adjust processor-level voltage and frequency

Sensitive tofrequency scaling

PAMPA properlyactuates DVFS

and PCPG

0.0

0.5

0.0

0.5

1.0

1.5

3.54.04.5

0.0

0.5

Syst

em P

ower

0.0

0.5

1.0

1.5

EPI

3.54.04.5

Freq

uenc

y (G

Hz)

• Fluidanimate benchmark Multi-threaded application

from the PARSEC suite Fluid motion physics simulation

for real-time animation purposes

• We evaluate two scenarios1. Only PCPG baseline2. Decoupled PCPG+DVFS

• Five scenarios evaluated:1. No power management2. PCPG Only3. DVFS Only4. PCPG + DVFS baseline5. PAMPA our approach

• Single-thread performance- andthroughput-oriented benchmarks:• SPECpower_ssj2008• Four PARSEC applications• Two HPC applications

• For Lulesh: PCPG+DVFS degrades

performance excessively PAMPA preserves

performance

0.00.51.01.52.02.53.03.5

2468

10121416

0.00.51.01.52.02.53.03.5

Freq

uenc

y (G

Hz)

2468

10121416

On

Core

s

Summary

performance Prevents frequency

downscaling

0.0

0.2

0.4

0.6

0.8

1.0

Nor

mal

ized

to P

CPG

Fluidanimate

SpeedupSystem PowerEfficiency

! ! ! !

Performance

02

1 4 8 16 Average

Software Threads

02

1 8 27 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

• Modern systems incorporate multipleactuators for dynamic power management

• Algorithms that control these actuators have evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power• We propose a coordinated, in-order actuation of the power knobs

Why should I use PAMPA? Exhibits power-performance efficiencies comparable to

the most aggressive, decoupled PCPG+DVFS approach2.2

3.3

4.4

8

12

16

Freq

uenc

y (G

Hz)

On

Core

Cou

nt

Fluidanimate

0.0

8 16 32 64

Software Threads

EfficiencyPerformanceis hurt at

unacceptable levels!

• We propose a coordinated, in-order actuation of the power knobs

• PAMPA: Power-Aware Management of Processor Actuators• Dynamically “detects” if an application is single-thread performance

or throughput bound and actuates the knobs accordingly

the most aggressive, decoupled PCPG+DVFS approach

Avoids excessive performance degradation in cases where PCPG+DVFS results in conflicting decisions0.0

1.1

2.2

0

4

8

8 16 32 64

Freq

uenc

y (G

Hz)

On

Core

Cou

nt

Software Threads

On CoresFreq. (GHz)

coordinated multiprocessor frequency and folding control · 2013-12-22 · fluidanimate 0.0 0.5 1.0...

Documents