coordinated multiprocessor frequency and folding control · 2013-12-22 · fluidanimate 0.0 0.5 1.0...

Coordinated MultiprocessorFrequency and Folding Control

Motivation Evaluation: SPECpower Benchmark

Augusto Vega1, Alper Buyuktosunoglu1, Heather Hanson2, Pradip Bose1, Srinivasan Ramani21IBM T. J. Watson Research Center, 2IBM Systems & Technology Group

Performance and Throughput AwarenessMotivation Evaluation: SPECpower BenchmarkPerformance and Throughput Awareness• Modern multi-core systems incorporate support for

dynamic power management with multiple actuators Dynamic voltage and frequency scaling (DVFS) Core folding Per-core power gating (PCPG) PCPGDVFS

DVFS Control

Algorithm

PCPG Control

Algorithm Operate power management knobs dependingon if an application’s current execution phase issingle-thread performance or throughput bound

• SPECpower benchmark JVMs represent “warehouses”

that receive requests from clients Load levels: 100% .. 10%

PAMPA preserves throughput

PCPG-Only

• Algorithms that control these actuators evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power

We define three scenarios:A. Symptom: All turned-on cores are highly utilized

Diagnosis: Application likely bound by throughputAction: Turn cores on

B. Symptom: Some turned-on cores are highly utilizedDiagnosis: Application likely bound by single-thread

100% 100% 100%

off off off

100% 100% 100% 0%

Throughput metric: SPECpoweroperations per second (ssj_ops)

PAMPA slightlyincreases system

100 90 80 70 60 50 40 30 20 10

ops PCPG-Only

DVFS-Only

PCPG+DVFS

ower No PM

PCPG-Only

DVFS-Only

Sometimes Coordination Matters

We argue in favor of a coordinated control of these actuators to avoid such potential conflicts in dynamic power management

Diagnosis: Application likely bound by single-threadperformance

Action: Increase frequency

C. Symptom: All turned-on cores are low utilized or idleDiagnosis: Enough “slack” to safely decrease frequency

or turn cores offAction: Decrease frequency or turn cores off

100% 100% 0%

0% off off

30% 20% 20%

25% 0% off

• For this benchmark: PCPG+DVFS operates properly PAMPA exhibits efficiency levels

close to PCPG+DVFS

100 90 80 70 60 50 40 30 20 10

DVFS-Only

PCPG+DVFS

1.01.52.02.53.03.54.04.5

PCPG-Only

DVFS-Only

PCPG+DVFSSometimes Coordination Matters

The PAMPA AlgorithmThis? Or this?

What is more appropriate?

0% 25% 0% off

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORS

PAMPA aggressivelyactuates DVFS and PCPG

0.00.51.0

100 90 80 70 60 50 40 30 20 10

10121416

PCPG-Only

DVFS-Only

PCPG+DVFS

Evaluation: PARSEC Benchmarks

Pedals

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORSALGORITHM

Only enabled during stable • PARSEC benchmarks

PAMPA preserves performance

100 90 80 70 60 50 40 30 20 10Load Level (%)

Decoupled Power Management

Pedals

Coordinated Control Decoupled Control

during stable execution phases

• PARSEC benchmarks Multi-threaded CPU-intensive Sensitive to

frequency scaling

• For these benchmarks: PCPG+DVFS degrades

performance

Fluidanimate

2.0Streamcluster

2.0 2.0Decoupled Power Management• Power control techniques have evolved in a decoupled manner

Less complexity Different timescale granularities

• Their independent actuation can lead to conflicting decisionsthat jeopardize system power-performance efficiency

PCPG+DVFS degradesperformance excessively

PAMPA preservesperformance Prevents frequency

downscaling

that jeopardize system power-performance efficiency

Goal: keep core utilization as high as possible

PAMPA properlyactuates DVFS

and PCPG

0.00.51.01.52.02.53.03.54.04.5

121416

Methodology

and PCPG

8 16 32 64 Average

Software Threads

8 16 32 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

MethodologyEvaluation: HPC Applications

• IBM POWER7+ system with AIX O/S• Two 8-core processors (16 cores total) • Lulesh and CoMD

Lulesh:Livermore UnstructuredLagrangian Explicit Shock

PAMPA preserves performance

CoMDLulesh

A robust coordination protocol is necessary in

Frequency is downscaled to risky levels

POWER7+ POWER7+

POWER Hypervisor

OS (AIX)Application Application

PAMPAPCPGPCPG

PerformanceCounters

PAMPA operates at O/S levelEstimates core-level

utilization at run-timeActuates core folding

and PCPG at O/S levelConnects to the EnergyScale

Lagrangian Explicit ShockHydrodynamics

CoMD: classical moleculardynamics algorithm

Message PassingInterface (MPI)

CPU-intensive Sensitive to

When the Decoupled Approach Does Not Work

A robust coordination protocol is necessary inorchestrating the power management functions

P7+ System

EnergyScaleMicrocontroller

DVFSDVFS

Connects to the EnergyScaleµcontroller to adjust processor-level voltage and frequency

Sensitive tofrequency scaling

PAMPA properlyactuates DVFS

and PCPG

3.54.04.5

• Fluidanimate benchmark Multi-threaded application

from the PARSEC suite Fluid motion physics simulation

for real-time animation purposes

• We evaluate two scenarios1. Only PCPG baseline2. Decoupled PCPG+DVFS

• Five scenarios evaluated:1. No power management2. PCPG Only3. DVFS Only4. PCPG + DVFS baseline5. PAMPA our approach

• Single-thread performance- andthroughput-oriented benchmarks:• SPECpower_ssj2008• Four PARSEC applications• Two HPC applications

• For Lulesh: PCPG+DVFS degrades

performance excessively PAMPA preserves

performance

0.00.51.01.52.02.53.03.5

10121416

0.00.51.01.52.02.53.03.5

10121416

Summary

performance Prevents frequency

downscaling

Fluidanimate

SpeedupSystem PowerEfficiency

! ! ! !

Performance

1 4 8 16 Average

Software Threads

1 8 27 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

• Modern systems incorporate multipleactuators for dynamic power management

• Algorithms that control these actuators have evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power• We propose a coordinated, in-order actuation of the power knobs

Why should I use PAMPA? Exhibits power-performance efficiencies comparable to

the most aggressive, decoupled PCPG+DVFS approach2.2

Fluidanimate

8 16 32 64

Software Threads

EfficiencyPerformanceis hurt at

unacceptable levels!

• We propose a coordinated, in-order actuation of the power knobs

• PAMPA: Power-Aware Management of Processor Actuators• Dynamically “detects” if an application is single-thread performance

or throughput bound and actuates the knobs accordingly

the most aggressive, decoupled PCPG+DVFS approach

Avoids excessive performance degradation in cases where PCPG+DVFS results in conflicting decisions0.0

8 16 32 64

Software Threads

On CoresFreq. (GHz)

coordinated multiprocessor frequency and folding control · 2013-12-22 · fluidanimate 0.0 0.5 1.0...

Documents

characterizing the tlb behavior of emerging parallel...

does cache sharing on modern cmp matter to the performance...

a benchmark proposal for datacenter computing ·...

rrhh 2.0 nuevos perfiles profesionales. trabajador 2.0....

hospital 2.0 = actitud 2.0

biblioteca 2.0 +web 2.0

web 2.0 library 2.0 librarian 2.0

web 2.0, bibliotheek 2.0 & zorg 2.0

web 2.0 / enterprise 2.0

recrutement 2.0 & mariage 2.0

2.0 world: classroom 2.0, library 2.0, research 2.0

evaluating the impact of openmp 4.0 extensions on relevant...

sponsorointi 2.0 - sponsorship 2.0

web 2.0 - arbeitnehmerinnen 2.0

intune: coordinating multicore islands to achieve global...

banking 2.0 / arbeit 2.0

coz : finding code that counts with causal...

gela 2.0 - clase 2.0

web 2.0 library 2.0 librarian 2.0

gov 2.0 public 2.0 bad guys 2.0 v3