coordinated multiprocessor frequency and folding control · 2013-12-22 · fluidanimate 0.0 0.5 1.0...

1
Coordinated Multiprocessor Frequency and Folding Control Motivation Evaluation: SPECpower Benchmark Augusto Vega 1 , Alper Buyuktosunoglu 1 , Heather Hanson 2 , Pradip Bose 1 , Srinivasan Ramani 2 1 IBM T. J. Watson Research Center, 2 IBM Systems & Technology Group Performance and Throughput Awareness Motivation Evaluation: SPECpower Benchmark Performance and Throughput Awareness Modern multi-core systems incorporate support for dynamic power management with multiple actuators Dynamic voltage and frequency scaling (DVFS) Core folding Per-core power gating (PCPG) PCPG DVFS DVFS Control Algorithm PCPG Control Algorithm Operate power management knobs depending on if an application’s current execution phase is single-thread performance or throughput bound SPECpower benchmark JVMs represent “warehouses” that receive requests from clients Load levels: 100% .. 10% PAMPA preserves throughput 1.00 1.25 s No PM PCPG-Only Algorithms that control these actuators evolved independently Their independent operation can result in conflicting decisions that can lead to undesirable effects on performance and power We define three scenarios: A. Symptom: All turned-on cores are highly utilized Diagnosis: Application likely bound by throughput Action: Turn cores on B. Symptom: Some turned-on cores are highly utilized Diagnosis: Application likely bound by single-thread 100% off 100% 100% 100% off off off 100% 100% 100% 0% Throughput metric: SPECpower operations per second (ssj_ops) PAMPA slightly increases system power 0.00 0.25 0.50 0.75 100 90 80 70 60 50 40 30 20 10 ssj_ops PCPG-Only DVFS-Only PCPG+DVFS PAMPA 0.50 0.75 1.00 1.25 em Power No PM PCPG-Only DVFS-Only Sometimes Coordination Matters We argue in favor of a coordinated control of these actuators to avoid such potential conflicts in dynamic power management Diagnosis: Application likely bound by single-thread performance Action: Increase frequency C. Symptom: All turned-on cores are low utilized or idle Diagnosis: Enough “slack” to safely decrease frequency or turn cores off Action: Decrease frequency or turn cores off 100% 20% 100% 100% 0% 0% off off 15% 0% 30% 20% 20% 25% 0% off For this benchmark: PCPG+DVFS operates properly PAMPA exhibits efficiency levels close to PCPG+DVFS 0.00 0.25 0.50 100 90 80 70 60 50 40 30 20 10 Syste PCPG+DVFS PAMPA 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 requency (GHz) No PM PCPG-Only DVFS-Only PCPG+DVFS Sometimes Coordination Matters The PAMPA Algorithm This? Or this? What is more appropriate? POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORS PAMPA aggressively actuates DVFS and PCPG 0.0 0.5 1.0 100 90 80 70 60 50 40 30 20 10 Fr PAMPA 0 2 4 6 8 10 12 14 16 On Cores No PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA Evaluation: PARSEC Benchmarks Pedals POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORS ALGORITHM Only enabled during stable PARSEC benchmarks PAMPA preserves performance 0 100 90 80 70 60 50 40 30 20 10 Load Level (%) Decoupled Power Management Pedals Coordinated Control Decoupled Control during stable execution phases Multi-threaded CPU-intensive Sensitive to frequency scaling For these benchmarks: PCPG+DVFS degrades performance 0.0 0.5 1.0 1.5 2.0 Execution Time Fluidanimate 0.0 0.5 1.0 1.5 2.0 Streamcluster 2.0 2.0 Decoupled Power Management Power control techniques have evolved in a decoupled manner Less complexity Different timescale granularities Their independent actuation can lead to conflicting decisions that jeopardize system power-performance efficiency PCPG+DVFS degrades performance excessively PAMPA preserves performance Prevents frequency downscaling 0.0 0.5 1.0 1.5 System Power 0.0 0.5 1.0 1.5 0.5 1.0 1.5 2.0 EPI 0.5 1.0 1.5 2.0 that jeopardize system power-performance efficiency Goal: keep core utilization as high as possible PAMPA properly actuates DVFS and PCPG 0.0 0.5 0.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Frequency (GHz) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 12 14 16 12 14 16 Methodology and PCPG 0 2 4 6 8 10 12 8 16 32 64 Average On Cores Software Threads 0 2 4 6 8 10 12 8 16 32 64 Average Software Threads No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA Methodology Evaluation: HPC Applications IBM POWER7+ system with AIX O/S Two 8-core processors (16 cores total) Lulesh and CoMD Lulesh: Livermore Unstructured Lagrangian Explicit Shock PAMPA preserves performance CoMD Lulesh A robust coordination protocol is necessary in Frequency is downscaled to risky levels POWER7+ POWER7+ POWER Hypervisor OS (AIX) Application Application PAMPA PCPG Performance Counters PAMPA operates at O/S level Estimates core-level utilization at run-time Actuates core folding and PCPG at O/S level Connects to the EnergyScale Lagrangian Explicit Shock Hydrodynamics CoMD: classical molecular dynamics algorithm Message Passing Interface (MPI) CPU-intensive Sensitive to 0.0 0.5 1.0 1.5 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Execution Time 0.5 1.0 1.5 stem Power When the Decoupled Approach Does Not Work A robust coordination protocol is necessary in orchestrating the power management functions P7+ System EnergyScale Microcontroller DVFS Connects to the EnergyScale μcontroller to adjust processor- level voltage and frequency Sensitive to frequency scaling PAMPA properly actuates DVFS and PCPG 0.0 0.5 0.0 0.5 1.0 1.5 3.5 4.0 4.5 0.0 0.5 Sys 0.0 0.5 1.0 1.5 EPI 3.5 4.0 4.5 Hz) Fluidanimate benchmark Multi-threaded application from the PARSEC suite Fluid motion physics simulation for real-time animation purposes We evaluate two scenarios 1. Only PCPG baseline 2. Decoupled PCPG+DVFS Five scenarios evaluated: 1. No power management 2. PCPG Only 3. DVFS Only 4. PCPG + DVFS baseline 5. PAMPA our approach Single-thread performance- and throughput-oriented benchmarks: SPECpower_ssj2008 Four PARSEC applications Two HPC applications For Lulesh: PCPG+DVFS degrades performance excessively PAMPA preserves performance 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 2 4 6 8 10 12 14 16 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Frequency (GH 2 4 6 8 10 12 14 16 On Cores Summary performance Prevents frequency downscaling 0.0 0.2 0.4 0.6 0.8 1.0 Normalized to PCPG Fluidanimate Speedup System Power Efficiency ! ! ! ! Performance 0 2 1 4 8 16 Average Software Threads 0 2 1 8 27 64 Average Software Threads No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA Modern systems incorporate multiple actuators for dynamic power management Algorithms that control these actuators have evolved independently Their independent operation can result in conflicting decisions that can lead to undesirable effects on performance and power We propose a coordinated, in-order actuation of the power knobs Why should I use PAMPA? Exhibits power-performance efficiencies comparable to the most aggressive, decoupled PCPG+DVFS approach 2.2 3.3 4.4 8 12 16 ncy (GHz) re Count Fluidanimate 0.0 8 16 32 64 Software Threads Performance is hurt at unacceptable levels! We propose a coordinated, in-order actuation of the power knobs PAMPA: Power-Aware Management of Processor Actuators Dynamically “detects” if an application is single-thread performance or throughput bound and actuates the knobs accordingly the most aggressive, decoupled PCPG+DVFS approach Avoids excessive performance degradation in cases where PCPG+DVFS results in conflicting decisions 0.0 1.1 2.2 0 4 8 8 16 32 64 Frequen On Cor Software Threads On Cores Freq. (GHz)

Upload: others

Post on 07-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Coordinated Multiprocessor Frequency and Folding Control · 2013-12-22 · Fluidanimate 0.0 0.5 1.0 1.5 2.0 Streamcluster 2.0 2.0 • Power control techniques have evolved in a decoupled

Coordinated MultiprocessorFrequency and Folding Control

Motivation Evaluation: SPECpower Benchmark

Augusto Vega1, Alper Buyuktosunoglu1, Heather Hanson2, Pradip Bose1, Srinivasan Ramani21IBM T. J. Watson Research Center, 2IBM Systems & Technology Group

Performance and Throughput AwarenessMotivation Evaluation: SPECpower BenchmarkPerformance and Throughput Awareness• Modern multi-core systems incorporate support for

dynamic power management with multiple actuators Dynamic voltage and frequency scaling (DVFS) Core folding Per-core power gating (PCPG) PCPGDVFS

DVFS Control

Algorithm

PCPG Control

Algorithm Operate power management knobs dependingon if an application’s current execution phase issingle-thread performance or throughput bound

• SPECpower benchmark JVMs represent “warehouses”

that receive requests from clients Load levels: 100% .. 10%

PAMPA preserves throughput

1.00

1.25

ssj_

ops

No PM

PCPG-Only

• Algorithms that control these actuators evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power

We define three scenarios:A. Symptom: All turned-on cores are highly utilized

Diagnosis: Application likely bound by throughputAction: Turn cores on

B. Symptom: Some turned-on cores are highly utilizedDiagnosis: Application likely bound by single-thread

100%

off

100% 100% 100%

off off off

100% 100% 100% 0%

Throughput metric: SPECpoweroperations per second (ssj_ops)

PAMPA slightlyincreases system

power

0.00

0.25

0.50

0.75

100 90 80 70 60 50 40 30 20 10

ssj_

ops PCPG-Only

DVFS-Only

PCPG+DVFS

PAMPA

0.50

0.75

1.00

1.25

Syst

em P

ower No PM

PCPG-Only

DVFS-Only

Sometimes Coordination Matters

We argue in favor of a coordinated control of these actuators to avoid such potential conflicts in dynamic power management

Diagnosis: Application likely bound by single-threadperformance

Action: Increase frequency

C. Symptom: All turned-on cores are low utilized or idleDiagnosis: Enough “slack” to safely decrease frequency

or turn cores offAction: Decrease frequency or turn cores off

100%

20%

100% 100% 0%

0% off off

15%

0%

30% 20% 20%

25% 0% off

• For this benchmark: PCPG+DVFS operates properly PAMPA exhibits efficiency levels

close to PCPG+DVFS

power

0.00

0.25

0.50

100 90 80 70 60 50 40 30 20 10

Syst

em P

ower

DVFS-Only

PCPG+DVFS

PAMPA

1.01.52.02.53.03.54.04.5

Freq

uenc

y (G

Hz)

No PM

PCPG-Only

DVFS-Only

PCPG+DVFSSometimes Coordination Matters

The PAMPA AlgorithmThis? Or this?

What is more appropriate?

0% 25% 0% off

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORS

PAMPA aggressivelyactuates DVFS and PCPG

0.00.51.0

100 90 80 70 60 50 40 30 20 10

Freq

uenc

y (G

Hz)

PAMPA

02468

10121416

On

Core

s

No PM

PCPG-Only

DVFS-Only

PCPG+DVFS

PAMPA

Evaluation: PARSEC Benchmarks

Pedals

POWER AWARE MANAGEMENT OF PROCESSOR ACTUATORSALGORITHM

Only enabled during stable • PARSEC benchmarks

PAMPA preserves performance

0

100 90 80 70 60 50 40 30 20 10Load Level (%)

Decoupled Power Management

Pedals

Coordinated Control Decoupled Control

during stable execution phases

• PARSEC benchmarks Multi-threaded CPU-intensive Sensitive to

frequency scaling

• For these benchmarks: PCPG+DVFS degrades

performance

0.0

0.5

1.0

1.5

2.0

Exec

utio

n Ti

me

Fluidanimate

0.0

0.5

1.0

1.5

2.0Streamcluster

2.0 2.0Decoupled Power Management• Power control techniques have evolved in a decoupled manner

Less complexity Different timescale granularities

• Their independent actuation can lead to conflicting decisionsthat jeopardize system power-performance efficiency

PCPG+DVFS degradesperformance excessively

PAMPA preservesperformance Prevents frequency

downscaling

0.0

0.5

1.0

1.5

Syst

em P

ower

0.0

0.5

1.0

1.5

0.5

1.0

1.5

2.0

EPI

0.5

1.0

1.5

2.0

that jeopardize system power-performance efficiency

Goal: keep core utilization as high as possible

PAMPA properlyactuates DVFS

and PCPG

0.0

0.5

0.0

0.5

0.00.51.01.52.02.53.03.54.04.5

Freq

uenc

y (G

Hz)

0.00.51.01.52.02.53.03.54.04.5

121416

121416

Methodology

and PCPG

02468

1012

8 16 32 64 Average

On

Core

s

Software Threads

02468

1012

8 16 32 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

MethodologyEvaluation: HPC Applications

• IBM POWER7+ system with AIX O/S• Two 8-core processors (16 cores total) • Lulesh and CoMD

Lulesh:Livermore UnstructuredLagrangian Explicit Shock

PAMPA preserves performance

CoMDLulesh

A robust coordination protocol is necessary in

Frequency is downscaled to risky levels

POWER7+ POWER7+

POWER Hypervisor

OS (AIX)Application Application

PAMPAPCPGPCPG

PerformanceCounters

PerformanceCounters

PAMPA operates at O/S levelEstimates core-level

utilization at run-timeActuates core folding

and PCPG at O/S levelConnects to the EnergyScale

Lagrangian Explicit ShockHydrodynamics

CoMD: classical moleculardynamics algorithm

Message PassingInterface (MPI)

CPU-intensive Sensitive to

0.0

0.5

1.0

1.5

0.5

1.0

1.5

0.0

0.5

1.0

1.5

Exec

utio

n Ti

me

0.5

1.0

1.5

Syst

em P

ower

When the Decoupled Approach Does Not Work

A robust coordination protocol is necessary inorchestrating the power management functions

P7+ System

EnergyScaleMicrocontroller

DVFSDVFS

Connects to the EnergyScaleµcontroller to adjust processor-level voltage and frequency

Sensitive tofrequency scaling

PAMPA properlyactuates DVFS

and PCPG

0.0

0.5

0.0

0.5

1.0

1.5

3.54.04.5

0.0

0.5

Syst

em P

ower

0.0

0.5

1.0

1.5

EPI

3.54.04.5

Freq

uenc

y (G

Hz)

• Fluidanimate benchmark Multi-threaded application

from the PARSEC suite Fluid motion physics simulation

for real-time animation purposes

• We evaluate two scenarios1. Only PCPG baseline2. Decoupled PCPG+DVFS

• Five scenarios evaluated:1. No power management2. PCPG Only3. DVFS Only4. PCPG + DVFS baseline5. PAMPA our approach

• Single-thread performance- andthroughput-oriented benchmarks:• SPECpower_ssj2008• Four PARSEC applications• Two HPC applications

• For Lulesh: PCPG+DVFS degrades

performance excessively PAMPA preserves

performance

0.00.51.01.52.02.53.03.5

2468

10121416

0.00.51.01.52.02.53.03.5

Freq

uenc

y (G

Hz)

2468

10121416

On

Core

s

Summary

performance Prevents frequency

downscaling

0.0

0.2

0.4

0.6

0.8

1.0

Nor

mal

ized

to P

CPG

Fluidanimate

SpeedupSystem PowerEfficiency

! ! ! !

Performance

02

1 4 8 16 Average

Software Threads

02

1 8 27 64 Average

Software Threads

No-PM PCPG-Only DVFS-Only PCPG+DVFS PAMPA

• Modern systems incorporate multipleactuators for dynamic power management

• Algorithms that control these actuators have evolved independently• Their independent operation can result in conflicting decisions

that can lead to undesirable effects on performance and power• We propose a coordinated, in-order actuation of the power knobs

Why should I use PAMPA? Exhibits power-performance efficiencies comparable to

the most aggressive, decoupled PCPG+DVFS approach2.2

3.3

4.4

8

12

16

Freq

uenc

y (G

Hz)

On

Core

Cou

nt

Fluidanimate

0.0

8 16 32 64

Software Threads

EfficiencyPerformanceis hurt at

unacceptable levels!

• We propose a coordinated, in-order actuation of the power knobs

• PAMPA: Power-Aware Management of Processor Actuators• Dynamically “detects” if an application is single-thread performance

or throughput bound and actuates the knobs accordingly

the most aggressive, decoupled PCPG+DVFS approach

Avoids excessive performance degradation in cases where PCPG+DVFS results in conflicting decisions0.0

1.1

2.2

0

4

8

8 16 32 64

Freq

uenc

y (G

Hz)

On

Core

Cou

nt

Software Threads

On CoresFreq. (GHz)