coordinated multiprocessor frequency and folding control · 2013-12-22
TRANSCRIPT
Coordinated Multiprocessor Frequency and Folding Control

Augusto Vega¹, Alper Buyuktosunoglu¹, Heather Hanson², Pradip Bose¹, Srinivasan Ramani²
¹IBM T. J. Watson Research Center, ²IBM Systems & Technology Group

Motivation
• Modern multi-core systems incorporate support for dynamic power management with multiple actuators:
  - Dynamic voltage and frequency scaling (DVFS)
  - Core folding
  - Per-core power gating (PCPG)
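The three actuators differ in granularity: DVFS changes the processor's voltage-frequency operating point, core folding evacuates work from a core at the O/S level, and PCPG cuts power to an idle core. A minimal toy model of the three knobs (my own illustrative sketch with hypothetical names, not the poster's implementation) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessorState:
    """Toy model of the three power-management actuators (illustrative only)."""
    freq_ghz: float = 4.2                       # DVFS: current operating frequency
    total_cores: int = 16                       # e.g. two 8-core chips
    folded: set = field(default_factory=set)    # cores evacuated by the O/S
    gated: set = field(default_factory=set)     # cores power-gated (PCPG)

    def set_frequency(self, ghz: float) -> None:
        # DVFS: scale the voltage/frequency operating point
        self.freq_ghz = ghz

    def fold_core(self, core: int) -> None:
        # Core folding: the O/S stops scheduling work on this core
        self.folded.add(core)

    def gate_core(self, core: int) -> None:
        # PCPG: cut power to a core -- only safe once it is folded (idle)
        assert core in self.folded, "gate only an already-folded core"
        self.gated.add(core)

    def active_cores(self) -> int:
        return self.total_cores - len(self.folded)

state = ProcessorState()
state.fold_core(15)      # evacuate one core...
state.gate_core(15)      # ...then power-gate it
state.set_frequency(3.3) # and lower the operating frequency
print(state.active_cores(), state.freq_ghz)   # -> 15 3.3
```

The fold-before-gate ordering in the sketch reflects the general constraint that power gating is only safe on a core the scheduler has already vacated.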
Performance and Throughput Awareness
• Operate the power management knobs depending on whether an application's current execution phase is single-thread-performance or throughput bound
[Figure: independent DVFS control algorithm and PCPG control algorithm loops]

Evaluation: SPECpower Benchmark
• SPECpower benchmark: JVMs represent "warehouses" that receive requests from clients
• Load levels: 100% down to 10%
• PAMPA preserves throughput
[Figure: normalized ssj_ops across load levels for No PM, PCPG-Only, DVFS-Only, PCPG+DVFS, and PAMPA]
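As background on the metric: SPECpower_ssj2008 measures throughput (ssj_ops) at graduated target loads from 100% down to 10% plus active idle, and, to my understanding, summarizes the run as total ssj_ops divided by total average power across all levels. A sketch with made-up numbers:

```python
# Hypothetical measurements: (target load %, ssj_ops, average power in watts).
# Active idle contributes power but zero ssj_ops, so it pulls the score down
# on machines that cannot shed power when unloaded.
levels = [
    (100, 300_000, 250.0),
    (50,  150_000, 180.0),
    (10,   30_000, 120.0),
    (0,         0, 100.0),  # active idle
]

total_ops = sum(ops for _, ops, _ in levels)       # 480_000
total_watts = sum(w for _, _, w in levels)         # 650.0
overall = total_ops / total_watts
print(f"overall ssj_ops/watt = {overall:.1f}")     # -> overall ssj_ops/watt = 738.5
```

The numbers above are invented for illustration; only the shape of the calculation matters here.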
• Algorithms that control these actuators evolved independently
• Their independent operation can result in conflicting decisions that can lead to undesirable effects on performance and power
• We argue in favor of coordinated control of these actuators to avoid such potential conflicts in dynamic power management

• Throughput metric: SPECpower operations per second (ssj_ops)
• PAMPA only slightly increases system power

Sometimes Coordination Matters
We define three scenarios:
A. Symptom: all turned-on cores are highly utilized
   Diagnosis: application likely bound by throughput
   Action: turn cores on
B. Symptom: some turned-on cores are highly utilized
   Diagnosis: application likely bound by single-thread performance
   Action: increase frequency
C. Symptom: all turned-on cores are lightly utilized or idle
   Diagnosis: enough "slack" to safely decrease frequency or turn cores off
   Action: decrease frequency or turn cores off
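The three symptom/diagnosis/action scenarios above can be sketched as a small classifier. The utilization thresholds are my own illustrative choices; the poster does not give exact values:

```python
def pampa_decide(util, hi=0.90, lo=0.30):
    """Classify an execution phase per the three scenarios above.

    util: per-core utilization in [0, 1] for the currently turned-on cores.
    hi/lo: illustrative thresholds (not from the poster).
    """
    if all(u >= hi for u in util):
        return "turn cores on"            # A: likely throughput bound
    if any(u >= hi for u in util):
        return "increase frequency"       # B: likely single-thread bound
    if all(u <= lo for u in util):
        return "decrease frequency or turn cores off"  # C: enough slack
    return "hold"                         # no clear symptom this epoch

print(pampa_decide([0.95, 0.97, 0.99]))  # -> turn cores on
print(pampa_decide([0.98, 0.20, 0.15]))  # -> increase frequency
print(pampa_decide([0.10, 0.05, 0.00]))  # -> decrease frequency or turn cores off
```

Note the ordering of the tests: scenario B is only reached when *some but not all* cores are hot, because the all-hot case is caught first.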
• For this benchmark:
  - PCPG+DVFS operates properly
  - PAMPA exhibits efficiency levels close to PCPG+DVFS
[Figure: normalized system power and frequency (GHz) across load levels for No PM, PCPG-Only, DVFS-Only, PCPG+DVFS, and PAMPA]
This? Or this? What is more appropriate?
[Figure: two alternative core-utilization configurations delivering the same work]

The PAMPA Algorithm
• PAMPA: Power Aware Management of Processor Actuators
• PAMPA aggressively actuates DVFS and PCPG
[Figure: frequency (GHz) and on-core count across load levels for No PM, PCPG-Only, DVFS-Only, PCPG+DVFS, and PAMPA]
[Figure: PAMPA algorithm — coordinated control vs. decoupled control of the DVFS and PCPG "pedals"; the pedals are only enabled during stable execution phases]

Evaluation: PARSEC Benchmarks
• PARSEC benchmarks:
  - Multi-threaded
  - CPU-intensive
  - Sensitive to frequency scaling
• For these benchmarks:
  - PCPG+DVFS degrades performance excessively
  - PAMPA preserves performance by preventing frequency downscaling
[Figure: normalized execution time for Fluidanimate and Streamcluster]

Decoupled Power Management
• Power control techniques have evolved in a decoupled manner:
  - Less complexity
  - Different timescale granularities
• Their independent actuation can lead to conflicting decisions that jeopardize system power-performance efficiency
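A toy numeric illustration (my own construction, not the poster's model) of how two decoupled controllers reacting to the same signal can overshoot: if utilization looks low, a DVFS loop halves the frequency while a folding loop independently halves the core count, shedding far more capacity than either intended.

```python
# Capacity modeled as cores * frequency; the workload needs half the machine.
cores, freq = 16, 4.0
demand = 32.0                                 # work units per second

cap = cores * freq                            # 64.0 -> utilization 0.5, "low"

# Decoupled: both controllers fire in the same epoch on the same observation.
decoupled_cap = (cores // 2) * (freq / 2)     # 8 cores * 2.0 GHz = 16.0

# Coordinated, in-order: actuate one knob, re-measure before the next.
coordinated_cap = (cores // 2) * freq         # 8 cores * 4.0 GHz = 32.0

print(decoupled_cap < demand)                 # True: conflicting decisions overshoot
print(coordinated_cap >= demand)              # True: throughput is preserved
```

The arithmetic is deliberately crude, but it captures the failure mode: each controller's decision is individually reasonable, yet their combination jeopardizes performance.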
[Figure: normalized system power and energy per instruction (EPI) for Fluidanimate and Streamcluster]
• Goal: keep core utilization as high as possible
• PAMPA properly actuates DVFS and PCPG
[Figure: frequency (GHz) and on-core count vs. software threads (8, 16, 32, 64, average) for No-PM, PCPG-Only, DVFS-Only, PCPG+DVFS, and PAMPA]

Methodology
• IBM POWER7+ system with AIX O/S
• Two 8-core processors (16 cores total)
[Figure: P7+ system stack — applications on the OS (AIX) with PAMPA, the POWER Hypervisor, and two POWER7+ chips with performance counters and per-core PCPG]
• PAMPA operates at the O/S level:
  - Estimates core-level utilization at run-time
  - Actuates core folding and PCPG at the O/S level
  - Connects to the EnergyScale µcontroller to adjust processor-level voltage and frequency

Evaluation: HPC Applications
• Lulesh and CoMD:
  - Lulesh: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics
  - CoMD: a classical molecular dynamics algorithm
  - Message Passing Interface (MPI)
  - CPU-intensive and sensitive to frequency scaling
• Under decoupled control, frequency is downscaled to risky levels; PAMPA preserves performance
[Figure: normalized execution time and system power for Lulesh and CoMD]

When the Decoupled Approach Does Not Work
• A robust coordination protocol is necessary in orchestrating the power management functions
• PAMPA properly actuates DVFS and PCPG
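Putting the pieces together, PAMPA's periodic O/S-level behavior — estimate per-core utilization, then actuate one knob per epoch and re-measure — can be sketched as below. This is an illustrative reconstruction, not IBM's code; the thresholds, step sizes, and frequency range are assumptions:

```python
import random

def read_core_utilization(n_cores):
    # Stand-in for reading O/S performance counters (hypothetical source).
    return [random.random() for _ in range(n_cores)]

def pampa_loop(n_cores=16, epochs=3, hi=0.90, lo=0.30):
    """Sketch of a coordinated, in-order power-management loop.

    One actuation per epoch: unfold a core when throughput bound, raise
    frequency when single-thread bound, otherwise shed a core (fold, then
    gate) or downscale frequency when there is slack.
    """
    active, freq = n_cores, 4.2
    for _ in range(epochs):
        util = read_core_utilization(active)
        if all(u >= hi for u in util) and active < n_cores:
            active += 1                      # scenario A: turn a core on
        elif any(u >= hi for u in util):
            freq = min(freq + 0.4, 4.2)      # scenario B: increase frequency
        elif all(u <= lo for u in util):
            if active > 1:
                active -= 1                  # scenario C: fold a core, then gate it
            else:
                freq = max(freq - 0.4, 2.2)  # or downscale frequency
        # next epoch: re-measure before touching another knob
    return active, freq

print(pampa_loop())
```

The single-actuation-per-epoch structure is what distinguishes this from two decoupled loops: each knob's effect is observed before the next decision is made.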
[Figure: EPI, frequency (GHz), and system power panels]

• Fluidanimate benchmark:
  - Multi-threaded application from the PARSEC suite
  - Fluid motion physics simulation for real-time animation purposes
• We evaluate two scenarios:
  1. Only PCPG (baseline)
  2. Decoupled PCPG+DVFS

• Five scenarios evaluated:
  1. No power management
  2. PCPG only
  3. DVFS only
  4. PCPG + DVFS (baseline)
  5. PAMPA (our approach)
• Single-thread-performance- and throughput-oriented benchmarks:
  - SPECpower_ssj2008
  - Four PARSEC applications
  - Two HPC applications
• For Lulesh:
  - PCPG+DVFS degrades performance excessively
  - PAMPA preserves performance
[Figure: frequency (GHz) and on-core count over time for Lulesh]
Summary
[Figure: Fluidanimate speedup, system power, and efficiency vs. software threads, normalized to PCPG, for No-PM, PCPG-Only, DVFS-Only, PCPG+DVFS, and PAMPA]
• Modern systems incorporate multiple actuators for dynamic power management
• Algorithms that control these actuators have evolved independently
• Their independent operation can result in conflicting decisions that can lead to undesirable effects on performance and power
• We propose a coordinated, in-order actuation of the power knobs
• Why should I use PAMPA?
  - It exhibits power-performance efficiencies comparable to the most aggressive, decoupled PCPG+DVFS approach
[Figure: frequency (GHz) and on-core count for Fluidanimate vs. software threads — under decoupled PCPG+DVFS, performance is hurt at unacceptable levels]
• PAMPA: Power-Aware Management of Processor Actuators
  - Dynamically "detects" whether an application is single-thread-performance or throughput bound and actuates the knobs accordingly
  - Avoids excessive performance degradation in cases where PCPG+DVFS results in conflicting decisions