ISPASS-2011 Keynote; April 12, 2011
T.J. Watson Research Center
© 2011 IBM Corporation
Integrated Modeling Challenges in Extreme-Scale Computing
Pradip Bose
IBM T. J. Watson Research Center
Outline of Talk
▪ Introduction
– Setting the context: a view of future extreme-scale computing
– What is the primary “wall”: power or reliability?
– Why is pre-silicon modeling a grand challenge in itself?
▪ Integrated Modeling
– Power/Temperature, Performance, Reliability
– Levels of Abstraction in Integrated Modeling
• Relative versus absolute accuracy issues
– Multi-core power and reliability-aware definition; dynamic management
• Selected examples to illustrate the modeling complexities
▪ Concluding Remarks
What is Extreme Scale Computing?
▪ Exa- refers to 10^18, which is 1000x Peta-
• Exascale refers to a system that can handle a million trillion operations per second
▪ Various government agencies have identified exascale as a critical need in the 2018-2020 timeframe
▪ In scientific communities, the important operation is one floating point operation or calculation
• Exascale in this context refers to 10^18 flops
• IBM Roadrunner system: peak of 1 petaflops in 2008
- Top-ranked system in “Top500” list back in 2008/2009
• IBM’s Blue Gene product family: L, P, Q systems have consistently been dominant players in the “Top500” and “Green500” lists.
• So Exascale demands a ~1000x improvement in throughput in 10 years
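A quick check of the arithmetic behind the ~1000x figure can be sketched as follows. The helper name `annual_growth` is illustrative and not from the talk; it simply spreads the total factor evenly over the stated 10 years.

```python
# Back-of-the-envelope check of the scale factors quoted above.
PETA = 10**15
EXA = 10**18


def annual_growth(total_factor: float, years: float) -> float:
    """Compound yearly improvement needed to reach total_factor in `years`."""
    return total_factor ** (1.0 / years)


scale = EXA // PETA                  # ratio of exa to peta
per_year = annual_growth(scale, 10)  # the ~1000x spread over 10 years

print(f"exa/peta = {scale}x, about {per_year:.2f}x per year for 10 years")
```

In other words, a 1000x gain in a decade corresponds to roughly doubling throughput every year.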
Petascale and Exascale Systems
Ref: recent tutorial article by Josep Torrellas, “Architectures for Extreme Scale Computing,”
IEEE Computer, Nov. 2009, pp. 28-35
Many Examples of BIG Applications that Need Extreme Scale Computing
[Figure: application examples: Whole Organ Simulation, Low Emission Engine Design, Tumor Modeling, Smart Grid, CO2 Sequestration, Nuclear Energy, Li/Air Batteries, Smart Buildings. Inset: Li/air battery schematic with Li anode, solvated Li+ ion (aqueous case), O2, and air cathode, steps #1 to #4]
The Power Wall → Transition to New Technology
Bipolar to CMOS Transition: 50X ↑ Density, 15X ↓ Power, 3-4X ↓ Transistor Speed
Traditional CMOS to 3D CMOS: 3-10X ↑ Density, 10X ↓ Power, 3X ↓ Transistor Speed
[Figure: module heat flux (watts/cm2, 0 to 14) versus year of announcement (1950 to 2010) for vacuum-tube, bipolar, and CMOS machines, from the IBM 360, IBM 370, IBM 3033, IBM 3081, IBM 3090, and IBM ES9000 through Pentium II, Pentium 4, Merced, McKinley, Prescott, and IBM PWR4; annotation: "? Opportunity for 3D Si"]
Power-Performance Wall → Multi-Cores for the Processor Chip
[Figure: socket performance versus time for 1-, 2-, 3-, and 4-core chips; power density (a.u., 0.001 to 1000) versus gate length (microns, 1 down to 0.01) showing active power, passive power, and gate leakage, 1994 to 2005; POWER4 floorplan with L2, L3 Directory/Control, IFU, IDU, ISU, FXU, FPU, LSU, and BXU]
Homogeneous multi-core chips:
POWER4: 2001; 180 nm, Cu, SOI; 2 cores/chip
POWER4+: 130 nm
POWER5: 2004; 130 nm, Cu, SOI; 2 cores/chip; 2-way SMT/core
POWER5+: 90 nm
POWER7: 2010; 45 nm, Cu, SOI; 8 cores/chip; 4-way SMT/core
Heterogeneous multi-core chips:
The Cell Processor Chip
The PowerEN Chip, 2010
The Power Wall: A View of the Supercomputer Arena
Oxide thickness is near the limit in late CMOS design era
– Density improvements will continue but… power efficiency from technology will only improve very slowly.
– Historic trend of power efficiency improvement will slow
Nov 2009 Green 500 List: If the world’s most power efficient supercomputer is extrapolated to a sustained Exaflop (by 2018), power would be … ~ 2 GigaWatts
IBM has been a leader in large systems energy efficiency, but meeting the exascale goals is nothing short of a very grand challenge!
Blue Gene Supercomputers: National Medal of Technology & Innovation, October 2009
BG/P Compute Chip, 2007: System-on-a-Chip (SoC)
• 4 PPC-440 cores, 850 MHz
• IBM 90nm CMOS ASIC
• 173 sq. mm.
• 208 million transistors
• 16 W
Rank  MFLOPS per Watt  KiloWatts  Location                 Brand [Supercomputer]
1     722.98           59.49      Germany                  IBM [QPACE]
1     722.98           59.49      Germany                  IBM [QPACE]
1     722.98           59.49      Germany                  IBM [QPACE]
4     458.33           276        DOE/NNSA/LANL (USA)      IBM [BladeCenter QS22]
4     458.33           138        IBM Poughkeepsie (USA)   IBM [BladeCenter QS22]
6     444.25           2345.5     DOE/NNSA/LANL (USA)      IBM [BladeCenter QS22]
7     428.91           51.2       Japan
8     379.24           1484.8     China
9     378.77           504        United Arab Emirates     IBM [BlueGene/P]
9     378.77           252        France                   IBM [Blue Gene/P]
Data from: http://www.green500.org
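The extrapolation quoted on this slide can be approximated with a one-line calculation. The sketch below assumes the top-ranked efficiency in the table (722.98 MFLOPS per Watt) stays constant all the way to an exaflop; the function name is illustrative. Under that assumption alone the result is roughly 1.4 GW, the same order of magnitude as the ~2 GigaWatts on the slide. The talk does not spell out the assumptions behind its own figure.

```python
def power_watts(flops: float, mflops_per_watt: float) -> float:
    """Power needed to sustain `flops` at a fixed energy efficiency."""
    return flops / (mflops_per_watt * 1e6)


EXAFLOP = 1e18
BEST_NOV_2009 = 722.98  # MFLOPS per Watt, rank 1 on the Nov 2009 Green500 list

gigawatts = power_watts(EXAFLOP, BEST_NOV_2009) / 1e9
print(f"about {gigawatts:.2f} GW at constant efficiency")
```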
Hybrid Systems – Workload Optimized
▪ General purpose commercial servers have been on a 2X performance every 2 years curve
▪ But special-purpose HPC supercomputers have been on a ~4X performance every 2 years curve
▪ Power-efficient accelerator sub-cores for special-purpose functions constitute the vision of workload-optimized hybrid systems of the future, esp. in emerging new application domains
– Games market and the Cell multi-core heterogeneous chip was an early trend setter
Nambiar et al., TPCTC 2010, LNCS 6417, 2011
Active Power Reduction via Concurrency: The Classic Argument
Ack: Shekhar Borkar, Intel, 2005 conf. talk
▪ A key principle in use in large-scale parallel HPC systems
▪ Cost constraint for an exascale-regime system implies:
• manageable number of compute nodes → dozens of cores/chip
▪ Also, cannot forget the serial (Amdahl) component of HPC codes!
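The slide's figure is not preserved in this transcript, so the following is only a textbook-style sketch of the concurrency argument, not the speaker's own numbers. It assumes dynamic power follows C·V²·f, that supply voltage can be lowered in proportion to frequency (an idealization), and that the work is perfectly parallel. The Amdahl function then shows how a serial fraction caps the benefit. All function names are illustrative.

```python
def dynamic_power(cap: float, vdd: float, freq: float) -> float:
    """Classic active-power expression: C * V^2 * f."""
    return cap * vdd ** 2 * freq


def parallel_power_ratio(n_cores: int) -> float:
    """Power of n cores at (V/n, f/n) relative to one core at (V, f).

    Idealized: same total throughput, voltage scaling linearly with frequency.
    """
    single = dynamic_power(1.0, 1.0, 1.0)
    many = n_cores * dynamic_power(1.0, 1.0 / n_cores, 1.0 / n_cores)
    return many / single


def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Speedup limit when a fraction of the code cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


print(parallel_power_ratio(2))              # two slower cores vs. one fast core
print(round(amdahl_speedup(0.05, 64), 1))   # 5% serial code on 64 cores
```

Real voltage scaling is far more limited than this idealization, which is one reason the serial component and the power wall interact so strongly.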
Application-Driven Dynamic Resource Management
▪ Multi-dimensional tradeoff analysis and design space exploration across targeted workloads requires the support of careful, application-driven, dynamic management capability
• Power Shifting across compute, communication and storage resources
• Wear-leveling (proactive redundancy) to increase lifetime (MTBF): J. Shin et al. ISCA-2008
[Figure: relative resource utilization (0% to 100%) of Compute, Storage, and Communication across application Phases 1 to 4]
Dynamic power-gating or DVFS features needed to implement power shifting or wear-leveling mechanism
Reliability and Availability: The Other “Wall”
Hardware Failures
Software Failures
▪ Brute force techniques (checkpointing) may not be feasible due to disk bandwidth
▪ Time to checkpoint may dominate computation
▪ Need to look at reliability at the application level
Massive numbers, advanced technologies, and quantity of data produce reliability issues in both hardware and software.
▪ One million compute nodes, each with a 10 year MTBF, would constitute a system that is likely to fail every 5 minutes
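The five-minute figure follows from simple arithmetic if node failures are assumed independent with constant rates, so that failure rates add across nodes. A minimal sketch under that assumption (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def system_mtbf_minutes(node_mtbf_years: float, n_nodes: int) -> float:
    """System MTBF when independent, constant node failure rates simply add."""
    return node_mtbf_years * MINUTES_PER_YEAR / n_nodes


mtbf = system_mtbf_minutes(node_mtbf_years=10, n_nodes=1_000_000)
print(f"system MTBF is about {mtbf:.1f} minutes")
```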
[Figure: Failures / Month @ 100 TF (data from ANL Survey; ANL = Argonne National Lab) for IA64, X86, Power5, and Blue Gene, on a 0 to 800 scale; bar labels 127, 1, 394, 800. Annotation: better than 100 times lower failure rate for equivalent performance]
http://www.er.doe.gov/ASCR/ASCAC/Meetings/Aug06/Stevens.pdf
e.g. SWAT project at UIUC (Sarita Adve’s group)
Key point: processors targeted for smaller-size systems are usually not
suitable for building large-scale supercomputing systems
In fact… reliability is (quite possibly) the primary wall!
[Figure: Performance, Reliability, and Energy Efficiency as the three competing goals; speedup versus number of processors (N), showing the behavior if R_N increases with N]
R_N = MTTR/MTTF (recovery overhead) for an N-way system
(See Meeta Gupta et al., MICRO-2009 for local vs. global recovery sensitivities at chip level)
A reliability-unaware extreme scale design may not even be able to complete
a benchmark workload (e.g. Linpack), even with an unconstrained power
budget because of too frequent errors and consequent rollbacks!
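One simple way to see why speedup can collapse is sketched below. This is an illustrative model, not the one used in the talk: it assumes each node has a fixed MTTF, that the system MTTF shrinks as 1/N, and that a fraction R_N = MTTR/MTTF of the time is lost to recovery. All names and numbers are hypothetical.

```python
def recovery_overhead(n: int, mttr_hours: float, node_mttf_hours: float) -> float:
    """R_N = MTTR / MTTF, with the system MTTF assumed to be node MTTF / N."""
    return mttr_hours / (node_mttf_hours / n)


def effective_speedup(n: int, mttr_hours: float, node_mttf_hours: float) -> float:
    """Ideal speedup N scaled by the fraction of time not spent recovering."""
    useful = max(0.0, 1.0 - recovery_overhead(n, mttr_hours, node_mttf_hours))
    return n * useful


NODE_MTTF = 87600.0  # hypothetical: 10 years, in hours
MTTR = 0.5           # hypothetical: half an hour per rollback and restart

for n in (1_000, 87_600, 175_200):
    print(n, round(effective_speedup(n, MTTR, NODE_MTTF), 1))
```

With these made-up numbers, speedup peaks where R_N reaches 0.5 and falls to zero once recovery consumes all of the time, which is the situation described in the note about a workload never completing.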
Chip Level Reliability
[Figure: device fail rate (unmasked) versus technology node (90nm, 65nm, 45nm, 32nm, 22nm), with contributions from SER, Variability, Ldi/dt, and NBTI. Just a cartoon: not real data!]
▪ Chip-level functional robustness likely to decline in future
▪ Increase in transient errors and hard faults
▪ Maintaining historic levels of chip-level MTBF: cost-prohibitive
▪ Burn-in difficulty, cost due to high power regime
▪ Thermal hot spots are a new source of transient/hard failures
▪ System-level reliability targets: going to be hard to meet
▪ Two “system” examples:
▪ SoC with hundreds of core / non-core elements
▪ Large HPC system with thousands or millions of processor cores/chips [extreme scale computing]
▪ Need new cost-effective solutions across the entire h/w-s/w system design stack to meet FIT targets at any given level of “system” abstraction
▪ Design and analysis tools must evolve as well
Cost implication trend: not sustainable!
[Figure: MTTF in days (0 to 1800) versus number of cores (50,000; 100,000; 500,000; 1,000,000) for a failure rate per core of 0.5, 1.0, 1.5, 2.0, and 10.0 FITs]
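The relationship plotted in the MTTF chart can be reproduced with the standard definition of a FIT (one failure per 10^9 device-hours), assuming per-core failure rates simply add across cores. The sketch below uses the per-core rates and core counts that label the chart; the function name is illustrative.

```python
FIT_HOURS = 1e9       # 1 FIT = 1 failure per 10^9 device-hours
HOURS_PER_DAY = 24


def system_mttf_days(fit_per_core: float, n_cores: int) -> float:
    """MTTF of the whole system when per-core FIT rates add up."""
    system_fit = fit_per_core * n_cores
    return FIT_HOURS / system_fit / HOURS_PER_DAY


for fits in (0.5, 1.0, 1.5, 2.0, 10.0):
    row = [round(system_mttf_days(fits, n), 1)
           for n in (50_000, 100_000, 500_000, 1_000_000)]
    print(fits, row)
```

Even at 0.5 FITs per core, a million-core system drops to an MTTF of under three months in this simple model.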
Chip/System Level Definition (Modeling) Approaches
A few specific examples
Towards an integrated modeling infrastructure
[Diagram: toolset components and their links. PowerTimer: core-level modeling, built on a substrate cycle-accurate processor simulator (trace/exec driven simulation, fed by program traces). Power Modeling Enhancements: latch-counts + array power models, then latch-counts + scaled CPAM based models + refined array power models. Temperature Modeling: layer thermal model (heat sink, heat spreader, thermal interface material, silicon die, fin-to-air convection thermal resistor). Reliability Modeling: soft error model, architectural derating factor, data from device and circuit level. Multi-Core Power-Performance Modeling: chip-level microarchitecture modeling, to interconnect. Also: uniprocessor CPI and power sensitivities; package RLC models, Ldi/dt analysis; system interconnect and tech. scaling parameters, models; microarch design and definition; VALIDATION]
(ref: IBM Journ. R&D, Sep/Nov 2003)
• Toolset evolved: 2000-2008
• Not as integrated as one would like!
• Detailed and slow!
The Pre-Silicon Modeling Challenge in Extreme Scale Systems
▪ Why is this a grand challenge in itself?
▪ Because the constraints are multi-dimensional, interdependent and extremely hard to meet at affordable cost. Example:
• 20 MW system power
• 1 exaflops sustained performance
• MTBF of at least two weeks, preferably 1 month
▪ And, because cycle-accurate simulation speed is not scaling up
– Host hardware (simulation platform) speed is not increasing
– Number of cores and target MIPS is increasing exponentially
– Cycle-accurate performance simulators are very hard to parallelize
[Figure: three interdependent dimensions: Performance, Power, Reliability]
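To make the example constraints concrete, the sketch below converts them into a required energy efficiency and a required per-node MTBF. The node count of one million is an assumption borrowed from the reliability discussion earlier in the talk, failures are assumed independent, and the function names are illustrative.

```python
def required_gflops_per_watt(flops: float, watts: float) -> float:
    """Energy efficiency needed to hit a performance target within a power cap."""
    return flops / watts / 1e9


def required_node_mtbf_years(system_mtbf_days: float, n_nodes: int) -> float:
    """Per-node MTBF needed for a target system MTBF (independent failures)."""
    return system_mtbf_days * n_nodes / 365.0


efficiency = required_gflops_per_watt(flops=1e18, watts=20e6)
node_mtbf = required_node_mtbf_years(system_mtbf_days=14, n_nodes=1_000_000)

print(f"{efficiency:.0f} GFLOPS per Watt; node MTBF of about {node_mtbf:,.0f} years")
```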
Early Chip Planner Framework at IBM Watson
A step toward better integration of component models
Jeonghee Shin, John Darringer et al.
Phased Power Modeling Methodology
▪ Concept → HLD → Implementation Phase
[Diagram: power modeling flow. Inputs: previous generation database, current database, design technology parameters, benchmarks (e.g. SPEC), designer. Tools: MSim performance model (event & instr freq, performance validation), RTLSim (data switch factors), MPwrSCHSim (circuit power), Gator (calc CGFs; Gator table, 2000+ pstats), VHDL contract with clocking conditions (event expressions), unit level clock gating efficiency estimate, scaled architecture power models. Output: power projection for given workload]
H. Jacobson et al., HPCA-17, 2011
Power Model Requirements in the Many-Core System Era
▪ Core-level abstraction is a must (for speed)
– Facilitates multi-core DPM algorithm studies
– Also, fast power-perf tradeoff analyses for core
▪ But… detailed reference model useful for macro-wise power budgeting and tracking
– Core power projection accuracy is important
▪ POWER7 chip-specific model
– Detailed p7 reference power model
– Formal attribute selection method
– Support for microarchitecture scalability
▪ Linear regression based abstraction is a very useful technique
– H. Jacobson et al., HPCA-17, 2011
• See also previous work: Powell et al. (HPCA 2010), Lee and Brooks (ASPLOS 2006)
[Figure: model runtime and model accuracy across abstraction levels: Macro, Core, Chip, System]
Reference Power Model
▪ p7 microprocessor chip
– High frequency aggressive superscalar out-of-order design
• 32-thread, 8 core, 32kB I/D caches, 256kB L2 cache
▪ p7 core reference power model
• Suitable for macro-level power analysis, tracking
• 2300 µarch stats
• 500 RTL macros
• 2800 modeled clock/port/data gating domains
[Figure: POWER7 (p7) Core + L2]
H. Jacobson et al., HPCA-17, 2011
Power Model Abstraction
▪ Abstract model obtained through linear regression
– 15,000+ sets of event stats obtained from simulation of Spec2k6, Commercial, Multimedia, and other workloads
– Power calculated using reference model for each set of event stats
– Linear regression performed to create abstract power model
• power = C0 + C1*S1 + … + Cn*Sn
– 10/90 coverage test used to validate the final power model
[Diagram: MSim/Gator produce 15k data points (Stats/Power); regression yields the abstracted power model]
H. Jacobson et al., HPCA-17, 2011
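The regression step can be illustrated with a small, dependency-free least-squares fit of power = C0 + C1*S1 + … + Cn*Sn. Everything here is a sketch: the data are synthetic stand-ins for (event stats, reference power) pairs, the helper names are invented, and the "10/90" test is assumed to mean fitting on 10% of the samples and checking on the remaining 90%. The HPCA-17 paper describes the actual procedure.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(stats, power):
    """Least-squares coefficients [C0, C1, ..., Cn] via the normal equations."""
    rows = [[1.0] + list(s) for s in stats]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    return solve(ata, atb)


def predict(coeffs, s):
    """Evaluate C0 + sum(Ci * Si) for one set of event stats."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], s))


# Synthetic data: three made-up event stats and a noisy "reference" power.
rng = random.Random(0)
true_c = [5.0, 2.0, 0.5, 1.5]
stats = [[rng.random() for _ in range(3)] for _ in range(1000)]
power = [predict(true_c, s) + rng.gauss(0.0, 0.01) for s in stats]

# Assumed reading of the 10/90 test: fit on 10%, validate on the other 90%.
split = len(stats) // 10
coeffs = fit(stats[:split], power[:split])
max_err = max(abs(predict(coeffs, s) - p)
              for s, p in zip(stats[split:], power[split:]))
print([round(c, 2) for c in coeffs], round(max_err, 3))
```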
Attribute Relation to Power Variance
▪ A few attributes explain most of power variance
– First 8 principal component attributes explain 99% of variance
– Not necessarily the best for intuitive understanding by humans or ease of implementation
[Figures: attribute & power correlation across attributes; explained % of variance versus # of principal components]
H. Jacobson et al., HPCA-17, 2011
The Importance of Selecting the Right Attributes
▪ Single attribute (1)
– IPC fitness corr. 0.905
– Significant error spread
▪ Random attributes (8)
– Best fit corr. 0.976
– Worst fit corr. 0.109
▪ Domain experts (8)
– Expert A fit corr. 0.968
– Expert B fit corr. 0.971
[Figures: fitness correlation versus prediction error (test) for each of the three cases]
▪ Conclusion
– Need systematic approach to select high quality attributes
– See HPCA-17 paper for details
Adaptive Energy Management Features of the POWER7™ Processor (M. Floyd et al., Hot Chips-22)
* Statements regarding EnergyScale features do not imply that IBM will introduce a system with this capability
Processor Core Power Proxy: A Hardware Feature in p7
Goal: Estimate per-core chiplet power that we cannot directly measure
Method:
▪ For each functional unit, pick small subset of activities to infer power consumption (e.g. cache & regfile reads & writes, execution pipeline issue)
▪ Weight each activity to represent how much relative power it consumes
▪ Combine weighted Core, L2, and L3 activity, then add constant offset plus clock grid power to form:
Chiplet Active Power = ∑ (Wi * Ai) + C + K*f
Result:
▪ EnergyScale Firmware adjusts this value for effects of leakage, temperature, and voltage
[Figure: processor core chiplet (CORE with IFU, ISU, FXU, LSU, VSU & FPU, DFU; NCU; L2; L3) with activity sense points; labels: Core Activity, 4 events, 5 events, Power Proxy]
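The proxy expression, Chiplet Active Power = ∑ (Wi * Ai) + C + K*f, can be written as a small function. The weights, activity counts, offset, and clock-grid coefficient below are hypothetical values chosen only to make the arithmetic easy to follow; the real hardware weights are not given in the talk.

```python
def chiplet_active_power(weights, activities, offset, k_clock, freq):
    """Sum(Wi * Ai) + C + K*f, following the expression on the slide."""
    if len(weights) != len(activities):
        raise ValueError("one weight is required per activity count")
    weighted = sum(w * a for w, a in zip(weights, activities))
    return weighted + offset + k_clock * freq


# Hypothetical numbers, for illustration only.
weights = [0.5, 0.25, 1.0]      # relative power per activity (Wi)
activities = [4.0, 8.0, 2.0]    # sampled activity counts (Ai)
estimate = chiplet_active_power(weights, activities,
                                offset=3.0, k_clock=2.0, freq=3.5)
print(estimate)
```

According to the slide, firmware then adjusts such a value for leakage, temperature, and voltage; that adjustment is not modeled here.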
Hardware design was driven by power model abstraction research at IBM Watson (A. Buyuktosunoglu et al.)
IEEE Micro, 2011 (to appear)
IBM J. R&D, vol. 55, no. 3, 2011
Power Proxy Measurements
▪ EnergyScale firmware budgets power across multiple processors and memory, used to:
▪ Shift power to cores or other components (e.g. memory) that need it the most (especially important to achieve higher overall performance under a power cap)
▪ Enable Server Partition power accounting
Pitfalls of Architectural Abstractions: An Example from Soft Error Rate (SER) Analysis
[Figures: relative error (0% to 100%) of the estimate as a function of N*S and C (2, 8, 5K, 50K, 500K), for gzip (N*S from 2.0E10 to 5.0E12) and for a Day workload (N*S from 1.0E5 to 1.0E10)]
Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions ,
X. Li, S. V. Adve, P. Bose, and J. A. Rivers, Proc. of the Int’l. Conf. on Dependable Systems and Networks (DSN),
June 2007.
System SER = ∑ [AVF(i) * Raw_SER(i)] (the AVF + SOFR abstraction; AVF = Architectural Vulnerability Factor, SOFR = sum of failure rates)
Errors in AVF+SOFR-based estimation get very large when the number of modeled cores, C, in the system becomes very large, or if the raw error rate of each of the N cores becomes very large
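For reference, the AVF + SOFR abstraction itself is only a weighted sum, as the sketch below shows. The component values are hypothetical. As the slide warns, this estimate becomes unreliable at very large component counts or raw error rates, which this simple function does not capture.

```python
def system_ser(components):
    """AVF + SOFR estimate: sum of AVF(i) * Raw_SER(i) over all components.

    `components` is an iterable of (avf, raw_ser) pairs.
    """
    return sum(avf * raw_ser for avf, raw_ser in components)


# Hypothetical (AVF, raw SER) pairs for three structures.
example = [(0.25, 400.0), (0.5, 100.0), (0.125, 800.0)]
print(system_ser(example))
```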
Power Model Calibration/Validation Methodology
[Diagram: calibration loop. A parameter file (e.g. FXU utilization target: 30%) drives HotGen test case generation; the resulting microbenchmark is run in simulation on the integrated model (Power, Temp, Perf) and in measurement on SIMP (actual chip with IR camera); the two results are compared and used to calibrate the model]
Zhigang Hu et al. 2005-06
H. Hamann et al. JSSC, Jan 07
POWER5 Hotspot Patterns
[Figure: thermal map and power map]
- 50 different workloads for POWER5 imaged & analyzed
• HotGen microbenchmark generator tool
- observed significant differences in circuit utilization
(H. Hamann et al., ISSCC-2006)
Optimal Pipeline Depth: TPCC Workload
[Figure: bips and bips^3/W, relative to optimal (0 to 1), versus total FO4 per stage (7 to 37); the power-performance optimal and performance optimal points are marked]
V. Srinivasan et al., MICRO-2002
V. Zyuban et al., IEEETC, 8/2004
The optimum moves to deeper pipeline depth for SPEC workloads.
Note: The optimal point on the x-axis is the important output of such an analysis model; absolute accuracy of the y-axis value is not very important!
CMP Space Exploration Results
[Figure: BIPS (0 to 10) versus number of cores (2 to 20) for configuration 2MB/18FO4/4; 400mm2, cheap thermal package, CPU bound benchmark. Annotation: the optimal core-count for a given core type]
Yingmin Li, Zhigang Hu et al., HPCA 2006
Analytical or hybrid models do quite well in such scenarios
Chip-level Lifetime Reliability Analysis
[Figure panels: Floorplan (L2, L2C, L3DIR, FPU, FXU, ISU, IFU, LSU, BRU, NCU, MC, GX, FBC); Power (W); Temperature (°C); FIT due to EM; FIT due to NBTI; FIT due to TDDB]
Jeonghee Shin et al., DSN-2007, ISCA-2008
Power-Performance Tradeoffs (on-chip, global power management; DVFS): A Key Modeling Challenge!
▪ MaxBIPS within 1% of Oracle
▪ Verification complexity of multi-core power management algorithms (scalability) is a key issue [A. Lungu et al., MEMOCODE 2009]
[Figures: performance degradation (0% to 14%) and power (57% to 97%) versus power budget (60% to 100%) for the Priority, pullHi_pushLo, MaxBIPS, and Oracle policies]
C. Isci, A. Buyuktosunoglu et al.
MICRO-39, 2006
Activity migration [temperature-aware task scheduling]
reduces maximum on-chip temperatures
(a) DAXPY running on core 0
(b) DAXPY running on core 1
(c) DAXPY hopping every 7ms
Chip designs could leverage the
lower temperatures for higher
frequencies, lower-cost packaging or
enhanced reliability.
Scale: 1 ~ 3.3 Celsius
J. Choi, C-Y. Cher et al., ISLPED07
Leveraging Spatial Heat Slack
Activity Migration reduces Hotspots
Summary: Core-hopping (4 ms) reduces maximum on-chip temperature
[Figure: reduction in temperatures (Celsius, 0.0 to 10.0), maximum delta temperature, for workloads daxpy, apsi, fma3d, lucas, swim, bzip2, twolf, vortex, vpr. Bar labels: 5.5, 4.2, 3.3, 4.9, 5.1, 2.2, 2.3, 2.0, 3.5. % slow down row: 2.5, 0.9, 1.6, 1.1, 1.0, 0.4, -0.5, -1.1, 0.1]
J. Choi, C-Y. Cher et al., ISLPED07
Measurement-based analysis; very hard to project accurately via simulation
Power Gating as a Dynamic Management Knob
▪ Power Gating (PG) is becoming an essential actuation knob for dynamic power management
▪ Header or footer transistor gates off power to the “macro” during idle durations
▪ Applied at core-level (per-core PG) or within a core at the unit-level
▪ PG is applicable to a broad range of compute nodes that exhibit variable idle times
▪ Mobile, Desktop, Enterprise etc.
▪ Our end target is efficiency at all levels: from chips, all the way through to the data center level
[Figure: header transistor implementation: Vdd, Sleep signal, Virtual Vdd, logic block]
Methodology for Core-Level Power Gating Analysis
▪ Use bit-vector traces (utilization) from instrumented cycle-accurate perf. simulator
▪ Workloads: SPEC, other traces
▪ Implement trace driven simulator for power gating algorithms, obtain:
▪ Leakage power savings estimate
▪ Projected performance impact
▪ Assume constant performance impact of 3 cycles on wake-up
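A toy version of such a trace-driven simulator is sketched below for a single unit and a simple idle-detect policy. The accounting model is an assumption made for illustration: each gated cycle saves one unit of leakage, each gating event costs `break_even` cycles' worth of leakage, and each wake-up costs the 3-cycle penalty mentioned above. It is not the simulator from the ISLPED-2009 paper.

```python
def simulate_idle_detect(trace, idle_detect, break_even, wakeup_penalty=3):
    """Replay a utilization bit-vector (1 = busy, 0 = idle) for one unit.

    The unit is gated after `idle_detect` consecutive idle cycles and is
    woken by the next busy cycle.
    """
    gated = False
    idle_run = 0
    gated_cycles = 0
    gate_events = 0
    wakeups = 0
    for busy in trace:
        if busy:
            if gated:
                wakeups += 1       # unit must be powered back on
                gated = False
            idle_run = 0
        else:
            idle_run += 1
            if gated:
                gated_cycles += 1  # leakage saved during this cycle
            elif idle_run >= idle_detect:
                gated = True
                gate_events += 1
    net = gated_cycles - gate_events * break_even
    return {
        "leakage_savings_pct": 100.0 * net / len(trace),
        "perf_penalty_cycles": wakeups * wakeup_penalty,
        "wakeups": wakeups,
    }


# Hypothetical traces: short idle gaps (13 cycles) versus long ones (100 cycles).
short_gaps = ([1] * 5 + [0] * 13) * 10
long_gaps = ([1] * 5 + [0] * 100) * 10
print(simulate_idle_detect(short_gaps, idle_detect=3, break_even=20))
print(simulate_idle_detect(long_gaps, idle_detect=3, break_even=20))
```

With short idle gaps the policy gates too eagerly and the net saving turns negative, while long gaps give a large positive saving.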
[Diagram: benchmark → instrumented P6-like simulator → utilization bit-vector traces (one per unit: FXU0, FXU1, FPU0, FPU1, LSU0, LSU1, ...) → C simulator of power gating algorithms → leakage power saving estimate and projected performance impact]
A. Lungu et al., ISLPED-2009
Power Savings Potential for Power Gating of Functional Units
[Figure: power gate potential (% leakage savings, 0 to 100) as a function of break-even point (24 down to 0) for FXU0 and FXU1 units; groups: FXU0 FP benchmarks, FXU0 INT benchmarks, FXU1 FP benchmarks, FXU1 INT benchmarks; bar labels 57.64, 45.87, 80.35, 62.38]
[Figure: power gate potential (% leakage savings, 0 to 100) as a function of break-even point for LSU0 and LSU1 units; groups: LSU0 FP benchmarks, LSU0 INT benchmarks, LSU1 FP benchmarks, LSU1 INT benchmarks; bar labels 39.77, 46.85, 60.66, 65.26]
Large Potential for Power Gating!
A. Lungu et al., ISLPED-2009
Pitfalls of Current Power Gating Algorithms
▪ Idle interval prediction can be consistently wrong:
▪ => power gating algorithm consistently wastes power instead of saving
▪ Possible scenarios in loops
▪ Idle monitor failure
▪ Idle detect 3, break-even 20
▪ Average leakage power loss 100%
▪ Utilization monitor failure
▪ Utilization threshold 30%
▪ Average leakage power loss 98.5%
[Figures: utilization pattern (0 or 1) over cycles 1 to 21; % leakage power savings per cycle and average savings for the Idle Monitor Algorithm (average -100) and the Utilization Monitor Algorithm (average -98.58), on a -400 to 400 scale]
A. Lungu et al., ISLPED-2009
Single Level Idle Detect Power Gating Algorithm
[Figure: projected performance impact of idle counter solution (FP benchmarks: namd, calculix, dealII, lbm, bwaves, cactusADM, gamess, gemsFDTD, gromacs, leslie3d, milc, povray, soplex, sphinx3, tonto, wrf, zeusmp, Avg); % projected performance loss (0 to 20); labeled value 11.52]
[Figure: power savings of idle counter solution as a function of idle_detect for FXU0 unit (same FP benchmarks); % leakage savings (-20 to 100); labeled values 28.81, -8.17, 34.93, 57.64; annotation: Inefficient Behavior]
Legend: idle detect 15, 13, 11, 9, 7, 5, 3; Oracle; Cycle by Cycle
A. Lungu et al., ISLPED-2009
Two Level (Guarded) Power Gating Algorithms
▪ Observations:
▪ Efficiency requirement of power saving schemes: save power
▪ Single level idle prediction algorithms can behave incorrectly and waste power
▪ Target:
▪ Improve quality of power gating schemes by reducing or eliminating their risk of wasting power
▪ Idea:
▪ Add second level monitor to control enabling of power gating scheme
▪ Improve efficiency of power wasting cases without degrading power saving of the common case
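The idea of a second-level guard can be sketched as follows. This is an illustrative reading, not the algorithm from the ISLPED-2009 paper: level 1 gates after `idle_detect` idle cycles, while level 2 counts, over a window of idle intervals, how many gating decisions would have been compensated (gated time at least `break_even`) versus uncompensated, and enables gating for the next window only if compensated cases are at least as frequent. The window size and the shadow counting while gating is disabled are assumptions.

```python
def guarded_gating(idle_intervals, idle_detect, break_even, window=8, guard=True):
    """Net leakage saving (in cycles) of idle-detect gating with an optional guard."""
    enabled = True
    net_saving = 0
    compensated = uncompensated = 0
    for i, idle_len in enumerate(idle_intervals, start=1):
        gated_len = idle_len - idle_detect   # cycles gated if level 1 fires
        if gated_len > 0:
            # Level 2 bookkeeping, kept even while gating is disabled.
            if gated_len >= break_even:
                compensated += 1
            else:
                uncompensated += 1
            if enabled:
                net_saving += gated_len - break_even
        if guard and i % window == 0:
            enabled = compensated >= uncompensated
            compensated = uncompensated = 0
    return net_saving


bad_phase = [13] * 64     # hypothetical: idle gaps too short to pay off
good_phase = [200] * 64   # hypothetical: long idle gaps

print(guarded_gating(bad_phase, 3, 20, guard=False), guarded_gating(bad_phase, 3, 20))
print(guarded_gating(good_phase, 3, 20, guard=False), guarded_gating(good_phase, 3, 20))
```

In this toy setting the guard limits the waste in the bad phase to the first window and leaves the savings of the good phase untouched.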
[Diagram: Level 2 (Monitor & Control): efficiency counters (Cnt1++, Cnt2++) estimate power savings and drive the Enable decision (Enable = 1 or Enable = 0). Level 1 (Actuate): states On, Off_U, Off_C]
Off_U: Power gated, uncompensated
Off_C: Power gated, compensated
A. Lungu et al., ISLPED-2009
Power Gating in a Datacenter Setting
[Diagram: incoming tasks arrive over the NETWORK from the datacenter INFRASTRUCTURE; a power gating module observes resource utilization and the idle & burst distribution and drives PWR ON/OFF for Core1, Core2, ..., CoreN (#cores ON/OFF, unit-level PG)]
N. Madan et al., HPCA-17, 2011
Problems with Core-Level Power Gating
[Figure: utilization versus time with points t1 to t4 marking when the power gating module decides to PG and when cores wake up; the power gating module drives PWR ON/OFF for Core1, Core2, ..., CoreN]
Cannot be aggressive with PG as penalties can be huge (Aggressive PG: BAD!)
Cannot be overly conservative as power saving potential is lost (Conservative PG: BAD!)
N. Madan et al., HPCA-17, 2011
Augmenting Core-Level Power Gating with Guarding
[Diagram: incoming tasks arrive over the NETWORK; the power gating module uses resource utilization and the idle & burst distribution to drive PWR ON/OFF for Core1, Core2, ..., CoreN (#cores ON/OFF, unit-level PG); a guarded gating module monitors perf loss % and #wake-ups and (dis/en)ables gating]
N. Madan et al., HPCA-17, 2011
Proposed Guard Mechanism
▪ Monitor system response time
▪ Response time can be very high when the system is overly utilized
▪ Monitor number of core wake-ups
▪ Wake-up latency and switching power can be negligible too
▪ Only if both monitors show unacceptable behavior
▪ Disable power manager
▪ Re-enable power manager after a programmable time period
▪ Alert the system manager
[Diagram: guard mechanism. Monitor 1: performance (response time) and #core wake-ups. Monitor 2: safe workload conditions; sets Enable = 0 or Enable = 1 for the power gating manager and increments Count. Monitor 3: frequency of enable/disable (Count); informs the system administrator]
See N. Madan et al., HPCA-17, 2011 for Evaluation Results
More coverage at: Energy-Secure Architectures: Tutorial at ISCA-2011
Queuing Model Based Evaluation Framework
[Diagram: incoming tasks queue for Core1, Core2, ..., CoreN; tasks with expired time slice return to the queue; the power gating module (IdlePG, UtilPG) sets #cores ON/OFF via PWR ON/OFF]
See N. Madan et al., HPCA-17, 2011 for evaluation results
Concluding Remarks
▪ Power and Reliability Walls are Key Impediments to Realization of Extreme Scale Computing Targets of the Future
– Reliability may well be the more fundamental obstacle beyond a certain size of the system
▪ Integrated Models (power/temperature, performance, reliability) are a Grand Challenge
– Analytical abstraction methods are essential for speed
– Yet, accuracy requirements at core/chip and other component level are more stringent than ever because of the implications of the huge scale (system size)
Thank you!