
ISPASS-2011 Keynote; April 12, 2011

T.J. Watson Research Center

© 2011 IBM Corporation

Integrated Modeling Challenges in Extreme-Scale Computing

Pradip Bose

IBM T. J. Watson Research Center




Outline of Talk

■ Introduction
  – Setting the context: a view of future extreme-scale computing
  – What is the primary “wall”: power or reliability?
  – Why is pre-silicon modeling a grand challenge in itself?
■ Integrated Modeling
  – Power/Temperature, Performance, Reliability
  – Levels of Abstraction in Integrated Modeling
    • Relative versus absolute accuracy issues
  – Multi-core power and reliability-aware definition; dynamic management
    • Selected examples to illustrate the modeling complexities
■ Concluding Remarks


What is Extreme Scale Computing?

■ Exa- refers to 10^18, which is 1000x Peta-
  • Exascale refers to a system that can handle a million trillion operations per second
■ Various government agencies have identified exascale as a critical need in the 2018-2020 timeframe
■ In scientific communities, the important operation is one floating point operation or calculation
  • Exascale in this context refers to 10^18 flops
  • IBM Roadrunner system: peak of 1 petaflops in 2008
    - Top-ranked system in “Top500” list back in 2008/2009
  • IBM’s Blue Gene product family: L, P, Q systems have consistently been dominant players in the “Top500” and “Green500” lists.
  • So Exascale demands a ~1000x improvement in throughput in 10 years

Petascale and Exascale Systems

Ref: recent tutorial article by Josep Torrellas, “Architectures for Extreme Scale Computing,” IEEE Computer, Nov. 2009, pp. 28-35


Many Examples of BIG Applications that Need Extreme Scale Computing

• Whole Organ Simulation
• Low Emission Engine Design
• Tumor Modeling
• Smart Grid
• CO2 Sequestration
• Nuclear Energy
• Li/Air Batteries
• Smart Buildings

[Figure: application images, panels #1 to #4, including a Li/air battery schematic with Li anode, solvated Li+ ion (aqueous case), O2, and air cathode.]



The Power Wall → Transition to New Technology

[Figure: module heat flux (watts/cm^2, 0 to 14) versus year of announcement (1950 to 2010) for vacuum tube, bipolar, and CMOS machines. Machines plotted: IBM 360, IBM 370, IBM 3033, IBM ES9000, Fujitsu VP2000, IBM 3090S, NTT, Fujitsu M-780, IBM 3090, CDC Cyber 205, IBM 4381, IBM 3081, Fujitsu M380, IBM RY4, IBM RY5, IBM RY6, IBM RY7, IBM PWR4, Apache, Pulsar, Merced, Pentium II (DSIP), T-Rex, Squadrons, Pentium 4, McKinley, Prescott, Jayhawk (dual).]

Bipolar to CMOS Transition
• 15X ↓ Power
• 3-4X ↓ Transistor Speed
• 50X ↑ Density

Traditional CMOS to 3D CMOS (? Opportunity for 3D Si)
• 10X ↓ Power
• 3X ↓ Transistor Speed
• 3-10X ↑ Density

Power-Performance Wall → Multi-Cores for the Processor Chip

[Figure: socket performance versus time for 1-, 2-, 3- and 4-core chips. Inset: power density (a.u., 0.001 to 1000) versus gate length (microns, 1 to 0.01), 1994 to 2005, with curves for active power, passive power, and gate leakage.]

[Figure: dual-core chip floorplan with units L3 Directory/Control, L2, LSU, IFU, BXU, IDU, FPU, FXU, ISU.]

Homogeneous multi-core chips
• POWER4: 2001; 180 nm, Cu, SOI; 2 cores / chip
• POWER4+: 130 nm
• POWER5: 2004; 130 nm, Cu, SOI; 2 cores / chip; 2-way SMT / core
• POWER5+: 90 nm
• POWER7: 2010; 45 nm, Cu, SOI; 8 cores / chip; 4-way SMT / core

Heterogeneous multi-core chips
• The Cell Processor Chip
• The PowerEN Chip, 2010

The Power Wall: A View of the Supercomputer Arena

Oxide thickness is near the limit in late CMOS design era
– Density improvements will continue but… power efficiency from technology will only improve very slowly.
– Historic trend of power efficiency improvement will slow

Nov 2009 Green 500 List: If the world’s most power efficient supercomputer is extrapolated to a sustained Exaflop (by 2018), power would be … ~ 2 GigaWatts

IBM has been a leader in large systems energy efficiency, but meeting the exascale goals is nothing short of a very grand challenge!

Blue Gene Supercomputers: National Medal of Technology & Innovation, October 2009

BG/P Compute Chip, 2007: System-on-a-Chip (SoC)
• 4 PPC-440 cores, 850 MHz
• IBM 90nm CMOS ASIC
• 173 sq. mm.
• 208 million transistors
• 16 W

Rank  MFLOPS per Watt  KiloWatts  Location                  Brand [Supercomputer]
1     722.98           59.49      Germany                   IBM [QPACE]
4     458.33           276        DOE/NNSA/LANL (USA)       IBM [BladeCenter QS22]
4     458.33           138        IBM Poughkeepsie (USA)    IBM [BladeCenter QS22]
6     444.25           2345.5     DOE/NNSA/LANL (USA)       IBM [BladeCenter QS22]
7     428.91           51.2       Japan                     -
8     379.24           1484.8     China                     -
9     378.77           504        United Arab Emirates      IBM [BlueGene/P]
9     378.77           252        France                    IBM [Blue Gene/P]

Data from: http://www.green500.org
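The extrapolation on this slide can be sketched as a single division. The snippet below is a minimal illustration, not the speaker's calculation: it takes the 722.98 MFLOPS per Watt figure of the rank-1 QPACE entry and asks how much power an exaflop would need at that efficiency. The function name is hypothetical. The straight division lands near 1.4 GW; the slide's "~2 GigaWatts" for a sustained exaflop presumably folds in further factors that the slide does not itemize.

```python
def extrapolated_power_watts(target_flops: float, mflops_per_watt: float) -> float:
    """Power needed to reach target_flops at a fixed energy efficiency."""
    flops_per_watt = mflops_per_watt * 1e6
    return target_flops / flops_per_watt


EXAFLOP = 1e18  # 10^18 floating point operations per second

# 722.98 MFLOPS/W: rank-1 entry (QPACE) of the Nov 2009 Green500 list.
power_w = extrapolated_power_watts(EXAFLOP, 722.98)
print(f"{power_w / 1e9:.2f} GW")  # prints "1.38 GW"
```

Either figure is roughly two orders of magnitude above a power budget in the tens of megawatts, which is the point of the slide.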


Hybrid Systems – Workload Optimized

■ General purpose commercial servers have been on a 2X performance every 2 years curve

■ But special-purpose HPC supercomputers have been on a ~4X performance every 2 years curve

■ Power-efficient accelerator sub-cores for special-purpose functions constitute the vision of workload-optimized hybrid systems of the future, esp. in emerging new application domains

  – The games market, with the Cell multi-core heterogeneous chip, was an early trend setter

Nambiar et al., TPCTC 2010, LNCS 6417, 2011




Active Power Reduction via Concurrency: The Classic Argument

Ack: Shekhar Borkar, Intel, 2005 conf. talk

■ A key principle in use in large-scale parallel HPC systems
■ Cost constraint for an exascale-regime system implies:
  • manageable number of compute nodes → dozens of cores/chip
■ Also, cannot forget the serial (Amdahl) component of HPC codes!
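The classic argument can be made concrete with a few lines of arithmetic. The sketch below assumes the textbook dynamic power relation P = C * V^2 * f and Amdahl's law; the function names and the choice to scale voltage by the same factor as frequency are illustrative assumptions, and real voltage scaling is far more limited than this idealization.

```python
def dynamic_power(cap: float, vdd: float, freq: float) -> float:
    """Switching power of CMOS logic: P = C * V^2 * f."""
    return cap * vdd ** 2 * freq


def amdahl_speedup(n_cores: int, serial_fraction: float) -> float:
    """Speedup on n_cores when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def compare(n_cores: int, freq_scale: float, vdd_scale: float, serial_fraction: float):
    """Return (relative power, relative throughput) versus one core at nominal V and f."""
    rel_power = n_cores * dynamic_power(1.0, vdd_scale, freq_scale)
    rel_throughput = freq_scale * amdahl_speedup(n_cores, serial_fraction)
    return rel_power, rel_throughput


# Two cores at half frequency and (idealized) half voltage, perfectly parallel code.
ideal = compare(2, 0.5, 0.5, 0.0)
# Same hardware, but 10% of the work is serial.
with_serial = compare(2, 0.5, 0.5, 0.1)
print(ideal, with_serial)
```

In the idealized case the two slower cores match the single core's throughput at a quarter of the active power; with a 10% serial component the throughput drops to about 0.91 of the baseline, which is the Amdahl caveat in the last bullet.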



Application-Driven Dynamic Resource Management

■ Multi-dimensional tradeoff analysis and design space exploration across targeted workloads requires the support of careful, application-driven, dynamic management capability
  • Power Shifting across compute, communication and storage resources
  • Wear-leveling (proactive redundancy) to increase lifetime (MTBF): J. Shin et al., ISCA-2008

Dynamic power-gating or DVFS features needed to implement power shifting or wear-leveling mechanism

[Figure: relative resource utilization (0% to 100%) of compute, storage, and communication across application phases 1 to 4.]


Reliability and Availability: The Other “Wall”

Hardware Failures

Software Failures

Massive numbers, advanced technologies, and quantity of data produce reliability issues in both hardware and software.

■ One million compute nodes, each with a 10 year MTBF, would constitute a system that is likely to fail every 5 minutes
■ Brute force techniques (checkpointing) may not be feasible due to disk bandwidth
■ Time to checkpoint may dominate computation
■ Need to look at reliability at the application level, e.g. SWAT project at UIUC (Sarita Adve’s group)

[Figure: Failures / Month @ 100 TF (data from ANL Survey; ANL = Argonne National Lab) for IA64, X86, Power5, and Blue Gene; y-axis 0 to 800; bar labels as printed: 127, 1, 394, 800. Annotation: better than 100 times lower failure rate for equivalent performance.]

http://www.er.doe.gov/ASCR/ASCAC/Meetings/Aug06/Stevens.pdf

Key point: processors targeted for smaller-size systems are usually not suitable for building large-scale supercomputing systems
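The "fails every 5 minutes" figure follows from dividing the node MTBF by the node count. The sketch below assumes independent nodes with constant failure rates and a system that fails whenever any single node fails; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def system_mtbf_minutes(node_mtbf_years: float, n_nodes: int) -> float:
    """System MTBF when any one of n_nodes independent node failures stops the system."""
    return node_mtbf_years * MINUTES_PER_YEAR / n_nodes


mtbf = system_mtbf_minutes(10, 1_000_000)
print(f"{mtbf:.2f} minutes")  # prints "5.26 minutes"
```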



In fact… reliability is (quite possibly) the primary wall!

[Figure: speedup versus number of processors (N) for the case where R_N increases with N. Labels: Performance, Reliability, Energy Efficiency.]

R_N = MTTR/MTTF (recovery overhead) for an N-way system

(See Meeta Gupta et al., MICRO-2009 for local vs. global recovery sensitivities at chip level)

A reliability-unaware extreme scale design may not even be able to complete a benchmark workload (e.g. Linpack), even with an unconstrained power budget, because of too frequent errors and consequent rollbacks!



Chip Level Reliability

[Figure: device fail rate (unmasked) versus technology node (90nm, 65nm, 45nm, 32nm, 22nm), with contributions from SER, variability, Ldi/dt, and NBTI. Just a cartoon: not real data!]

■ Chip-level functional robustness likely to decline in future
■ Increase in transient errors and hard faults
■ Maintaining historic levels of chip-level MTBF: cost-prohibitive
■ Burn-in difficulty, cost due to high power regime
■ Thermal hot spots are a new source of transient/hard failures
■ System-level reliability targets: going to be hard to meet
■ Two “system” examples:
  – SoC with hundreds of core / non-core elements
  – Large HPC system with thousands or millions of processor cores/chips [extreme scale computing]
■ Need new cost-effective solutions across the entire h/w-s/w system design stack to meet FIT targets at any given level of “system” abstraction
■ Design and analysis tools must evolve as well

Cost implication trend: not sustainable!

[Figure: MTTF in days (0 to 1800) versus number of cores (50,000; 100,000; 500,000; 1,000,000), one series per failure rate for each core: 0.5, 1.0, 1.5, 2.0, and 10.0 FITs.]
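The shape of this chart can be reproduced from the definition of FIT (one failure per 10^9 device-hours). The sketch below assumes that per-core failure rates simply add, so the system FIT is the per-core FIT times the core count; the function name is hypothetical.

```python
HOURS_PER_DAY = 24
FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device-hours


def system_mttf_days(fit_per_core: float, n_cores: int) -> float:
    """System MTTF in days, assuming per-core failure rates add up."""
    system_fit = fit_per_core * n_cores
    return FIT_HOURS / system_fit / HOURS_PER_DAY


for n_cores in (50_000, 100_000, 500_000, 1_000_000):
    row = [round(system_mttf_days(fit, n_cores), 1) for fit in (0.5, 1.0, 1.5, 2.0, 10.0)]
    print(n_cores, row)
```

At 0.5 FIT per core and 50,000 cores the result is about 1667 days, consistent with the top of the chart's 0 to 1800 day axis; at 10 FITs per core and one million cores it falls to roughly four days.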



Chip/System Level Definition (Modeling) Approaches

A few specific examples



Towards an integrated modeling infrastructure

[Figure: block diagram of the toolset. Blocks and labels: Substrate Processor Simulator; trace/exec driven simulation; PowerTimer: core-level modeling; Power Modeling Enhancements; latch-counts + array power models; latch-counts + scaled CPAM based models + refined array power models; Uniprocessor CPI and Power sensitivities; Package RLC models, Ldi/dt analysis; Temperature Modeling (layer thermal model: silicon die, thermal interface material, heat spreader, heat sink, fin-to-air convection thermal resistor); Reliability Modeling (soft error model: data from device and circuit level, program traces, architectural derating factor, cycle acc. processor simulator); Multi-Core Power-Performance Modeling; chip-level microarchitecture modeling (floorplan with cores and L2 blocks, to interconnect); system interconnect and tech. scaling parameters, models; microarch design and definition; VALIDATION.]

(ref: IBM Journ. R&D, Sep/Nov 2003)

• Toolset evolved: 2000-2008
• Not as integrated as one would like!
• Detailed and slow!



The Pre-Silicon Modeling Challenge in Extreme Scale Systems

■ Why is this a grand challenge in itself?
■ Because the constraints are multi-dimensional, interdependent and extremely hard to meet at affordable cost. Example:
  • 20 MW system power
  • 1 exaflops sustained performance
  • MTBF of at least two weeks, preferably 1 month
■ And, because cycle-accurate simulation speed is not scaling up
  – Host hardware (simulation platform) speed is not increasing
  – Number of cores and target MIPS is increasing exponentially
  – Cycle-accurate performance simulators are very hard to parallelize

[Figure: the three interdependent constraints: Performance, Power, Reliability.]



Early Chip Planner Framework at IBM Watson

A step toward better integration of component models

Jeonghee Shin, John Darringer et al.


Phased Power Modeling Methodology

■ Concept → HLD → Implementation Phase

[Figure: methodology flow. Blocks: Previous Generation Database; Scaled Architecture Power Models; MPwr; SCHSim (circuit power); RTLSim (data switch factors); Benchmarks (e.g. SPEC); MSim performance model; Designer; Performance Validation; Event & Instr Freq; Design Technology Parameters; Gator (calc CGFs); Unit Level Clock Gating Efficiency Estimate; Clocking Conditions (event expressions); VHDL Contract; Gator Table; 2000+ pstats; Current Database; Power Projection for Given Workload.]

H. Jacobson et al., HPCA-17, 2011


Power Model Requirements in the Many-Core System Era

■ Core-level abstraction is a must (for speed)
  – Facilitates multi-core DPM algorithm studies
  – Also, fast power-perf tradeoff analyses for core
■ But… detailed reference model useful for macro-wise power budgeting and tracking
  – Core power projection accuracy is important
■ POWER7 chip-specific model
  – Detailed p7 reference power model
  – Formal attribute selection method
  – Support for microarchitecture scalability
■ Linear regression based abstraction is a very useful technique
  – H. Jacobson et al., HPCA-17, 2011
    • See also previous work: Powell et al. (HPCA 2010), Lee and Brooks (ASPLOS 2006)

[Figure: model runtime and model accuracy across abstraction levels: Macro, Core, Chip, System.]


Reference Power Model

■ p7 microprocessor chip
  – High frequency aggressive superscalar out-of-order design
    • 32-thread, 8 core, 32kB I/D caches, 256kB L2 cache
■ p7 core reference power model
    • Suitable for macro-level power analysis, tracking
    • 2300 µarch stats
    • 500 RTL macros
    • 2800 modeled clock/port/data gating domains

[Figure: POWER7 (p7) Core + L2]



Power Model Abstraction

■ Abstract model obtained through linear regression
  – 15,000+ sets of event stats obtained from simulation of Spec2k6, Commercial, Multimedia, and other workloads
  – Power calculated using reference model for each set of event stats
  – Linear regression performed to create abstract power model
    • power = C0 + C1*S1 + … + Cn*Sn
  – 10/90 coverage test used to validate the final power model

[Figure: flow from MSim/Gator (stats and power, 15k data points) through regression (coefficients C) to the abstracted power model.]
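The regression step can be sketched as an ordinary least-squares fit of power = C0 + C1*S1 + … + Cn*Sn. The code below is a minimal pure-Python illustration on synthetic data, not the authors' tooling: the data, the coefficient values, and all function names are hypothetical, and the slide does not define the "10/90 coverage test", so the split used here (fit on 10% of the points, check on the other 90%) is only one assumed reading.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_power_model(stats, power):
    """Least-squares fit of power = C0 + C1*S1 + ... + Cn*Sn via the normal equations."""
    rows = [[1.0] + list(s) for s in stats]  # leading 1.0 carries the constant C0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, s):
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], s))


# Synthetic stand-in for (event stats, reference-model power) pairs.
rng = random.Random(0)
true_coeffs = [5.0, 2.0, 0.5, 1.5]
stats = [[rng.random() for _ in range(3)] for _ in range(1000)]
power = [predict(true_coeffs, s) + rng.gauss(0, 0.01) for s in stats]

train_n = len(stats) // 10  # assumed reading: fit on 10%, check on the other 90%
coeffs = fit_power_model(stats[:train_n], power[:train_n])
errors = [abs(predict(coeffs, s) - p) for s, p in zip(stats[train_n:], power[train_n:])]
max_error = max(errors)
print([round(c, 2) for c in coeffs], round(max_error, 3))
```

With thousands of candidate stats, the hard part is choosing which few S_i to keep rather than solving the fit, which is where the formal attribute selection method comes in.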



Attribute Relation to Power Variance

■ A few attributes explain most of power variance
  – First 8 principal component attributes explain 99% of variance
  – Not necessarily the best for intuitive understanding by humans or ease of implementation

[Figure: attribute and power correlation (0 to 1) across attributes, and explained % of variance versus # of principal components.]



The Importance of Selecting the Right Attributes

■ Single attribute (1)
  – IPC fitness corr. 0.905
  – Significant error spread
■ Random attributes (8)
  – Best fit corr. 0.976
  – Worst fit corr. 0.109
■ Domain experts (8)
  – Expert A fit corr. 0.968
  – Expert B fit corr. 0.971
■ Conclusion
  – Need systematic approach to select high quality attributes
  – See HPCA-17 paper for details

[Figure: fitness correlation versus prediction error (test) for each attribute selection method.]

Adaptive Energy Management Features of the POWER7™ Processor (M. Floyd et al., Hot Chips-22)

* Statements regarding EnergyScale features do not imply that IBM will introduce a system with this capability

Processor Core Power Proxy: A Hardware Feature in p7

Goal: Estimate per-core chiplet power that we cannot directly measure

Method:
■ For each functional unit, pick small subset of activities to infer power consumption (e.g. cache & regfile reads & writes, execution pipeline issue)
■ Weight each activity to represent how much relative power it consumes
■ Combine weighted Core, L2, and L3 activity, then add constant offset plus clock grid power to form:

    Chiplet Active Power = ∑ (Wi * Ai) + C + K*f

Result:
■ EnergyScale Firmware adjusts this value for effects of leakage, temperature, and voltage

[Figure: processor core chiplet (CORE with IFU, ISU, LSU, FXU, VSU & FPU, DFU; L2; L3; NCU) with activity sense points feeding the power proxy. Labels: core activity, 4 events, 5 events.]

Hardware design was driven by power model abstraction research at IBM Watson (A. Buyuktosunoglu et al.)

IEEE Micro, 2011 (to appear)

IBM J. R&D, vol. 55, no. 3, 2011
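The proxy formula is a weighted sum plus two correction terms, which the following sketch evaluates directly. The function name, the counter values, the weights, and the units are all hypothetical and chosen only to make the arithmetic easy to follow; they are not POWER7 calibration data.

```python
def chiplet_active_power(weights, activities, offset, clock_coeff, freq):
    """Chiplet Active Power = sum(Wi * Ai) + C + K * f."""
    if len(weights) != len(activities):
        raise ValueError("one weight per activity counter is required")
    weighted = sum(w * a for w, a in zip(weights, activities))
    return weighted + offset + clock_coeff * freq


# Hypothetical numbers: three activity counters sampled over one interval.
weights = [0.002, 0.004, 0.001]   # Wi: relative power per event
activities = [1000, 500, 2000]    # Ai: event counts
proxy = chiplet_active_power(weights, activities, offset=3.0, clock_coeff=1.5, freq=4.0)
print(round(proxy, 3))  # 6.0 from activity + 3.0 offset + 6.0 clock grid = 15.0
```

Because the estimate is a handful of multiply-accumulate operations, it is cheap enough to compute in hardware, with firmware applying the leakage, temperature, and voltage adjustments afterwards.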



Power Proxy Measurements

■ EnergyScale firmware budgets power across multiple processors and memory, used to:

  – Shift power to cores or other components (e.g. memory) that need it the most (especially important to achieve higher overall performance under a power cap)

  – Enable Server Partition power accounting

IEEE Micro, 2011 (to appear)

IBM J. R&D, vol. 55, no. 3, 2011




Pitfalls of Architectural Abstractions: An Example from Soft Error Rate (SER) Analysis

System SER = ∑ [AVF(i) * Raw_SER(i)]    (AVF + SOFR abstraction)

AVF: Architectural Vulnerability Factor

Errors in AVF+SOFR-based estimation get very large when the number of modeled cores, C, in the system becomes very large, or if the raw error rate of each of the N cores becomes very large

[Figure: relative error (0% to 100%) of the estimate as a function of N*S and C (2, 8, 5K, 50K, 500K). Panel "gzip": N*S from 2.0E10 to 5.0E12. Panel "Day workload": N*S from 1.0E5 to 1.0E10.]

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions, X. Li, S. V. Adve, P. Bose, and J. A. Rivers, Proc. of the Int’l. Conf. on Dependable Systems and Networks (DSN), June 2007.
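For reference, the abstraction being criticized amounts to the following calculation. This is a minimal sketch with hypothetical component values and function names; it implements the AVF + SOFR (sum of failure rates) estimate exactly as the formula states it, including the constant-rate assumption behind turning a rate into an MTTF, which is the assumption the cited paper shows breaking down at large scale or high raw error rates.

```python
def system_ser(components):
    """Sum-of-failure-rates estimate: sum of AVF(i) * Raw_SER(i) over all components."""
    return sum(avf * raw_ser for avf, raw_ser in components)


def mttf_from_ser(ser):
    """MTTF implied by a constant failure rate (the exponential assumption)."""
    return 1.0 / ser


# Hypothetical components: (AVF, raw soft error rate in errors per hour).
components = [(0.2, 1e-6), (0.5, 4e-7)]
ser = system_ser(components)
print(ser, mttf_from_ser(ser))
```

The estimate is attractive because it needs only one AVF per component and no knowledge of how failures interleave with program behavior, which is also why it can go wrong.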



Power Model Calibration/Validation Methodology

[Figure: calibration loop. A parameter file (e.g. FXU utilization target: 30%) drives HotGen test case generation. The resulting microbenchmark is run in simulation on the integrated model (power, temp, perf) and in measurement (SIMP: actual chip with IR camera). The two results are compared and used to calibrate the model.]

Zhigang Hu et al. 2005-06

H. Hamann et al. JSSC, Jan 07

POWER5 Hotspot Patterns

[Figure: thermal map and power map]

- 50 different workloads for POWER5 imaged & analyzed
  • HotGen microbenchmark generator tool
- observed significant differences in circuit utilization

(H. Hamann et al., ISSCC-2006)




Optimal Pipeline Depth: TPCC Workload

[Figure: bips and bips^3/W, relative to optimal FO4 (0 to 1), versus total FO4 per stage (7 to 37 in steps of 3). The power-performance optimal and performance optimal points are marked; the optimum moves to deeper pipeline depth for SPEC workloads.]

V. Srinivasan et al., MICRO-2002

V. Zyuban et al., IEEETC, 8/2004

Note: Optimal point on x-axis is the important output of such an analysis model; y-axis value absolute accuracy not very important!



CMP Space Exploration Results

[Figure: BIPS (0 to 10) versus number of cores (2 to 20) for configuration 2MB/18FO4/4; 400mm2, cheap thermal package, CPU bound benchmark. The peak marks the optimal core-count for a given core type.]

Yingmin Li, Zhigang Hu et al., HPCA 2006

Analytical or hybrid models do quite well in such scenarios


Chip-level Lifetime Reliability Analysis

[Figure: six panels over the chip floorplan (units L2, L2C, L3DIR, FPU, FXU, ISU, IFU, LSU, BRU, NCU, MC, GX, FBC):
◊ Floorplan
◊ Power (W, 0.0 to 0.6)
◊ Temperature (°C, 55 to 75)
◊ FIT due to EM (x10^3 FORC_EM, 0.0 to 2.5)
◊ FIT due to NBTI (x10^6 FORC_NBTI, 0 to 25)
◊ FIT due to TDDB (x10^9 FORC_TDDB, 0 to 4)]

Jeonghee Shin et al., DSN-2007, ISCA-2008


Power-Performance Tradeoffs (on-chip, global power management; DVFS): A Key Modeling Challenge!

■ MaxBIPS within 1% of Oracle
■ Verification complexity of multi-core power management algorithms (scalability) is a key issue [A. Lungu et al. MEMOCODE 2009]

[Figure: two panels versus power budget (60% to 100%) for the Priority, pullHi_pushLo, MaxBIPS, and Oracle policies: performance degradation (0% to 14%) and power (57% to 97%).]

C. Isci, A. Buyuktosunoglu et al., MICRO-39, 2006
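A policy in the spirit of MaxBIPS can be sketched as a search over per-core DVFS modes for the combination with the highest total throughput that still fits the chip power budget. Everything specific below is an assumption made for illustration: the three mode names and scale factors, the rule that BIPS scales linearly and power cubically with the scale factor, and the exhaustive search itself. The cited MICRO paper should be consulted for the actual policy definitions.

```python
from itertools import product

# Hypothetical DVFS modes: (name, voltage/frequency scale relative to nominal).
MODES = [("turbo", 1.00), ("eff1", 0.95), ("eff2", 0.85)]


def max_bips(core_bips, core_power, budget):
    """Pick one mode per core to maximize total BIPS within the power budget.

    Assumes BIPS scales linearly and power cubically with the scale factor.
    Returns (total_bips, total_power, mode_names) or None if nothing fits.
    """
    best = None
    for combo in product(MODES, repeat=len(core_bips)):
        power = sum(p * s ** 3 for p, (_, s) in zip(core_power, combo))
        if power > budget:
            continue
        bips = sum(b * s for b, (_, s) in zip(core_bips, combo))
        if best is None or bips > best[0]:
            best = (bips, power, [name for name, _ in combo])
    return best


# Two cores with different throughput at nominal settings, 80% power budget.
best = max_bips(core_bips=[2.0, 1.0], core_power=[10.0, 10.0], budget=16.0)
print(best)
```

In this example the search slows the low-throughput core the most (`eff2`) and keeps the high-throughput core at `eff1`. The search space grows as modes^cores, which hints at why scalability and verification of such algorithms become issues at higher core counts.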


Activity migration [temperature-aware task scheduling] reduces maximum on-chip temperatures

[Figure: thermal maps for (a) DAXPY running on core 0, (b) DAXPY running on core 1, (c) DAXPY hopping every 7ms. Scale: 1 ~ 3.3 Celsius]

Chip designs could leverage the lower temperatures for higher frequencies, lower-cost packaging or enhanced reliability.

J. Choi, C-Y. Cher et al., ISLPED07

Leveraging Spatial Heat Slack

Activity Migration reduces Hotspots

Summary: Core-hopping (4 ms) reduces maximum on-chip temperature

[Figure: reduction in temperatures (Celsius, 0.0 to 10.0), maximum delta temperature per workload: daxpy, apsi, fma3d, lucas, swim, bzip2, twolf, vortex, vpr. Bar labels as printed: 5.5, 4.2, 3.3, 4.9, 5.1, 2.2, 2.3, 2.0, 3.5. % slow down as printed: 2.5, 0.9, 1.6, 1.1, 1.0, 0.4, -0.5, -1.1, 0.1.]

J. Choi, C-Y. Cher et al., ISLPED07

Measurement-based analysis; very hard to project accurately via simulation

Power Gating as a Dynamic Management Knob

■ Power Gating (PG) is becoming an essential actuation knob for dynamic power management
  – Header or footer transistor gates off power to the “macro” during idle durations
  – Applied at core-level (per-core PG) or within a core at the unit-level
■ PG is applicable to a broad range of compute nodes that exhibit variable idle times
  – Mobile, Desktop, Enterprise etc.
■ Our end target is efficiency at all levels: from chips, all the way through to the data center level

[Figure: header transistor implementation. A Sleep-controlled header transistor sits between Vdd and the Virtual Vdd rail that feeds the logic block.]

Methodology for Core-Level Power Gating Analysis

■ Use bit-vector traces (utilization) from instrumented cycle-accurate perf. simulator
  – Workloads: SPEC, other traces
■ Implement trace driven simulator for power gating algorithms, obtain:
  – Leakage power savings estimate
  – Projected performance impact
  – Assume constant performance impact of 3 cycles on wake-up

[Figure: flow. Benchmark → instrumented P6-like simulator → utilization bit-vector traces, one per unit (FXU0, FXU1, FPU0, FPU1, LSU0, LSU1, ...; e.g. 0 1 1 0 1 0 …) → C simulator of power gating algorithms → leakage power saving estimate and projected performance impact.]

A. Lungu et al., ISLPED-2009
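A trace-driven power gating study of this kind can be sketched in a few lines. The code below is a simplified Python illustration rather than the C simulator the slide mentions: it replays one unit's utilization bit-vector through an idle-detect policy and reports net leakage cycles saved and stall cycles. The 3-cycle wake-up penalty comes from the slide; the accounting model (each gating event costs `break_even` cycles of leakage-equivalent energy) and all names are assumptions.

```python
def simulate_idle_detect(trace, idle_detect, break_even, wakeup_penalty=3):
    """Replay a utilization bit-vector (1 = busy) through an idle-detect policy.

    The unit is gated once it has been idle for idle_detect cycles. Each gating
    event is charged break_even cycles of overhead; each gated cycle saves one
    cycle of leakage. Returns (net_leakage_cycles_saved, stall_cycles).
    """
    idle_run = 0
    gated = False
    saved = 0
    stalls = 0
    for busy in trace:
        if busy:
            if gated:
                stalls += wakeup_penalty  # unit must wake up before it can work
                gated = False
            idle_run = 0
        elif gated:
            saved += 1  # one cycle of leakage avoided
        else:
            idle_run += 1
            if idle_run >= idle_detect:
                gated = True
                saved -= break_even  # pay the gating overhead up front
    return saved, stalls


# One long idle period: gating pays off.
long_idle = simulate_idle_detect([1] * 5 + [0] * 43 + [1] * 5, idle_detect=3, break_even=20)
# A loop with 13-cycle idle gaps: every gating event loses energy.
loop_trace = ([1] * 2 + [0] * 13) * 4 + [1]
short_idle = simulate_idle_detect(loop_trace, idle_detect=3, break_even=20)
print(long_idle, short_idle)  # (20, 3) (-40, 12)
```

The second trace shows how a policy tuned with a short idle-detect window can lose energy on every event when a loop produces idle gaps shorter than the break-even point.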



Power Savings Potential for Power Gating of Functional Units

[Figure: power gate potential as a function of break-even point for FXU0 and FXU1 units. % leakage savings (0 to 100) for FXU0 FP benchmarks, FXU0 INT benchmarks, FXU1 FP benchmarks, and FXU1 INT benchmarks. Labeled values as printed: 57.64, 45.87, 80.35, 62.38.]

[Figure: power gate potential as a function of break-even point for LSU0 and LSU1 units. % leakage savings (0 to 100) versus break-even point (24 down to 0) for LSU0 FP benchmarks, LSU0 INT benchmarks, LSU1 FP benchmarks, and LSU1 INT benchmarks. Labeled values as printed: 39.77, 46.85, 60.66, 65.26.]

Large Potential for Power Gating!

A. Lungu et al., ISLPED-2009



Pitfalls of Current Power Gating Algorithms

■ Idle interval prediction can be consistently wrong:
  – => power gating algorithm consistently wastes power instead of saving
■ Possible scenarios in loops
  – Idle monitor failure: idle detect 3, break-even 20
    • Average leakage power loss 100%
  – Utilization monitor failure: utilization threshold 30%
    • Average leakage power loss 98.5%

[Figure: three panels over cycles 1 to 21. Utilization pattern (0 or 1). Idle monitor algorithm: % leakage power savings (-400 to 400), average savings -100. Utilization monitor algorithm: % leakage power savings (-400 to 400), average savings -98.58.]

A. Lungu et al., ISLPED-2009



Single Level Idle Detect Power Gating Algorithm

[Figure: projected performance impact of idle counter solution (FP benchmark). % projected performance loss (0 to 20) per benchmark (namd, calculix, dealII, lbm, bwaves, cactusADM, gamess, gemsFDTD, gromacs, leslie3d, milc, povray, soplex, sphinx3, tonto, wrf, zeusmp, Avg) for idle detect settings 15, 13, 11, 9, 7, 5, 3, plus Oracle and Cycle by Cycle. Labeled value: 11.52.]

[Figure: power savings of idle counter solution as a function of idle_detect for FXU0 unit (FP benchmark). % leakage savings (-20 to 100) for the same benchmarks. Labeled values as printed: 28.81, -8.17, 34.93, 57.64. Annotation: Inefficient Behavior.]

Two Level (Guarded) Power Gating Algorithms

■ Observations:
  – Efficiency requirement of power saving schemes: save power
  – Single level idle prediction algorithms can behave incorrectly and waste power
■ Target:
  – Improve quality of power gating schemes by reducing or eliminating their risk of wasting power
■ Idea:
  – Add second level monitor to control enabling of power gating scheme
  – Improve efficiency of power wasting cases without degrading power saving of the common case

[Figure: two-level scheme. Level 2 (Monitor & Control): efficiency counters (Cnt1++, Cnt2++) estimate power savings and drive the decision Enable = 1 or Enable = 0. Level 1 (Actuate): states On, Off_U, Off_C.]

Off_U: Power gated, uncompensated

Off_C: Power gated, compensated
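One way to read the second-level monitor is as a pair of counters that track how long the unit spent gated before and after the break-even point, and an enable decision based on the resulting net savings. The sketch below follows that reading; the mapping of the two counters onto the Off_U and Off_C states, the epoch-based decision, and the function name are all assumptions, since the slide only names the states and counters.

```python
def guard_decision(gated_intervals, break_even):
    """Level-2 monitor: decide whether gating should stay enabled next epoch.

    gated_intervals holds the length in cycles of each gated interval observed
    in the last epoch. Cycles before break_even count as uncompensated (Off_U),
    cycles after it as compensated (Off_C).
    Returns (enable, cnt_uncompensated, cnt_compensated, net_savings).
    """
    cnt_uncompensated = 0
    cnt_compensated = 0
    net_savings = 0
    for length in gated_intervals:
        cnt_uncompensated += min(length, break_even)
        cnt_compensated += max(length - break_even, 0)
        net_savings += length - break_even  # negative when the interval was too short
    enable = net_savings > 0
    return enable, cnt_uncompensated, cnt_compensated, net_savings


wasteful = guard_decision([10, 10, 10, 10], break_even=20)
useful = guard_decision([40, 60, 5], break_even=20)
print(wasteful, useful)  # (False, 40, 0, -40) (True, 45, 60, 45)
```

The first epoch never reaches break-even, so the guard turns the level-1 scheme off; the second epoch tolerates one short interval because the long ones more than pay for it.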



Power Gating in a Datacenter Setting

[Figure: incoming tasks arrive over the datacenter infrastructure network at cores Core1, Core2, ..., CoreN. A power gating module observes resource utilization and the idle & burst distribution, and issues PWR ON/OFF (#cores ON/OFF, unit-level PG).]

N. Madan et al., HPCA-17, 2011


Problems with Core-Level Power Gating

[Figure: utilization versus time with markers t1 to t4: decide to PG, then wake up cores. The power gating module issues PWR ON/OFF to Core1, Core2, ..., CoreN.]

• Aggressive PG BAD! Cannot be aggressive with PG as penalties can be huge
• Conservative PG BAD! Cannot be overly conservative as power saving potential is lost



Augmenting Core-Level Power Gating with Guarding

[Figure: the datacenter setup (network, incoming tasks, Core1, Core2, ..., CoreN, power gating module with PWR ON/OFF, #cores ON/OFF, unit-level PG, resource utilization, idle & burst distribution) augmented with a guarded gating module that monitors perf loss % and #wake-ups and can (dis/en)able gating.]



Proposed Guard Mechanism

■ Monitor system response time
  – Response time can be very high when the system is overly utilized
■ Monitor number of core wake-ups
  – Wake-up latency and switching power can be negligible too
■ Only if both monitors show unacceptable behavior
  – Disable power manager
  – Re-enable power manager after a programmable time period
  – Alert the system manager

[Figure: guard mechanism. Monitor 1: performance (response time). Monitor 2: #core wake-ups. Together they check for safe workload conditions and set Enable = 0 or Enable = 1 on the power gating manager (Count++). Monitor 3: frequency of enable/disable (Count), which informs the system administrator.]

See N. Madan et al., HPCA-17, 2011 for Evaluation Results

More coverage at: Energy-Secure Architectures: Tutorial at ISCA-2011
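The guard logic described on this slide can be sketched as a small state machine that is stepped once per monitoring epoch. The class below is an illustrative reading, not the evaluated design: the threshold fields, the epoch-based timing, and the rule for when to alert the administrator are hypothetical parameters chosen to mirror the three monitors.

```python
from dataclasses import dataclass


@dataclass
class GuardMechanism:
    max_response_time: float  # hypothetical threshold for Monitor 1
    max_wakeups: int          # hypothetical threshold for Monitor 2
    max_toggles: int          # hypothetical threshold for Monitor 3
    reenable_after: int       # programmable disable period, in epochs
    enabled: bool = True
    toggles: int = 0
    disabled_for: int = 0

    def step(self, response_time: float, wakeups: int) -> bool:
        """Advance one epoch; return True if the administrator should be alerted."""
        if self.enabled:
            # Disable gating only if BOTH monitors show unacceptable behavior.
            if response_time > self.max_response_time and wakeups > self.max_wakeups:
                self.enabled = False
                self.disabled_for = 0
                self.toggles += 1
        else:
            self.disabled_for += 1
            if self.disabled_for >= self.reenable_after:
                self.enabled = True
        return self.toggles > self.max_toggles


guard = GuardMechanism(max_response_time=100.0, max_wakeups=50, max_toggles=1, reenable_after=2)
history = []
for response_time, wakeups in [(150, 80), (50, 10), (50, 10), (150, 20), (200, 90)]:
    alert = guard.step(response_time, wakeups)
    history.append((guard.enabled, alert))
print(history)
```

In this run the guard disables gating in the first epoch, re-enables it two epochs later, ignores an epoch where only response time is bad, and raises the alert when gating has to be disabled a second time.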


Queuing Model Based Evaluation Framework

[Figure: incoming tasks are dispatched to Core1, Core2, ..., CoreN, with a path for tasks with expired time slice. The power gating module (IdlePG, UtilPG) sets #cores ON/OFF via PWR ON/OFF.]

See N. Madan et al., HPCA-17, 2011 for evaluation results


Concluding Remarks

■ Power and Reliability Walls are Key Impediments to Realization of Extreme Scale Computing Targets of the Future

  – Reliability may well be the more fundamental obstacle beyond a certain size of the system

■ Integrated Models (power/temperature, performance, reliability) are a Grand Challenge

  – Analytical abstraction methods are essential for speed

  – Yet, accuracy requirements at core/chip and other component level are more stringent than ever because of the implications of the huge scale (system size)


Thank you!
