Olay: Combat the Signs of Aging with Introspective Reliability Management

Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
W-QUAD (ISCA-35), June 21, 2008


TRANSCRIPT

Page 1: Olay: Combat the Signs of Aging with Introspective Reliability Management Authors: Shuguang Feng Shantanu Gupta Scott Mahlke

University of Michigan, Electrical Engineering and Computer Science

Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke

W-QUAD (ISCA-35), June 21, 2008

Page 2: Motivation

“Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)

[Srinivasan, DSN'04] [Borkar, MICRO'05]

More failures to come

Failures will be wearout induced

Page 3: Approaches to Reliability

Tolerate Faults (reactive): Detect, Diagnose, Repair/reconfigure/recover

or…

Prevent Faults (proactive): Margining, Robust cell topologies, Dynamic thermal mgmt (DTM), Introspective reliability mgmt (IRM), High-K dielectrics, Passivation

Levels shown: Architecture-level, Circuit-level

Example techniques shown: Diva, Argus, WDU, Heat-and-Run, Reliability Banking, RAMP

IRM: Targeted management based on wearout monitoring

Page 4: Not All Cores Are Created Equal

Chip-multiprocessors will be subject to severe process variation

Dynamic thermal/power budgeting can be suboptimal
  Temperature is only part of the picture
  Need low-level reliability awareness

Low-level sensors measure physical changes

Wearout-aware management makes reliability enhancement more effective
  System reconfiguration
  Dynamic voltage and frequency scaling (DVFS)
  Job assignment

Page 5: Introspective Reliability Management (IRM)

[Figure: IRM block diagram. Labels: Raw Sensor Data, Filtering and Analysis, Processed Data, Aggregate Analysis, Virtualization Layer, Reliability Assessment, Management Decisions, IRM Policy, OS, Scheduled Jobs]

Low-level Sensors: delay, leakage, temperature, etc.

WDU [MICRO'07]
  Measures propagation delay
  Tracks statistical trends

Olay
  Tracks the progression of wearout
  Profiles workload behavior
  Generates wearout-aware job schedules
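The sensor path on this slide (raw sensor data, filtering, trend tracking) can be pictured with a small sketch. This is not the WDU or Olay implementation: the exponential moving average, the `DelayTrend` class, the `smoothing` parameter, and the sample values are all assumptions chosen only to illustrate filtering noisy delay readings and tracking drift against a baseline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayTrend:
    """Tracks a smoothed propagation-delay reading for one module (hypothetical)."""
    smoothing: float = 0.2            # EMA weight given to each new sample (assumed)
    baseline: Optional[float] = None  # first reading, treated as the "fresh" delay
    filtered: Optional[float] = None  # current filtered reading

    def update(self, raw_delay: float) -> float:
        """Fold one raw sensor sample into the running average."""
        if self.filtered is None:
            self.filtered = raw_delay
            self.baseline = raw_delay
        else:
            self.filtered += self.smoothing * (raw_delay - self.filtered)
        return self.filtered

    def slowdown(self) -> float:
        """Fractional delay increase relative to the baseline (0.0 means no drift)."""
        if self.filtered is None or self.baseline is None:
            return 0.0
        return self.filtered / self.baseline - 1.0


if __name__ == "__main__":
    trend = DelayTrend()
    for sample in [1.00, 1.02, 0.99, 1.05, 1.07, 1.06]:  # made-up delays in ns
        trend.update(sample)
    print(f"filtered delay: {trend.filtered:.4f} ns, slowdown: {trend.slowdown():.2%}")
```

A higher-level policy could then rank modules or cores by `slowdown()`; how Olay actually turns sensor trends into a reliability assessment is not specified on the slide.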

Page 6: Wearout-aware Scheduling

[Figure: active jobs (T0, T1, T2, T3, ... Tn) are assigned to the available cores to form a job schedule; each core slot holds one job (T0 to T11) or is Idle. Three example schedules are shown.]

Per-module Reliability Profile (Activity): 75%, 15%, 25%, 35%, 50%, 25%, 45%, 5%, 10%, 35%, 25%, 85%

Page 7: Wearout-aware Scheduling

[Figure: the IRM block diagram (Raw Sensor Data, Filtering and Analysis, Processed Data, Aggregate Analysis, Virtualization Layer, Reliability Assessment, IRM Policy, OS, Scheduled Jobs) with the policy performing Job-to-Core Binding. Cores are ranked between Weak and Strong; applications (T0, T1, T2, T3, ... Tn) are ranked between Lightweight and Heavyweight.]

Life Remaining per core (scale 100% to 0%): 30%, 50%, 30%, 25%, 17%, 35%, 80%, 17%, 60%, 55%, 15%, 75%, 70%, 85%, 10%, 8%

Page 8: Wearout-aware Policies

GreedyE
  Optimizes for early life performance
  Minimizes premature failures with wear-leveling

[Figure: cores (C0 ... Cn) are ordered from Weak to Strong and jobs (T0 ... Tn) from Light to Heavy; the two ordered lists are combined into the Schedule.]

Page 9: Wearout-aware Policies

GreedyE
  Optimizes for early life performance
  Minimizes premature failures with wear-leveling

GreedyL
  Optimizes for end of life performance
  Victimizes weak cores to maximize the life of stronger cores

GreedyA
  Hybrid of GreedyE and GreedyL
  Adapts behavior based on system utilization
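The three policies can be sketched as sort-and-pair heuristics. This is only one reading of the slides, not the authors' algorithm: the function names, the `core_life` and `job_weight` inputs, the choice of which cores idle when there are fewer jobs than cores, and above all the GreedyA switching rule (a hypothetical `utilization_threshold` and its direction) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Schedule = Dict[str, Optional[str]]


def _bind(cores: List[str], jobs: List[str]) -> Schedule:
    """Pair cores and jobs in order; unpaired cores idle (None), extra jobs wait."""
    schedule: Schedule = {core: None for core in cores}
    for core, job in zip(cores, jobs):
        schedule[core] = job
    return schedule


def greedy_e(core_life: Dict[str, float], job_weight: Dict[str, float]) -> Schedule:
    """Wear-leveling: heaviest jobs on strongest cores, so weak cores get light jobs or idle."""
    cores = sorted(core_life, key=core_life.get, reverse=True)   # strongest first
    jobs = sorted(job_weight, key=job_weight.get, reverse=True)  # heaviest first
    return _bind(cores, jobs)


def greedy_l(core_life: Dict[str, float], job_weight: Dict[str, float]) -> Schedule:
    """Victimize weak cores: heaviest jobs on weakest cores, sparing the strong ones."""
    cores = sorted(core_life, key=core_life.get)                 # weakest first
    jobs = sorted(job_weight, key=job_weight.get, reverse=True)  # heaviest first
    return _bind(cores, jobs)


def greedy_a(
    core_life: Dict[str, float],
    job_weight: Dict[str, float],
    utilization_threshold: float = 0.5,  # hypothetical switch point
    low_policy: Callable[..., Schedule] = greedy_e,   # assumed direction
    high_policy: Callable[..., Schedule] = greedy_l,  # assumed direction
) -> Schedule:
    """Hybrid: choose a policy from current utilization (jobs per core, capped at 1)."""
    utilization = min(1.0, len(job_weight) / len(core_life))
    policy = high_policy if utilization >= utilization_threshold else low_policy
    return policy(core_life, job_weight)


if __name__ == "__main__":
    life = {"C0": 0.30, "C1": 0.80, "C2": 0.55}  # fraction of life remaining (made up)
    jobs = {"T0": 0.9, "T1": 0.2}                # relative job stress (made up)
    print("GreedyE:", greedy_e(life, jobs))
    print("GreedyL:", greedy_l(life, jobs))
```

Because `low_policy` and `high_policy` are parameters, the opposite switching direction can be tried without changing the code.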

Page 10: Lifetime Reliability Simulation (FACE)

Offline Characterization
  Benchmark Suite: SPEC2000 (INT & FP)
  Tools: SimAlpha, Wattch, HotSpot
  Traces: Execution Trace, Power Trace, Temperature Trace
  Output: Benchmark Profiles

Synthetic Benchmarks
  Representative of SPEC2000 suite
  Reduce online profiling complexity

Page 11: Lifetime Reliability Simulation (FACE)

Offline Characterization: Benchmark Suite, SimAlpha, Wattch, HotSpot, Benchmark Profiles

Online Simulation: Workload Simulator, CMP Simulator, Olay, Monte Carlo Engine

Reliability Management
  Monitors CMP health
  Wearout-aware scheduling: profiling, intelligent heuristics

Simulate CMP Aging
  Tracks progression of wearout mechanisms
  Hierarchical design

Workload Generation
  Emulates OS scheduler
  Temperature traces
  Power traces

Parameter Specification
  Device lifetimes
  Utilization pattern

Page 12: Wearout Modeling

Mean time to failure (MTTF): defines distribution of device lifetimes

$$\mathrm{MTTF}_{NBTI} \propto \left(\frac{1}{V}\right)^{\gamma} e^{\frac{E_{aNBTI}}{kT}}$$

$$\mathrm{MTTF}_{TDDB} \propto \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{X + Y/T + ZT}{kT}}$$

Damage accumulation

[Equation: damage D accumulated over intervals i = 1 ... n, with terms in MTTF_qual and MTTF_i; the full expression is not recoverable from the slide text]

where α is the degradation rate
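As a rough illustration of how MTTF-style wearout models are used, the sketch below evaluates the commonly cited voltage and temperature scaling forms for NBTI and TDDB (a power law in 1/V times an exponential in 1/kT) as ratios between two operating points. All parameter values passed in are placeholders supplied by the caller, not values from the talk. The `accumulate_damage` helper is a plain linear stand-in (damage grows by `alpha * dt` each interval), not the slide's damage-accumulation equation.

```python
import math
from typing import Iterable, Optional

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def nbti_log_mttf(voltage: float, temp_k: float, gamma: float, e_a: float) -> float:
    """ln(MTTF_NBTI) up to an additive constant: -gamma*ln(V) + Ea/(kT)."""
    return -gamma * math.log(voltage) + e_a / (BOLTZMANN_EV * temp_k)


def tddb_log_mttf(voltage: float, temp_k: float,
                  a: float, b: float, x: float, y: float, z: float) -> float:
    """ln(MTTF_TDDB) up to an additive constant: -(a - bT)*ln(V) + (X + Y/T + ZT)/(kT)."""
    exponent = (x + y / temp_k + z * temp_k) / (BOLTZMANN_EV * temp_k)
    return -(a - b * temp_k) * math.log(voltage) + exponent


def relative_mttf(log_mttf_stress: float, log_mttf_ref: float) -> float:
    """MTTF at a stress point divided by MTTF at a reference point."""
    return math.exp(log_mttf_stress - log_mttf_ref)


def accumulate_damage(alphas: Iterable[float], dt: float = 1.0) -> Optional[int]:
    """Linear stand-in: return the 1-based interval at which damage reaches 1.0, else None."""
    damage = 0.0
    for step, alpha in enumerate(alphas, start=1):
        damage += alpha * dt  # alpha plays the role of a per-interval degradation rate
        if damage >= 1.0:
            return step
    return None


if __name__ == "__main__":
    # Placeholder NBTI parameters (assumptions for illustration only).
    hot = nbti_log_mttf(voltage=1.0, temp_k=360.0, gamma=6.0, e_a=0.5)
    cool = nbti_log_mttf(voltage=1.0, temp_k=340.0, gamma=6.0, e_a=0.5)
    print("NBTI MTTF at 360 K relative to 340 K:", relative_mttf(hot, cool))
```

Working with log-MTTF keeps the unknown proportionality constant out of the picture, since it cancels in any ratio.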

Page 13: CMP Reliability Simulation

Hierarchy: CMP, Core, Module, Transistor

Transistors: multiple mechanisms evolve independently

Modules
  Experience load-dependent stress
  Smallest granularity of temperature modeling

Cores: Alpha 21264-type processor

CMPs
  Variable number of cores
  Model systematic variation
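A hierarchical Monte Carlo lifetime simulation of this kind might be sketched as follows. The details are assumptions rather than the FACE implementation: Weibull-distributed lifetimes with a hypothetical `shape`, a series-failure rule at every level (a device fails at its first failed mechanism, a module at its first failed device, a core at its first failed module), and the made-up `devices_per_module` count. Load-dependent stress and systematic variation are left out.

```python
import math
import random
from typing import List, Sequence


def sample_transistor_lifetime(rng: random.Random,
                               mechanism_mttfs: Sequence[float],
                               shape: float = 2.0) -> float:
    """Mechanisms evolve independently; the earliest failure ends the device's life."""
    lifetimes = []
    for mttf in mechanism_mttfs:
        # Pick the Weibull scale so that the distribution's mean equals the MTTF.
        scale = mttf / math.gamma(1.0 + 1.0 / shape)
        lifetimes.append(rng.weibullvariate(scale, shape))
    return min(lifetimes)


def sample_core_lifetime(rng: random.Random,
                         module_mttfs: List[List[float]],
                         devices_per_module: int = 4) -> float:
    """Assumed series model: the core fails when any device in any module fails."""
    return min(
        sample_transistor_lifetime(rng, mttfs)
        for mttfs in module_mttfs
        for _ in range(devices_per_module)
    )


def simulate_cmp(n_cores: int, module_mttfs: List[List[float]],
                 trials: int = 1000, seed: int = 0) -> float:
    """Mean time to the first core failure, averaged over Monte Carlo trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(sample_core_lifetime(rng, module_mttfs) for _ in range(n_cores))
    return total / trials


if __name__ == "__main__":
    # Two modules, each with two wearout mechanisms; MTTFs in arbitrary units (made up).
    modules = [[10.0, 12.0], [8.0, 15.0]]
    for cores in (1, 4, 16):
        print(cores, "cores: mean time to first core failure =", simulate_cmp(cores, modules))
```

Seeding a private `random.Random` keeps runs reproducible, which makes it easier to compare scheduling policies on identical sampled lifetimes.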

Page 14: Evaluation

Policies: Random (baseline), GreedyE, GreedyL, GreedyA

Figures of merit
  Failure distribution
  Useful work performed prior to system failure

Varied system parameters
  CMP size
  System utilization
  Sensor error

Page 15: Failure Distribution

[Figure: failure distribution, w/ 16 cores]

Page 16: Sensitivity to System Utilization

[Figure: sensitivity to system utilization, w/ 16 cores]

Page 17: Sensitivity to CMP Size

[Figure: sensitivity to CMP size, w/ 100% utilization & GreedyE]

Page 18: Sensitivity to Sensor Error

[Figure: sensitivity to sensor error, w/ 16 cores, 100% utilization, & GreedyE]

Page 19: Conclusions

Heterogeneity exists in both CMPs and their workloads

Wearout-aware job assignments effectively exploit this heterogeneity
  Real-time health monitoring (low-level sensors)

CMPs augmented with Olay perform up to 20% more useful work

Proper high-level analysis and profiling is essential for enhancing lifetime reliability.

Page 20: Questions?