samira khan, chris wilkerson, zhewang, alaaalameldeen ... · samira khan, chris wilkerson, zhewang,...

1
MEMCON: Detecting and Mitigating Data-Dependent Failures by Exploiting Current Memory Content Samira Khan, Chris Wilkerson, Zhe Wang, Alaa Alameldeen, Donghyuk Lee, Onur Mutlu University of Virginia, Intel, Nvidia, ETH Zurich CHALLENGE IN DETECTION VISION: SYSTEM-LEVEL DETECTION AND MITIGATION Unreliable DRAM Cells Detect and Mitigate Reliable System Detect and mitigate errors after the system has become operational ONLINE PROFILING Unreliable DRAM Cells BENEFITS OF ONLINE PROFILING Reliable DRAM Cells Technology Scaling 1. Improves yield, reduces cost, enables scaling Vendors can make cells smaller without a strong reliability guarantee Unreliable DRAM Cells BENEFITS OF ONLINE PROFILING LO-REF HI-REF HI-REF LO-REF LO-REF 2. Improves performance and energy efficiency Reduce refresh rate, refresh faulty rows more frequently Reduce refresh count by using a lower refresh rate, but use higher refresh rate for faulty cells 0 1 0 Some cells can fail depending on the data stored in neighboring cells FAILURE DETECTION IS HARD DUE TO INTERMITTENT FAILURES 1 1 1 NO FAILURE DATA-DEPENDENT FAILURE HOW TO DETECT DATA-DEPENDENT FAILURES? LINEAR MAPPING X-1 X X+1 L D R 0 1 0 SCRAMBLED MAPPING X-4 X X+2 0 1 0 0 1 0 NOT EXPOSED TO THE SYSTEM Test with specific data pattern in neighboring cells How to detect data-dependent failures when we even do not know which cells are neighbors? SCRAMBLED MAPPING 0 1 0 X-? X X+? GOAL Detects data-dependent failures without the knowledge of the DRAM internal address mapping CURRENT DETECTION MECHANISM Unreliable DRAM Cells Initial Failure Detection and Mitigation Applications in Execution Detect every possible failure with all content before execution Pattern x, Cell A Pattern y, Cell B Pattern z, Cell C (All possible failing cell) 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 List of Failures Applications MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION Unreliable DRAM Cells with Program Content Simultaneous Detection and Execution Based on current memory content of running applications NO NEED TO DETECT EVERY POSSIBLE FAILURE Current content, Cell A 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 Need to detect and mitigate only with the current content List of Failures Application MEMCON: COST-BENEFIT ANALYSIS Cost: Extra memory accesses to read and write rows Benefit: If no failure found, can reduce refresh rate Avg Cost Avg Cost 16 ms HI-REF 64 ms LO-REF Testing Testing Testing Time Cost t 1 t 2 t 3 16 ms HI-REF 64 ms LO-REF Testing 16 ms HI-REF Time Cost t 1 t 2 t 3 Initiate a test only when the cost can amortized Frequent Testing Selective Testing Time How much longer?? MEMCON selectively initiates testing when the write interval is long enough to amortize the cost of testing 1. No initial detection and mitigation 2. Start running the application with a high refresh rate 3. Detect failures with the current memory content If no failure found, use a low refresh rate LO-REF HI-REF LO-REF LO-REF The longer the elapsed time after a write à The longer the write interval Write intervals follow a Pareto distribution WRITE INTERVAL PREDICTION Wait for 1024 ms Expected RIL > 1024 ms Time Write Interval Length After a write, wait for a CIL, where P(RIL) > 1024 is high If idle, predict the interval will last more than 1024 ms MEMCON: REDUCTION IN REFRESH COUNT 0 25 50 75 100 ACBrother AdobePh… AllSysMark AVCHD BlurMotion FinalCutPro FinalMast… AdobePre… MotionPlay Netflix SystemMgt VideoEnc % Reduction in Refresh Over Baseline UPPER BOUND On average 71% reduction in refresh count, very close to the upper bound of 75% 1 1.1 1.2 1.3 1.4 1.5 8Gb 16Gb 32Gb Speedup over Baseline (16 ms) 75% Reduction 60% Reduction 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 8Gb 16Gb 32Gb Speedup over Baseline (16 ms) 75% Reduction 60% Reduction Four-Core Single-Core MEMCON: PERFORMANCE IMPROVEMENT Leads to significant performance improvement 0 0.5 1 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 1638432768 P(RIL) > 1024 ms Current Interval Length (CIL) in ms ACBrotherHood AdobePhotoshop AllSysMark AVCHD Blur FinalCutProPlayback FinalMaster AdobePremierePro MotionPlayback Netflix SystemMgt VideoEncode If the interval is already 1024 ms long, the probability that the remaining interval is greater than 1024 ms is on average 76% What is the write interval that can amortize the cost? 0 1000 2000 3000 4000 5000 16 112 208 304 400 496 592 688 784 880 976 1072 1168 1264 1360 1456 1552 1648 1744 1840 Accumulated Cost in Latency (ns) Time (ms) Copy and Compare (64 ms) HI-REF (16 ms) 864 ms MinWriteInterval MEMCON: HIGH-LEVEL VISION MEMCON: MEMORY-CONTENT BASED DETECTION

Upload: others

Post on 30-Jan-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Samira Khan, Chris Wilkerson, ZheWang, AlaaAlameldeen ... · Samira Khan, Chris Wilkerson, ZheWang, AlaaAlameldeen, DonghyukLee, OnurMutlu University of Virginia, Intel, Nvidia, ETH

MEMCON: Detecting and Mitigating Data-Dependent Failures by Exploiting Current Memory Content

SamiraKhan,ChrisWilkerson,Zhe Wang,Alaa Alameldeen,Donghyuk Lee,Onur MutluUniversityofVirginia,Intel,Nvidia,ETHZurich

CHALLENGEINDETECTION

VISION:SYSTEM-LEVELDETECTIONANDMITIGATION

UnreliableDRAMCells

Detectand

Mitigate

ReliableSystemDetectandmitigateerrorsafter

thesystemhasbecomeoperationalONLINEPROFILING

UnreliableDRAMCells

BENEFITSOFONLINEPROFILING

ReliableDRAMCells

TechnologyScaling

1. Improvesyield,reducescost,enablesscalingVendorscanmakecellssmallerwithoutastrong

reliabilityguarantee

UnreliableDRAMCells

BENEFITSOFONLINEPROFILINGLO-REF

HI-REFHI-REF

LO-REFLO-REF

2.ImprovesperformanceandenergyefficiencyReducerefreshrate,refreshfaultyrowsmorefrequently

Reducerefreshcountbyusingalowerrefreshrate,butusehigherrefreshrateforfaultycells

0 1 0

Somecellscanfaildependingonthedatastoredinneighboringcells

FAILURE

DETECTIONISHARDDUETOINTERMITTENTFAILURES

1 1 1 NOFAILURE

DATA-DEPENDENTFAILURE

HOWTODETECTDATA-DEPENDENTFAILURES?

LINEARMAPPING X-1 X X+1

L D R0 1 0

SCRAMBLEDMAPPING X-4 X X+2

0 1 00 1 0

NOTEXPOSEDTOTHESYSTEM

TestwithspecificdatapatterninneighboringcellsHowtodetectdata-dependentfailures

whenweevendonotknowwhichcellsareneighbors?

SCRAMBLEDMAPPING

0 1 0

X-?X X+?

GOALDetects data-dependentfailureswithout theknowledgeoftheDRAMinternaladdressmapping

CURRENTDETECTIONMECHANISM

UnreliableDRAMCells

InitialFailureDetectionandMitigation ApplicationsinExecution

Detecteverypossiblefailurewithallcontentbeforeexecution

Patternx,CellAPatterny,CellBPatternz,CellC

…(Allpossiblefailingcell)

0 0 0 1 00 0 0 1 00 0 0 1 00 0 0 1 0

ListofFailures Applications

MEMCON:MEMORYCONTENT-BASEDDETECTIONANDMITIGATION

UnreliableDRAMCellswithProgramContent

SimultaneousDetectionandExecutionBasedoncurrentmemorycontentofrunningapplications

NONEEDTODETECTEVERYPOSSIBLEFAILURE

Currentcontent,CellA

0 1 0 1 00 1 0 0 01 0 0 1 00 0 0 0 1

Needtodetectandmitigateonlywiththecurrentcontent

ListofFailures Application

MEMCON:COST-BENEFITANALYSISCost:ExtramemoryaccessestoreadandwriterowsBenefit:Ifnofailurefound,canreducerefreshrate

Avg Cost

Avg Cost

16 msHI-REF

64 msLO-REF

Testing Testing Testing

Time

Cost

t1 t2 t3

16 msHI-REF

64 msLO-REF

Testing

16 msHI-REF

Time

Cost

t1 t2 t3Initiateatestonlywhenthecostcanamortized

FrequentTesting SelectiveTesting

Time

Howmuchlonger??

MEMCONselectively initiatestestingwhenthewriteintervalislongenough

toamortize thecostoftesting

1. Noinitialdetectionandmitigation

2. Startrunningtheapplicationwithahighrefreshrate

3. Detectfailureswiththecurrentmemorycontent• Ifnofailurefound,usealow

refreshrate

LO-REFHI-REFLO-REFLO-REF

Thelongertheelapsedtimeafterawriteà Thelongerthewriteinterval

WriteintervalsfollowaParetodistribution

WRITEINTERVALPREDICTION

Waitfor1024 ms

ExpectedRIL >1024ms

Time

Write Interval Length

Afterawrite,waitforaCIL,whereP(RIL)>1024ishighIfidle,predicttheintervalwilllastmorethan1024ms

MEMCON:REDUCTIONINREFRESHCOUNT

0255075

100

ACBrothe

r

Adob

ePh…

AllSysMark

AVCH

D

BlurMotion

Fina

lCutPro

Fina

lMast…

Adob

ePre…

MotionP

lay

Netflix

System

Mgt

Vide

oEnc

%Red

uctio

nin

Refre

shOverB

aseline

UPPERBOUND

Onaverage71%reductioninrefreshcount,

veryclosetotheupperboundof75%

1

1.1

1.2

1.3

1.4

1.5

8Gb 16Gb 32Gb

Speedu

pover

Baseline(16ms) 75%Reduction

60%Reduction

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

8Gb 16Gb 32Gb

Speedu

pover

Baseline(16ms) 75%Reduction

60%Reduction

Four-CoreSingle-Core

MEMCON:PERFORMANCEIMPROVEMENT

Leadstosignificantperformanceimprovement

0

0.5

1

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 1638432768

P(RIL)>102

4ms

CurrentIntervalLength(CIL)inmsACBrotherHood AdobePhotoshop AllSysMark AVCHDBlur FinalCutProPlayback FinalMaster AdobePremiereProMotionPlayback Netflix SystemMgt VideoEncode

Iftheintervalisalready1024ms long,theprobabilitythattheremainingintervalisgreaterthan1024ms isonaverage76%

Whatisthewriteintervalthatcanamortizethecost?

010002000300040005000

16 112

208

304

400

496

592

688

784

880

976

1072

1168

1264

1360

1456

1552

1648

1744

1840

Accumulated

Co

stinLaten

cy

(ns)

Time(ms)

CopyandCompare(64ms) HI-REF(16ms)

864msMinWriteInterval

MEMCON:HIGH-LEVELVISION

MEMCON:MEMORY-CONTENTBASEDDETECTION