scalable dynamic analysis for automated fault location and avoidance

53
Scalable Dynamic Analysis for Automated Fault Location and Avoidance Rajiv Gupta Funded by NSF grants from CPA, CSR, & CRI programs and grants from Microsoft Research

Upload: stone-joyner

Post on 03-Jan-2016

38 views

Category:

Documents


3 download

DESCRIPTION

Scalable Dynamic Analysis for Automated Fault Location and Avoidance. Rajiv Gupta. Funded by NSF grants from CPA , CSR , & CRI programs and grants from Microsoft Research. Motivation. Software bugs cost the U.S. economy about $59.5 billion each year [ NIST 02 ] . Embedded Systems - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Scalable Dynamic Analysis for Automated Fault Location and

Avoidance

Rajiv Gupta

Funded by NSF grants from CPA, CSR, & CRI programsand grants from Microsoft Research

Page 2: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Motivation

Software bugs cost the U.S. economy about $59.5 billion each year [NIST 02].

Embedded Systems• Mission Critical / Safety Critical Tasks • A failure can lead to Loss of Mission/Life.

(Ariane 5) arithmetic overflow led to shutdown of guidance computer.(Mars Climate Orbiter) missed unit conversion led to faulty navigation data.(Mariner I) missing superscripted bar in the specification for the guidance

program led to its destruction 293 seconds after launch. (Mars Pathfinder) priority inversion error causing system reset.(Boeing 747-400) loss of engine & flight displays while in flight.(Toyota hybrid Prius) VSC, gasoline-powered engine shut off. (Therac-25) wrong dosage during radiation therapy.…….

Page 3: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Overview

Dynamic SlicingOffline

Environment FaultsOnline

Program Execution

Fault

Fault Location

Fault Avoidance

ScalabilityTracing + Logging

Long-runningMulti-threaded

Page 4: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Fault Location

Goal:

Assist the programmer in debugging by automatically narrowing the fault to a small section of the code.

Dynamic Information• Data dependences• Control dependences• Values

Execution Runs• One failed execution & • Its perturbations

Page 5: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Dynamic Information

……

ExecutionProgram Dynamic Dependence Graph

Data Control

Page 6: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Approach

Detect execution of statement s such that Faulty code Affects the value computed by s; or Faulty code is Affected-by the value computed by s

through a chain of dependences.

Estimate the set of potentially faulty statements from s:

Affects: statements from which s is reachable in the dynamic dependence graph. (Backward Slice)

Affected-by: statements that are reachable from s in the dynamic dependence graph. (Forward Slice)

Intersect slices to obtain a smaller fault candidate set.

Page 7: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Backward & Forward Slices

Erroneous Output

Failure inducingInputBackward Slice

[Korel&Laski,1988] Forward Slice [ASE-05]

Page 8: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Failure InducingInput

ErroneousOutput

[ASE-05]

For memory bugs the number of statements is very small (< 5).

Backward & Forward Slices

Page 9: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Bidirectional Slices

BackwardSlice of CP

ForwardSlice of CP

+ BidirectionalSliceCombined

Slice

[ICSE-06]

• Found critical predicates in 12 out of 15 bugs• Search for critical predicate: Brute force: 32 predicates to 155K predicates; After Filtering and Ordering: 1 to 7K predicates.

Critical Predicate:

An execution instance

of a predicate such

that changing its

outcome “repairs” the

program state.

Page 10: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Pruning Slices

Confidence in v

v

C(v): [0,1]

0 - all values of v produce same

1 - any change in v will change

How? Value profiles.

1 1

1

[PLDI-06]

Page 11: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Test Programs

Real Reported Bugs Injected Bugs

Nine logical bugs (incorrect ouput)

• Unix utilities grep 2.5, grep 2.5.1,

flex 2.5.31, make 3.80.

Six memory bugs (program crashes)

• Unix utilities gzip, ncompress,

polymorph, tar, bc, tidy.

Siemens Suite (numerous versions)

schedule, schedule2, replace, print_tokens..

• Unix utilities gzip, flex

Page 12: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Dynamic Slice Sizes

Buggy Runs BS FS BiS

flex 2.5.31(a) 695 605 225flex 2.5.31(b) 272 257 NA

flex 2.5.31(c) 50 1368 NA

grep 2.5 NA 731 88

grep 2.5.1(a) NA 32 111

grep 2.5.1(b) NA 599 NA

grep 2.5.1(c) NA 12 453

make 3.80(a) 981 1239 1372

make 3.80(b) 1290 1646 1436

gzip-1.2.4 34 3 39

ncompress-4.2.4 18 2 30

polymorph-0.4.0 21 3 22

tar 1.13.25 105 202 117

bc 1.06 204 188 267

tidy 554 367 541

Page 13: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Combined Slices

Buggy Runs BS BS^FS^BiS (%BS)

flex 2.5.31(a) 695 27 (3.9%)

flex 2.5.31(b) 272 102 (37.5%)

flex 2.5.31(c) 50 5 (10%)

grep 2.5 NA 86 (7.4%*EXEC)

grep 2.5.1(a) NA 25 (4.9%*EXEC)

grep 2.5.1(b) NA 599 (53.3%*EXEC)

grep 2.5.1(c) NA 12 (0.9%*EXEC)

make 3.80(a) 981 739 (81.4%)

make 3.80(b) 1290 1051 (75.3%)

gzip-1.2.4 34 3 (8.8%)

ncompress-4.2.4 18 2 (14.3%)

polymorph-0.4.0 21 3 (14.3%)

tar 1.13.25 105 45 (42.9%)

bc 1.06 204 102 (50%)

tidy 554 161 (29.1%)

Page 14: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Evaluation of Pruning

Program Description LOC Versions Tests

print_tokens Lexical analyzer 565 5 4072

print_tokens2 Lexical analyzer 510 5 4057replace Pattern replacement 563 8 5542

schedule Priority scheduler 412 3 2627schedule2 Priority scheduler 307 3 2683

gzip Unix utility 8009 1 1217flex Unix utility 12418 8 525

Siemen’s Suite

• Single error is injected in each version.• All the versions are not included:

No output or the very first output is wrong; Root cause is not contained in the BS (code missing error).

Page 15: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Program BS Pruned Slice Pruned Slice / BS

print_tokens 110 35 31.8%

print_tokens2 114 55 48.2%replace 131 60 45.8%

schedule 117 70 59.8%schedule2 90 58 64.4%

gzip 357 121 33.9%flex 727 27 3.7%

Evaluation of Pruning

Page 16: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Effectiveness

Backward

Slice

[AADEBUG-05]

≈ 31% of Executed

Statements

Combined Slice

[ASE-05,ICSE-06]

≈ 36% of Backward

Slice≈ 11% of Exec.

Erroneous outputFailure inducing inputCritical predicate

Pruned Slice

[PLDI-06]

≈ 41% of Backward

Slice≈ 13% of Exec.

ConfidenceAnalysis

Page 17: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Effectiveness

Slicing is effective in locating faults.

No more than 10 static statements had to be inspected.

Program-bug Inspected Stmts.

mutt – heap overflow 8

pine – stack overflow 3

pine – heap overflow 10

mc – stack overflow 2

squid – heap overflow 5

bc – heap overflow 3

Page 18: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Execution Omission Errors

X=

X=

=X

A =

A<0

X=

X=

=X

A =

A<0

X=

=X

A =

A<0

• Inspect pruned slice.• Dynamically detect an Implicit dependence.• Incrementally expand the pruned slice.

[PLDI-07]

Implicitdependence

Page 19: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Scalability of Tracing

Dynamic Information Needed• Dynamic Dependences

for all slicing• Values for Confidence Analysis

for pruning slices

annotates the static program representation

Whole Execution Trace (WET)

Trace Size• ≈ 15 Bytes / Instruction

Page 20: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Trace Sizes & Collection Overheads

Trace sizes are very large for even 10s of execution.

Program Running Time

Dep. Trace

Collection Time

mysql 13 s 21 GB 2886 s

prozilla 8 s 6 GB 2640 s

proxyC 10 s 456 MB 880 s

mc 10 s 55 GB 418 s

mutt 20 s 388 GB 3238 s

pine 14 s 156 GB 2088 s

squid 15 s 88 GB 1132 s

Page 21: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Compacting Whole Execution Traces

Explicitly remember dynamic control flow trace. Infer as many dynamic dependences as possible

from control flow (94%), remember the remaining dependences explicitly (≈ 6%). Specialized graph representation to enable inference.

Explicitly remember value trace. Use context-based method to compress dynamic

control flow, value, and address trace. Bidirectional traversal with equal ease

[MICRO-04, TACO-05]

Page 22: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Input: N=2

51: for I=1 to N do61: if (i%2==0) then

71: p=&a

81: a=a+191: z=2*(*p)

101: print(z)

11: z=021: a=031: b=2

41: p=&b

52: for I=1 to N do

62: if (i%2==0) then

82: a=a+192: z=2*(*p)

1: z=02: a=03: b=24: p=&b5: for i = 1 to N do6: if ( i %2 == 0) then7: p=&a endif endfor8: a=a+19: z=2*(*p)10: print(z)

Dependence Graph Representation

Page 23: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

5:for i=1 to N

6:if (i%2==0) then

7: p=&a

8: a=a+1

9: z=2*(*p)

10: print(z)

T F

1: z=0

2: a=0

3: b=2

4: p=&b

T

Input: N=2

11: z=0

21: a=0

31: b=2

41: p=&b

51: for i = 1 to N do

61: if ( i %2 == 0) then

81: a=a+1

91: z=2*(*p)

52: for i = 1 to N do

62: if ( i %2 == 0) then

71: p=&a

82: a=a+1

92: z=2*(*p)

101: print(z)

T1

2

3

4

5

6

7

8

9

10

11

12

13

14

<3,8><2,7>

<7,12>

<11,13>

<13,14>

<4,8>

<12,13>

<5,6><9,10>

<10,11>

<5,7><9,12>

<5,8><9,13>

F

Dependence Graph Representation

Page 24: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

1

2

3

4 5

6

7 8

9

10

1

2

3

4

5

6

7

2

1

10

34679

35679

34689

35689

1

2

3

2

1

10

34679

1

2

3

34 5

67 8

9

Transform: Traces of Blocks

Page 25: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Infer: Local Dependence Labels

X =

Y= X

X =

Y= X

(10,10)

(20,20)

(30,30)

10,20,30

X =

Y= X

=Y

21

(20,21)

...

(...,20)

...

Page 26: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Transform: Local Dep. Labels

X =

*P =

Y= X

X =

*P =

Y= X

(10,10)

(20,20)

10,20

X =

*P =

Y= X

(20,20)

Page 27: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Transform: Local Dep. Labels

X =

*P =

Y= X

X =

*P =

Y= X

(10,10)

(20,20) X =

*P =

Y= X

10,20

X =

*P =

Y= X

2010

11,21

(10,11) (20,21)

=Y

11,21

=Y

(20,21) (10,11)

Page 28: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Group: Non-Local Dep. Edges

X =

Y =

= Y

= X

X =

Y =

X =

Y =

= Y

= X

X =

Y =

(10,21)

(10,21)

(20,11)

(20,11)

X =

Y =

= Y

= X

X =

Y =

(20,11)

(10,21)

11,21

10

20

Page 29: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Compacted WET Sizes

ProgramStatementsExecuted(Millions)

WET Size (MB) Before /

After Before After

300.twolf

256.bzip2

255.vortex

197.parser

181.mcf

164.gzip

130.li

126.gcc

099.go

Average

690

751

609

615

715

650

740

365

685

647

10,666

11,921

8,748

8,730

10,541

9,688

10,399

5,238

10,369

9,589

647

221

105

188

416

538

203

89

575

331

16.49

54.02

83.63

46.34

25.33

18.02

51.22

58.84

18.04

41.33

≈ 4 Bits / Instruction

Page 30: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

[PLDI-04]vs.

[ICSE-03]

Slicing Times

Page 31: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Dep. Graph Generation Times

Offline post-processing after collecting address and control flow traces ≈ 35x of execution time

Online techniques [ICSM 2007] Information Flow: 9x to18x slowdown Basic block Opt.: 6x to10x slowdown Trace level Opt.: 5.5x to 7.5x slowdown Dual Core: ≈1.5x slowdown

Online Filtering techniques Forward slice of all inputs User-guided bypassing of functions

Page 32: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Reducing Online Overhead

Record non-deterministic events online• Less than 2x overhead• Deterministic replay of executions

Trace faulty executions off-line• Replay the execution• Switch on tracing• Collect and inspect traces

Trace analysis is still a problem• The traces correspond to huge executions• Off-line overhead of trace collection is still significant

Page 33: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Reducing Trace Sizes

Checkpointing Schemes Trace from the most recent checkpoint

• Checkpoints are of the order of minutes.• Better but the trace sizes are still very large.

Exploiting Program Characteristics Multithreaded and server-like [ISSTA-07, FSE-06]

• Examples : mysql, apache.• Each request spawns a new thread.• Do not trace irrelevant threads.

Page 34: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Beyond Tracing

Checkpoint: capture memory image.

Execute and Record (log) Events.

Upon Crash, Rollback to checkpoint. Reduce log and Replay execution using reduced log. Turn on tracing during replay.

xCheckpoint log

x TraceReducedlog

Applicable to Multithreaded Programs

[ISSTA-07]

Page 35: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

An Example

A mysql bug • “ load …” command will crash the server if database is not

specified

sql/mysql_load.cc:int mysql_load (THD *thd,...){

…150 if( …151 … +strlen(thd->db) + 3 <152 FN_REFLEN) ...}

Without typing “use database_name”, thd->db is Null.

Page 36: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Example – Execution and log file

open path=/etc/my.cnf

Wait for connection

Create Thread 1

Wait for command

Create Thread 2

Wait for command

Recv “show databases”

Handle command

Recv “load data …”

Handle -- (server crashes)

Recv “use test; select * from b”

Handle command

Run mysql server

User 1 connects to the server

User 2 connects to the server

User 1: “show databases”

User 2: “use test”

“ select * from b”

User 1: “load data into table1”

Time

Blue – T0

Red – T1

Green – T2

Gray - Scheduler

Page 37: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Execution Replay using Reduced log

open path=/etc/my.cnf

Wait for connection

Create Thread 1

Recv “load data …”

Handle -- (server crashes)

Run mysql server

User 1 connects to the server

User 2 connects to the server

User 1: “show databases”

User 2: “show databases”

“ select * from b”

User 1: “load data into table1”

Time

Page 38: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Execution Reduction

Effects of Reduction Irrelevant Threads Replay-only vs. Replay & Trace

How? By identifying Inter-thread Dependences Event Dependences - found using the log File Dependences - found using the log Shared-Memory Dependences - found using replay

Space requirement reduced by 4x

Time requirement reduced by 2x

Naïve approach requires thread id of last writer of each address Space and time efficient detection

o Memory Regions: Non-shared vs sharedo Locality of References to Regions

Page 39: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Experimental Results

Page 40: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Original OptimizedProgram-bug

Trace SizesNum. of

dependences

Experimental Results

Page 41: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Orig. OPT.Program-bug

Execution Times

(seconds)

Logging

Experimental Results

Page 42: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Debugging System

SlicingModule

WETSlicesApplication

binary

ExecutionEngineValgrind

Input Output

Traces

Instrumentcode

CompressedTrace

Static BinaryAnalyzer

Diablo

Control Dependence

ReducedLog

RecordReplayJockey

Checkpoint+

log

Page 43: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Fault Avoidance

Large number of faults in server programs are caused by the environment.

• 56 % of faults in Apache server.

Types of Faults Handled Atomicity Violation Faults.

• Try alternate scheduling decisions.

Heap Buffer Overflow Faults.• Pad memory requests.

Bad User Request Faults.• Drop bad requests.

Avoidance Strategy Recover first time, Prevent later.

• Record the change that avoided the fault.

Page 44: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Experiments

Program Type of Bug Env. Change # of Trials

Time taken (secs.)

mysql-1 Atomicity Violn. Scheduler 1 130

mysql-2 Atomicity Violn. Scheduler 1 65

mysql-3 Atomicity Violn. Scheduler 1 65

mysql-4 Buffer Overflow. Mem. Padding 1 700

pine-1 Buffer Overflow. Mem. Padding 1 325

pine-2 Buffer Overflow. Mem. Padding 1 270

mutt-1 Bad User Req. Drop Req. 3 205

bc-1 Bad User Req. Drop Req. 3 290

bc-2 Bad User Req. Drop Req. 3 195

Page 45: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Summary

Dynamic SlicingOffline

Environment FaultsOnline

Program Execution

Fault

Fault Location

Fault Avoidance

ScalabilityTracing + Logging

Long-runningMulti-threaded

Page 46: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Dissertations

Xiangyu Zhang, Purdue University Fault Location Via Precise Dynamic

Slicing, 2006. SIGPLAN Outstanding Doctoral Dissertation Award

Sriraman Tallam, Google

Fault Location and Avoidance in Long-Running Multithreaded Programs, 2007.

Page 47: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Ongoing Work

Monitoring Parallel Applications On Multicores: Vijay Nagarajan [ISCA’09]

Debugging Parallel Applications

State-based Approach: Dennis Jeffrey [ISSTA’08a]

Race-detection in Parallel Applications:

Chen Tian & Vijay Nagarajan [ISSTA’08b]

Optimistic Parallelization for Multicores Speculation: Min Feng & Chen Tian [MICRO’08]

Page 48: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance
Page 49: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Breakdowns of Different Optimizations

Infer

Transform

Group

Others

Explicit

Page 50: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Reducing Trace Sizes

Checkpointing Schemes Trace from the most recent checkpoint

• Checkpoints are of the order of minutes.• Better but the trace sizes are still very large.

Exploiting Program Characteristics Multithreaded and server-like [ISSTA-07]

• Examples : mysql, apache.• Each request spawns a new thread.• Do not trace irrelevant threads.

Single-threaded and event processing [FSE-06]• Examples : pine, mutt, squid.• Events are independent.• Do not trace irrelevant events.

Page 51: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

Example of a Fault

Input:• > ./mysqld –log-bin= binlog• > ./mysql -u root -D test -e 'delete from b' &• > ./mysql -u root -D test -e 'insert into b values (1)' &

Output Observation• > ./mysqlbinlog binlog

set timestamp = 1152573474 insert into b values (1); set timestamp = 1152573470 delete from b

Page 52: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

File : mysql_delete.cc

mysql_delete(THD *thd, ...)

...

152 error=generate_table(thd, ...);

...

generate_table(THD *thd, ...)

...

81 pthread_mutex_lock(...);

... // Critical Section

105 pthread_mutex_unlock(...);

108 ... // Logging not locked

109 mysql_update_log.write(thd,...);

...

Log of an execution

(scheduling decision)

TEI 1 Recv “delete …”

Handle it but hasn’t logged it

TEI 2 Recv “insert”

Handle it (logged insert)

TEI 3 Logging delete

Red – T1

Green – T2

Gray - Scheduler

Example of a Fault

Page 53: Scalable Dynamic Analysis for   Automated Fault Location and Avoidance

System for Fault Avoidance

Put the server program into a 3-phase system• Log the original execution• Replay execution and apply the environmental changes• Use environment patch to prevent the fault from occurring again

Logging – jockey system. Apply environment changes – valgrind system.

Logging PhaseOrig. Execution

EventLog

Fault AvoidancePhase

(Env. Changes)Env. Fault

Patch

Prevention-LoggingPhase

Fault Avoided

FAULT REPLAY & AVOID

NEWFAULT

OLDFAULT

RECORDPATCH

[COMPSAC-08]