basic program analysis - columbia university

28
Basic Program Analysis Suman Jana *some slides are borrowed from Baishakhi Ray and Ras Bodik

Upload: others

Post on 18-Jan-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Basic Program Analysis SumanJana

*someslidesareborrowedfromBaishakhiRayandRasBodik

Our Goal

ProgramAnalyzer

SourcecodeSecuritybugs

Programanalyzermustbeabletounderstandprogramproper?es(e.g.,canavariablebeNULLata

par?cularprogrampoint?)

Mustperformcontrolanddataflowanalysis

Do we need to implement control and data flow analysis from scratch?

• Mostmoderncompilersalreadyperformseveraltypesofsuchanalysisforcodeop?miza?on• Wecanhookintodifferentlayersofanalysisandcustomizethem• Wes?llneedtounderstandthedetails

•  LLVM(hNp://llvm.org/)isahighlycustomizableandmodularcompilerframework•  UserscanwriteLLVMpassestoperformdifferenttypesofanalysis•  Clangsta?canalyzercanfindseveraltypesofbugs•  Caninstrumentcodefordynamicanalysis

Compiler Overview

• AbstractSyntaxTree:SourcecodeparsedtoproduceAST• ControlFlowGraph:ASTistransformedtoCFG• DataFlowAnalysis:operatesonCFG

The Structure of a Compiler

5

scanner

parser

checker

codegen

Sourcecode(streamofcharacters)

streamoftokens

AbstractSyntaxTree(AST)

ASTwithannota?ons(types,declara?ons)

Machine/bytecode

SyntacBc Analysis

•  Input:sequenceoftokensfromscanner• Output:abstractsyntaxtree• Actually,

•  parserfirstbuildsaparsetree•  ASTisthenbuiltbytransla?ngtheparsetree•  parsetreerarelybuiltexplicitly;onlydeterminedby,say,howparserpushesstufftostack

6AdoptedFromUCBerkeley:Prof.BodikCS164Lecture5

Example

•  SourceCode 4*(2+3)

• ParserinputNUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR

• Parseroutput(AST):

AdoptedFromUCBerkeley:Prof.BodikCS164Lecture5 7

*

NUM(4) +

NUM(2) NUM(3)

Parse tree for the example: 4*(2+3)

AdoptedFromUCBerkeley:Prof.BodikCS164Lecture5 8

leavesaretokens

NUM(4)TIMESLPARNUM(2)PLUSNUM(3)RPAR

EXPR

EXPR

EXPR

Another example •  SourceCode

if(x==y){a=1;}• Parserinput

IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR

• Parseroutput(AST):

AdoptedFromUCBerkeley:Prof.BodikCS164Lecture5 9

IF-THEN

==

ID ID

=

ID INT

Parse tree for example: if (x==y) {a=1;}

AdoptedFromUCBerkeley:Prof.BodikCS164Lecture5 10

IFLPARID==IDRPARLBRID=INTSEMIRBR

EXPREXPR

STMT

BLOCK

STMT

leavesaretokens

Parse Tree

• Representa?onofgrammarsinatree-likeform.

•  Isaone-to-onemappingfromthegrammartoatree-form.

Aparsetreepictoriallyshowshowthestartsymbolofagrammarderivesastringinthe

language.…DragonBook

CStatement:returna+2

averyformalrepresenta?onthatstrictlyshowshowtheparser

understandsthestatementreturna+2;

Abstract Syntax Tree (AST)

•  Simplifiedsyntac?crepresenta?onsofthesourcecode,andthey'remostopenexpressedbythedatastructuresofthelanguageusedforimplementa?on

• Withoutshowingthewholesyntac?ccluNer,representstheparsedstringinastructuredway,discardingallinforma?onthatmaybeimportantforparsingthestring,butisn'tneededforanalyzingit.

ASTsdifferfromparsetreesbecausesuperficialdis?nc?onsofform,unimportantfortransla?on,donotappearinsyntaxtrees..…DragonBook

CStatement:returna+2

AST

Disadvantages of ASTs

• ASThasmanysimilarforms•  E.g.,for,while,repeat...un?l•  E.g.,if,?:,switch

•  ExpressionsinASTmaybecomplex,nested•  (x*y)+(z>5?12*z:z+20)

• Wantsimplerrepresenta?onforanalysis•  ...atleast,fordataflowanalysis

15

intx=1//what’sthevalueofx?//ASTtraversalcangivetheanswer,right?Whataboutintx;x=1;orintx=0;x+=1;?

Control Flow Graph & Analysis

RepresenBng Control Flow

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

High-level representation – Control flow is implicit in an AST

Low-level representation: – Use a Control-flow graph (CFG)

– Nodes represent statements (low-level linear IR) – Edges represent explicit flow of control

What Is Control-Flow Analysis?

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

1 2

a := 0 b := a * b

3 L1: c := b/d

4 5 6

if c < x goto L2 e := b / c f := e + 1

7 L2: g := f

8 9

h := t - g if e > 0 goto L3

10 goto L1 11 L3: return

a := 0 b := a * b

e := b / c f : e + 1

g:=fh:=t–gIfe>0?

goto return

c := b / d c < x?

1

3

5

7

1110

Yes No

Basic Blocks

•  A basic block is a sequence of straight line code that can be entered only at the beginning and exited only at the end

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

g:=fh:=t–gIfe>0?

Building basic blocks – Identify leaders

– The first instruction in a procedure, or – The target of any branch, or – An instruction immediately following a branch (implicit target)

– Gobble all subsequent instructions until the next leader

Basic Block Example

1 2

a := 0 b := a * b

3 L1: c := b/d

4 5 6

if c < x goto L2 e := b / c f := e + 1

7 L2: g := f

8 9

h := t - g if e > 0 goto L3

10 goto L1 11 L3: return

Leaders? – {1, 3, 5, 7, 10, 11}

Blocks? – {1, 2} – {3, 4} – {5, 6} – {7, 8, 9} – {10} – {11}

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Building a CFG From Basic Block

a := 0 b := a * b

e := b / c f : e + 1

g:=fh:=t–gIfe>0?

goto return

c := b / d c < x?

1

3

5

7

1110

Yes No

Construc=on– EachCFGnoderepresentsabasicblock– Thereisanedgefromnodeitojif– Laststatementofblockibranchestothefirststatementofj,or– Blockidoesnotendwithanuncondi?onalbranchandisimmediatelyfollowedinprogramorderbyblockj(fallthrough)

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Looping

preheader

head

tail exitedge

Exitedge

backedge

entryedge

Loop

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Why?backedgesindicatethatwemightneedtotraversetheCFGmorethanoncefordataflowanalysis

Looping

preheader

head

tail exitedge

Exitedge

backedge

entryedge

Loop

Notallloopshavepreheaders–Some?mesitisusefultocreatethem

Withoutpreheadernode–Therecanbemul?pleentryedges

Withsinglepreheadernode–Thereisonlyoneentryedge

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Dominators

•  d dom i if all paths from entry to node i include d

• Strict Dominator (d sdom i) •  If d dom i, but d != i

•  Immediate dominator (a idom b) •  a sdom b and there does not exist any node c such that a != c, c != b, a dom c,

c dom b

• Post dominator (p pdom i) •  If every possible path from i to exit includes p

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

IdenBfying Natural Loops and Dominators

• BackEdge•  A back edge of a natural loop is one whose target dominates its source

• NaturalLoop•  The natural loop of a back edge (m→n), where n dominates m, is the set of

nodes x such that n dominates x and there is a path from x to m not containing n

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Reducibility •  ACFGisreducible(well-structured)ifwecanpar??onitsedgesintotwodisjointsets,theforwardedgesandthebackedges,suchthat – Theforwardedgesformanacyclicgraphinwhicheverynodecanbereachedfromtheentrynode– Thebackedgesconsistonlyofedgeswhosetargetsdominatetheirsources

•  Structuredcontrol-flowconstructsgiverisetoreducibleCFGsValueofreducibility:– Dominanceusefuliniden?fyingloops– Simplifiescodetransforma?ons(everyloophasasingleheader)– Permits interval analysis

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Handling Irreducible CFG’s

• Nodesplivng•  CanturnirreducibleCFGsintoreducibleCFGs

a

b

c d

e

b

c

a

d

dʹ e

Generalidea– Reducegraph(itera?velyremoveselfedges,mergenodeswithsinglepred)– Morethanonenode=>irreducible–Splitanymul?-parentnodeandstartover

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)

Why go through all this trouble? • Modernlanguagesprovidestructuredcontrolflow– Shouldn’tthecompilerrememberthisinforma?onratherthanthrowitawayandthenre-computeit?

•  Answers?– Wemaywanttoworkonthebinarycode– Mostmodernlanguagess?llprovideagotostatement– Languages typically providemul?ple types of loops. This analysis letsus treatthemalluniformly– Wemaywantacompilerwithmul?plefrontendsformul?plelanguages;ratherthantransla?ngeachlanguagetoaCFG,translateeachlanguagetoacanonicalIRandthentoaCFG

AdoptedFromUPennCIS570:ModernProgrammingLanguageImplementa?on(Autumn2006)