
Analyses and Optimizations for Multithreaded Programs

Martin Rinard, Alex Salcianu, Brian Demsky

MIT Laboratory for Computer Science

John Whaley, IBM Tokyo Research Laboratory

Motivation

• Threads are Ubiquitous
  • Parallel Programming for Performance
  • Manage Multiple Connections
  • System Structuring Mechanism
• Overhead
  • Thread Management
  • Synchronization
• Opportunities
  • Improved Memory Management

What This Talk is About

• New Abstraction: Parallel Interaction Graph
  • Points-To Information
  • Reachability and Escape Information
  • Interaction Information
    • Caller-Callee Interactions
    • Starter-Startee Interactions
  • Action Ordering Information
• Analysis Algorithm
• Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)

Outline

• Example
• Analysis Representation and Algorithm
• Lightweight Threads
• Results
• Conclusion

Sum Sequence of Numbers

[Figure: the sequence 9 8 1 5 3 7 2 6]

Group in Subsequences

[Figure: the sequence grouped into subsequences (9 8) (1 5) (3 7) (2 6)]

Sum Subsequences (in Parallel)

[Figure: each subsequence is summed in parallel, producing the partial sums 17, 6, 10, and 8]

Add Sums Into Accumulator

[Figure: the partial sums are added into a shared accumulator, whose value goes from 0 to 17, 23, 33, and finally 41]

Common Schema

• Set of tasks
• Chunk tasks to increase granularity
• Tasks have both
  • Independent computation
  • Updates to shared data

Realization in Java

class Accumulator {
  int value = 0;
  synchronized void add(int v) { value += v; }
}

Realization in Java

class Task extends Thread {
  Vector work;
  Accumulator dest;

  Task(Vector w, Accumulator d) { work = w; dest = d; }

  public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
      sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
  }
}

[Figure: a Task object whose work field refers to a Vector of Integer objects and whose dest field refers to the shared Accumulator (value 0); run creates an Enumeration over the Vector]

Realization in Java

void generateTask(int l, int u, Accumulator a) {
  Vector v = new Vector();
  for (int j = l; j < u; j++)
    v.addElement(new Integer(j));
  Task t = new Task(v, a);
  t.start();
}

void generate(int n, int m, Accumulator a) {
  for (int i = 0; i < n; i++)
    generateTask(i*m, (i+1)*m, a);
}

Task Generation

[Figure sequence: each call to generateTask builds a new Vector of Integer objects, creates a Task whose work field refers to that Vector and whose dest field refers to the shared Accumulator, and starts it; successive calls produce more Task/Vector pairs, all sharing the one Accumulator]

Analysis

Analysis Overview

• Interprocedural
• Interthread
• Flow-sensitive
  • Statement ordering within thread
  • Action ordering between threads
• Compositional, Bottom Up
• Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
• Partial Program Analysis

Analysis Result for run Method

public void run() {
  int sum = 0;
  Enumeration e = work.elements();
  while (e.hasMoreElements())
    sum += ((Integer) e.nextElement()).intValue();
  dest.add(sum);
}

[Figure: points-to graph for run — the this (Task) node has a work edge to a Vector node and a dest edge to an Accumulator node; an Enumeration node refers to the Vector]

• Abstraction: Points-to Graph
  • Nodes Represent Objects
  • Edges Represent References
• Inside Nodes
  • Objects Created Within Current Analysis Scope
  • One Inside Node per Allocation Site
  • Represents All Objects Created at That Site
• Outside Nodes
  • Objects Created Outside Current Analysis Scope
  • Objects Accessed via References Created Outside Current Analysis Scope
  • One per Static Class Field, One per Parameter, One per Load Statement
    • A Load Node Represents the Objects Loaded at That Statement
• Inside Edges
  • References Created Inside Current Analysis Scope
• Outside Edges
  • References Created Outside Current Analysis Scope
  • Potential Interactions in Which the Analyzed Part Reads a Reference Created in the Unanalyzed Part

Concept of Escaped Node

• Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope
  • parameter nodes, load nodes
  • static class field nodes
  • nodes passed to unanalyzed methods
  • nodes reachable from unanalyzed but started threads
  • nodes reachable from escaped nodes
• Node is Captured if it is Not Escaped
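The escaped/captured distinction is a reachability closure over the points-to graph. Below is a minimal, self-contained sketch of that closure; the generic node type and the adjacency map are illustrative stand-ins for the analysis's node and edge sets, not the actual Flex compiler data structures.

import java.util.*;

class EscapeSketch {
  // A node escapes if it is directly escaping (parameter, load, static
  // field, passed to an unanalyzed callee, reachable from an unanalyzed
  // started thread) or is reachable from an escaping node.
  static <N> Set<N> escaped(Map<N, Set<N>> edges, Set<N> directlyEscaping) {
    Set<N> esc = new HashSet<>(directlyEscaping);
    Deque<N> work = new ArrayDeque<>(directlyEscaping);
    while (!work.isEmpty()) {
      N n = work.pop();
      for (N m : edges.getOrDefault(n, Collections.emptySet()))
        if (esc.add(m)) work.push(m);   // newly escaped, follow its edges too
    }
    return esc;                          // a node is captured iff it is not in esc
  }
}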

Why Escaped Concept is Important

• Completeness of Analysis Information
  • Complete information for captured nodes
  • Potentially incomplete for escaped nodes
• Lifetime Implications
  • Captured nodes are inaccessible when analyzed part of the program terminates
  • Memory Management Optimizations
    • Stack allocation
    • Per-Thread Heap Allocation

Intrathread Dataflow Analysis

• Computes a points-to escape graph for each program point
• Points-to escape graph is a triple <I, O, e>
  • I - set of inside edges
  • O - set of outside edges
  • e - escape information for each node

Dataflow Analysis

• Initial state:
  • I: formals point to parameter nodes, classes point to class nodes
  • O: Ø
• Transfer functions:
  • I´ = (I – KillI) U GenI
  • O´ = O U GenO
• Confluence operator is U
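A minimal sketch of this representation, with illustrative class names (not the actual Flex compiler data structures); the slides fold variable edges into I, while the sketch keeps a separate variable map for clarity and represents e as a set of escaped nodes.

import java.util.*;

enum NodeKind { INSIDE, PARAMETER, LOAD, CLASS }

final class PNode {
  final NodeKind kind;
  final String site;                      // allocation site, parameter name, field, ...
  PNode(NodeKind k, String s) { kind = k; site = s; }
}

final class PEdge {
  final PNode from; final String field; final PNode to;
  PEdge(PNode f, String fld, PNode t) { from = f; field = fld; to = t; }
  public boolean equals(Object o) {
    return o instanceof PEdge && ((PEdge) o).from == from
        && ((PEdge) o).field.equals(field) && ((PEdge) o).to == to;
  }
  public int hashCode() { return System.identityHashCode(from) ^ field.hashCode(); }
}

final class PointsToEscapeGraph {
  final Set<PEdge> inside = new HashSet<>();    // I: references created in the analyzed part
  final Set<PEdge> outside = new HashSet<>();   // O: references read from the unanalyzed part
  final Map<String, Set<PNode>> vars = new HashMap<>();  // local/formal -> nodes it may point to
  final Set<PNode> escaped = new HashSet<>();   // e: escape information

  Set<PNode> succ(String var) {
    return vars.getOrDefault(var, Collections.emptySet());
  }

  // Confluence operator at control-flow merges: component-wise union.
  void mergeWith(PointsToEscapeGraph other) {
    inside.addAll(other.inside);
    outside.addAll(other.outside);
    escaped.addAll(other.escaped);
    other.vars.forEach((v, ns) ->
        vars.computeIfAbsent(v, k -> new HashSet<>()).addAll(ns));
  }
}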

Intraprocedural Analysis

• Must define transfer functions for:
  • copy statement l = v
  • load statement l1 = l2.f
  • store statement l1.f = l2
  • return statement return l
  • object creation site l = new cl
  • method invocation l = l0.op(l1…lk)

copy statement l = v

KillI = edges(I, l)
GenI = {l} × succ(I, v)
I´ = (I – KillI) U GenI

[Figure: the existing edges out of l are killed; l then points to the same nodes as v]
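On the graph sketch above, the copy transfer is a strong update of l's variable edges (illustrative method added to the PointsToEscapeGraph sketch):

// copy statement  l = v : kill the edges out of l, then make l point
// to everything v points to ((I - KillI) U GenI for variable edges).
void copy(String l, String v) {
  vars.put(l, new HashSet<>(succ(v)));
}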

load statement l1 = l2.f

SE = {n2 in succ(I, l2) . escaped(n2)}
SI = U {succ(I, n2, f) . n2 in succ(I, l2)}

case 1: l2 does not point to an escaped node (SE = Ø)
  KillI = edges(I, l1)
  GenI = {l1} × SI

case 2: l2 does point to an escaped node (SE ≠ Ø)
  KillI = edges(I, l1)
  GenI = {l1} × (SI U {n})
  GenO = (SE × {f}) × {n}
  where n is the load node for l1 = l2.f

[Figures: in case 1, l1 is redirected to the f-successors of the nodes l2 points to; in case 2, an outside edge labeled f is also added from each escaped node to the load node n, and l1 may point to n]
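A sketch of the load transfer on the same illustrative graph: if l2 may point to an escaped node, the load node n for this statement is introduced, outside edges are generated from the escaped nodes, and l1 may also point to n.

// load statement  l1 = l2.f
void load(String l1, String l2, String f, PNode loadNode) {
  Set<PNode> si = new HashSet<>();              // SI: f-successors through inside edges
  boolean anyEscaped = false;                    // SE nonempty?
  for (PNode n2 : succ(l2)) {
    for (PEdge e : inside)
      if (e.from == n2 && e.field.equals(f)) si.add(e.to);
    if (escaped.contains(n2)) {
      anyEscaped = true;
      outside.add(new PEdge(n2, f, loadNode));   // GenO = (SE × {f}) × {n}
    }
  }
  if (anyEscaped) {
    si.add(loadNode);                            // case 2: GenI = {l1} × (SI U {n})
    escaped.add(loadNode);                       // load nodes are escaped nodes
  }
  vars.put(l1, si);                              // KillI = edges(I, l1), then GenI
}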

store statement l1.f = l2

GenI = (succ(I, l1) × {f}) × succ(I, l2)
I´ = I U GenI

[Figure: an inside edge labeled f is added from each node l1 points to, to each node l2 points to]
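On the same sketch, the store transfer is a weak update, with no kill:

// store statement  l1.f = l2 : GenI = (succ(I, l1) × {f}) × succ(I, l2)
void store(String l1, String f, String l2) {
  for (PNode from : succ(l1))
    for (PNode to : succ(l2))
      inside.add(new PEdge(from, f, to));
}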

object creation site l = new cl

KillI = edges(I, l)
GenI = {<l, n>}
where n is the inside node for l = new cl

[Figure: l's existing edges are killed and l points to the inside node n for this allocation site]
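And the allocation transfer, where every object created at the site is represented by its single inside node (illustrative method on the same sketch):

// object creation site  l = new cl : kill edges out of l, then l -> n,
// where n is the one inside node for this allocation site.
void alloc(String l, PNode insideNodeForSite) {
  Set<PNode> target = new HashSet<>();
  target.add(insideNodeForSite);
  vars.put(l, target);
}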

Method Call

• Analysis of a method call:
  • Start with points-to escape graph before the call site
  • Retrieve the points-to escape graph from analysis of callee
  • Map outside nodes of callee graph to nodes of caller graph
  • Combine callee graph into caller graph
• Result is the points-to escape graph after the call site

[Figure sequence: the points-to escape graph before the call to t = new Task(v,a) is combined with the graph from the analysis of Task(w,d)]

• Start With Graph Before Call
• Retrieve Graph from Callee
• Map Parameters from Callee to Caller
• Transfer Edges from Callee to Caller
• Discard Parameter Nodes from Callee

More General Example

[Figure sequence: the points-to escape graph before the call to x.foo() is combined with the graph from the analysis of foo()]

• Initialize Mapping: Map Formals to Actuals
• Extend Mapping: Match Inside and Outside Edges
  • Mapping is Unidirectional: From Callee to Caller
• Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
• Combine Mapping: Project Edges from Callee Into Combined Graph
• Discard Callee Graph
• Discard Outside Edges From Captured Nodes
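A rough sketch of this combination step, written as a static helper alongside the illustrative representation from earlier: map the callee's parameter nodes to the actuals' nodes, grow the mapping by matching callee outside edges against caller inside edges, then project the callee's inside edges into the caller through the mapping. It omits return values, escape propagation, and the constraint-satisfaction details the later slides mention.

// Combine the callee's graph into the caller's at a call site.
static void combineCall(PointsToEscapeGraph caller, PointsToEscapeGraph callee,
                        List<String> actuals, List<PNode> paramNodes) {
  Map<PNode, Set<PNode>> mu = new HashMap<>();
  // 1. Initialize mapping: map formals (parameter nodes) to actuals.
  for (int i = 0; i < paramNodes.size(); i++)
    mu.put(paramNodes.get(i), new HashSet<>(caller.succ(actuals.get(i))));
  // 2. Extend mapping: match callee outside edges against caller inside
  //    edges, to a fixed point (unidirectional, callee to caller).
  boolean changed = true;
  while (changed) {
    changed = false;
    for (PEdge out : callee.outside)
      for (PNode callerFrom : new ArrayList<>(mu.getOrDefault(out.from, Collections.emptySet())))
        for (PEdge in : caller.inside)
          if (in.from == callerFrom && in.field.equals(out.field))
            changed |= mu.computeIfAbsent(out.to, k -> new HashSet<>()).add(in.to);
  }
  // 3. Combine: project callee inside edges into the caller through the
  //    mapping; unmapped nodes (e.g. callee inside nodes) map to themselves.
  for (PEdge in : callee.inside)
    for (PNode f : mu.getOrDefault(in.from, Collections.singleton(in.from)))
      for (PNode t : mu.getOrDefault(in.to, Collections.singleton(in.to)))
        caller.inside.add(new PEdge(f, in.field, t));
}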

Interthread Analysis

• Augment Analysis Representation
  • Parallel Thread Set
  • Action Set (read, write, sync, create edge)
  • Action Ordering Information (relative to thread start actions)
• Thread Interaction Analysis
  • Combine points-to graphs
  • Induces combination of other information
• Can perform interthread analysis at any point to improve precision of results

Combining Points-to Graphs

[Figure sequence: the points-to escape graph sometime after the call to x.start() is combined with the graph from the analysis of run()]

• Initialize Mapping: Map Startee Thread to Starter Thread
• Extend Mapping: Match Inside and Outside Edges
  • Mapping is Bidirectional: From Startee to Starter, From Starter to Startee
• Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
• Combine Graphs: Project Edges Through Mappings Into Combined Graph
• Discard Startee Thread Node
• Discard Outside Edges From Captured Nodes

Life is not so Simple

• Dependences between phases
• Mapping best framed as a constraint satisfaction problem
• Solved using a constraint satisfaction algorithm

Interthread Analysis With Actions and Ordering

[Figure: a parallel interaction graph — a points-to graph (Task thread node a with work and dest edges, an Accumulator, a Vector, and related nodes b–e), the parallel thread set {a}, the action set {wr a, wr b, wr c, wr d, sync b, rd b}, and the action ordering "All actions happen before thread a starts executing"]

Analysis Result for generateTask

Analysis Result for run

[Figure: intrathread analysis results — a points-to graph with numbered nodes 1–6 (a Task node with work and dest edges, a Vector, an Accumulator, and an Enumeration), no parallel threads, the action set rd 1, rd 2, rd 3, rd 4, rd 5, wr 5, rd 6, wr 6, sync 2, sync 5, the outside-edge actions edge(1,2), edge(1,5), edge(2,3), edge(3,4), and no action ordering constraints]

Role of edge(1,2) Actions

• One edge action for each outside edge
• Action order for edge actions improves precision of interthread analysis
  • If starter thread reads a reference before startee thread is started
  • Then reference was not created by startee thread
• Outside edge actions record order
• Inside edges from startee matched only against parallel outside edges

Edge Actions in Combining Points-to Graphs

[Figure sequence: the starter's graph sometime after the call to x.start() (nodes 1, 2, 3) is combined with the startee's graph from the analysis of run(); the starter's action ordering records that edge(1,2) was created before thread 1 started, so inside edges from the startee need not be matched against it; the startee records no ordering]

Analysis Result After Interaction

[Figure: the combined parallel interaction graph — the same points-to graph (Task thread node a, Accumulator, Vector, nodes b–e) and the current thread's actions (wr a, wr b, wr c, wr d, sync b, rd b) ordered before thread a starts executing, now joined by thread a's own actions rd a, rd b, rd c, rd d, rd e, wr e, sync b, sync e, each recorded against parallel thread a]

Roles of Intrathread and Interthread Analyses

• Basic Analysis
  • Intrathread analysis delivers parallel interaction graph at each program point
    • records parallel threads
    • does not compute thread interaction
  • Choose program point (end of method)
  • Interthread analysis delivers additional precision at that program point
• Does not exploit ordering information from thread join constructs

Join Ordering

t = new Task();
t.start();
  “computation that runs in parallel with task t”
t.join();
  “computation that runs after task t”

t.run();
  “computation from task t”

Exploiting Join Ordering

• At join point
  • Interthread analysis delivers new (more precise) parallel interaction graph
  • Intrathread analysis uses new graph
• No parallel interactions between
  • Thread
  • Computation after join

Extensions

• Partial program analysis
  • can analyze method independent of callers
  • can analyze method independent of methods it invokes
  • can incrementally analyze callees to improve precision
• Dial down precision to improve efficiency
• Demand-driven formulations

Key Ideas

• Explicitly represent potential interactions between analyzed and unanalyzed parts
  • Inside versus outside nodes and edges
  • Escaped versus captured nodes
  • Precisely bound ignorance
• Exploit ordering information
  • intrathread (flow sensitive)
  • interthread (starts, edge orders, joins)

Analysis Uses

Overheads in Standard Execution and How to Eliminate Them

Intrathread Analysis Result from End of run Method

[Figure: the points-to graph for run with its Task, Vector, Accumulator, and Enumeration nodes]

• Enumeration object is captured
  • Does not escape to caller
  • Does not escape to parallel threads
• Lifetime of Enumeration object is bounded by lifetime of run
• Can allocate Enumeration object on call stack instead of heap

Interthread Analysis Result from End of generateTask Method

[Figure: the combined parallel interaction graph from the end of generateTask — the points-to graph with the started Task thread node and the shared Vector and Accumulator, the actions of the current thread and of the started thread, and the ordering "All actions from current thread happen before thread a starts executing"]

• Vector object is captured
• Multiple threads synchronize on Vector object
• But synchronizations from different threads do not occur concurrently
• Can eliminate synchronization on Vector object

Interthread Analysis Result from End of generateTask Method

[Figure: the same combined parallel interaction graph as above]

• Vectors, Tasks, Integers captured
• Parent, child access objects
• Parent completes accesses before child starts accesses
• Can allocate objects on child’s per-thread heap

Thread Overhead

• Inefficient Thread Implementations
  • Thread Creation Overhead
  • Thread Management Overhead
  • Stack Overhead
• Use a more efficient thread implementation
  • User-level thread management
  • Per-thread heaps
  • Event-driven form

Standard Thread Implementation

[Figure: each thread has its own stack of call frames (return address, frame pointer, locals); on a context switch the thread's state is saved on its stack in a save area before another thread is resumed]

• Call frames allocated on stack
• Context Switch
  • Save state on stack
  • Resume another thread
• One stack per thread

Event-Driven Form

[Figure: call frames on the processor's stack versus heap-allocated continuation objects holding the live variables and a resume method]

• Call frames allocated on stack
• Context Switch
  • Build continuation on heap
  • Copy out live variables
  • Return out of computation
  • Resume another continuation
• One stack per processor
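A hand-written illustration of the event-driven form on the running example (the compiler performs this transformation automatically; Scheduler, Continuation, and the stand-in for a blocking read are illustrative names, not the actual runtime's API): the loop's live variables are copied into a heap-allocated continuation, the method returns, and the scheduler later calls the continuation's resume method.

import java.util.ArrayDeque;
import java.util.Deque;

interface Continuation { void resume(); }

class Scheduler {
  private final Deque<Continuation> ready = new ArrayDeque<>();
  void enqueue(Continuation c) { ready.add(c); }
  void run() { while (!ready.isEmpty()) ready.pop().resume(); }
}

class SumTaskEventDriven {
  // Thread-style original:
  //   int sum = 0;
  //   for (int i = 0; i < data.length; i++) sum += read(i);  // read may block
  //   dest.add(sum);
  private final int[] data;
  private final Scheduler sched;
  SumTaskEventDriven(int[] data, Scheduler sched) { this.data = data; this.sched = sched; }

  void start() { step(0, 0); }

  // Event-driven form: the live variables (i, sum) travel in the continuation.
  private void step(int i, int sum) {
    if (i == data.length) { System.out.println("sum = " + sum); return; }
    int value = data[i];                        // stands in for a possibly blocking read
    final int nextI = i + 1, nextSum = sum + value;
    // "Context switch": build the continuation on the heap, return out of
    // the computation, and let the scheduler resume it later.
    sched.enqueue(() -> step(nextI, nextSum));
  }

  public static void main(String[] args) {
    Scheduler s = new Scheduler();
    new SumTaskEventDriven(new int[] {9, 8, 1, 5, 3, 7, 2, 6}, s).start();
    s.run();                                     // prints sum = 41
  }
}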

Complications

• Standard thread models use blocking I/O
  • Automatically convert blocking I/O to asynchronous I/O
  • Scheduler manages interleaving of thread executions
• Stack Allocatable Objects May Be Live Across Blocking Calls
  • Transfer allocation to per-thread heap

Opportunity

• On a uniprocessor, compiler controls placement of context switch points

• If program does not hold lock across blocking call, can eliminate lock

Experimental Results

• MIT Flex Compiler System
  • Static Compiler
  • Native code for StrongARM
• Server Benchmarks
  • http, phone, echo, time
• Scientific Computing Benchmarks
  • water, barnes

Server Benchmark Characteristics

          IR Size    Number of   Pre Analysis   Intrathread Analysis   Interthread Analysis
          (instrs)   Methods     Time (secs)    Time (secs)            Time (secs)
echo        4,639      131            28                 74                     73
time        4,573      136            29                 70                     74
http       10,643      292           103                199                    269
phone       9,547      267            75                191                    256

Percentage of Eliminated Synchronization Operations

[Bar chart: percentage of synchronization operations eliminated (0–100%) for http, phone, time, echo, and mtrt, comparing intrathread-only analysis against interthread analysis]

Compilation Options for Performance Results

• Standard
  • kernel threads, synch included
• Event-Driven
  • event-driven, no synch at all
• +Per-Thread Heap
  • event-driven, no synch at all, per-thread heap allocation

Throughput (Responses per Second)

[Bar chart: throughput in responses per second (0–400) for echo, time, http 2K, http 20K, and phone under the Standard, Event-Driven, and +Per-Thread Heap configurations]

Scientific Benchmark Characteristics

          IR Size    Number of   Pre Analysis   Total Analysis
          (instrs)   Methods     Time (secs)    Time (secs)
water      25,583      335           380             1156
barnes     19,764      364           129              491

Compiler Options

0: Sequential C++
1: Baseline - Kernel Threads
2: Lightweight Threads
3: Lightweight Threads + Stack Allocation
4: Lightweight Threads + Stack Allocation - Synchronization

Execution Times

[Bar chart: execution time as a proportion of sequential C++ execution time (0–1) for water, small water, and barnes under the Baseline, +Light, +Stack, and -Synch configurations]

Related Work

• Pointer Analysis for Sequential Programs
  • Chatterjee, Ryder, Landi (POPL 99)
  • Sathyanathan & Lam (LCPC 96)
  • Steensgaard (POPL 96)
  • Wilson & Lam (PLDI 95)
  • Emami, Ghiya, Hendren (PLDI 94)
  • Choi, Burke, Carini (POPL 93)

Related Work

• Pointer Analysis for Multithreaded Programs
  • Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
  • We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
• Escape Analysis
  • Blanchet (POPL 98)
  • Deutsch (POPL 90, POPL 97)
  • Park & Goldberg (PLDI 92)

Related Work

• Synchronization Optimizations
  • Diniz & Rinard (LCPC 96, POPL 97)
  • Plevyak, Zhang, Chien (POPL 95)
  • Aldrich, Chambers, Sirer, Eggers (SAS 99)
  • Blanchet (OOPSLA 99)
  • Bogda, Hoelzle (OOPSLA 99)
  • Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
  • Ruf (PLDI 00)

Conclusion

• New Analysis Algorithm
  • Flow-sensitive, compositional
  • Multithreaded programs
  • Explicitly represent interactions between analyzed and unanalyzed parts
• Analysis Uses
  • Synchronization elimination
  • Stack allocation
  • Per-thread heap allocation
• Lightweight Threads
