SCHEDULING STREAMING COMPUTATIONS
Kunal Agrawal


Page 1: Scheduling Streaming Computations

SCHEDULING STREAMING COMPUTATIONS
Kunal Agrawal

Page 2: Scheduling Streaming Computations


THE STREAMING MODEL

Computation is represented by a directed graph:
Nodes: computation modules.
Edges: FIFO channels between nodes.
Infinite input stream. We only consider acyclic graphs (dags).

When modules fire, they consume data from incoming channels and produce data on outgoing channels.
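As a concrete illustration of this model, here is a minimal Python sketch of modules connected by FIFO channels; the class names, rates, and placeholder work function are illustrative, not from the talk.

```python
from collections import deque

# Minimal sketch of the streaming model: modules joined by FIFO channels.
# A module may fire once enough items are queued on its incoming channel;
# firing consumes items and produces items downstream.

class Module:
    def __init__(self, name, in_rate, out_rate):
        self.name = name
        self.in_rate = in_rate        # items consumed per firing, i(u,v)
        self.out_rate = out_rate      # items produced per firing, o(v,w)
        self.inbox = deque()          # incoming FIFO channel
        self.downstream = None        # next module, or None for a sink

    def can_fire(self):
        return len(self.inbox) >= self.in_rate

    def fire(self):
        consumed = [self.inbox.popleft() for _ in range(self.in_rate)]
        if self.downstream is not None:       # a sink just absorbs items
            for _ in range(self.out_rate):
                self.downstream.inbox.append(sum(consumed))  # placeholder work

# A two-stage pipeline: a consumes 1 item and produces 2; b consumes 4.
a, b = Module("a", 1, 2), Module("b", 4, 0)
a.downstream = b
for item in range(4):                 # infinite input stream, truncated here
    a.inbox.append(item)
    a.fire()
while b.can_fire():
    b.fire()
```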

Page 3: Scheduling Streaming Computations

CACHE-CONSCIOUS SCHEDULING OF STREAMING APPLICATIONS
with Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo

GOAL: Schedule the computation to minimize the number of cache misses on a sequential machine.

Page 4: Scheduling Streaming Computations

DISK ACCESS MODEL

The cache has M/B blocks, each of size B. Cost = number of cache misses. If the CPU accesses data in cache, the cost is 0. If the CPU accesses data not in cache, there is a cache miss of cost 1, and the block containing the requested data is read into cache. If the cache is full, some block is evicted from cache to make room for new blocks.

[Figure: the CPU reads from a cache of M/B blocks, each of size B, backed by slow memory.]
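A sketch of counting cost in this model, using LRU eviction as a stand-in since the slide leaves the eviction policy abstract (the function and parameter names are mine):

```python
from collections import OrderedDict

# Accessing a block already in cache costs 0; a miss costs 1 and loads
# the block, evicting the least-recently-used block if the cache is full.

def count_misses(accesses, M, B):
    cache = OrderedDict()               # holds up to M/B blocks
    capacity = M // B
    misses = 0
    for addr in accesses:
        block = addr // B               # block containing the requested data
        if block in cache:
            cache.move_to_end(block)    # hit: cost 0, refresh recency
        else:
            misses += 1                 # miss: cost 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least-recently-used block
            cache[block] = True
    return misses

print(count_misses([0, 1, 2, 0, 1, 2], M=4, B=2))  # 3 misses: blocks 0 and 1 reused
```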

Page 5: Scheduling Streaming Computations

CONTRIBUTIONS

The problem of minimizing cache misses is reduced to a problem of graph partitioning.

THEOREM: If the optimal algorithm has X cache misses given a cache of size M, there exists a partitioned schedule that incurs O(X) cache misses given a cache of size O(M).

In other words, some partitioned schedule is O(1) competitive given O(1) memory augmentation.

Page 6: Scheduling Streaming Computations

OUTLINE

Cache-Conscious Scheduling: Streaming Application Model; The Sources of Cache Misses and the Intuition Behind Partitioning; Proof Intuition; Thoughts.

Deadlock Avoidance: Model and Source of Deadlocks; Deadlock Avoidance Using Dummy Items; Thoughts.

Page 7: Scheduling Streaming Computations

STREAMING APPLICATIONS

When a module v fires, it must load its state of size s(v), consume i(u,v) items from each incoming edge (u,v), and produce o(v,w) items on each outgoing edge (v,w).

[Figure: an example dag with modules a, b, c, and d; state sizes s of 60, 20, 40, and 35; and input rates i and output rates o labeled on each channel.]

ASSUMPTIONS: All items are unit-sized. The source consumes 1 item each time it fires. Input/output rates and state sizes are known. The state size of each module is at most M.

Page 8: Scheduling Streaming Computations

DEFINITION: GAIN

VERTEX GAIN: the number of firings of vertex u per source firing:
$\mathrm{gain}(u) = \prod_{(x,y) \in p} o(x,y)/i(x,y)$, where p is a path from the source s to u.

EDGE GAIN: the number of items produced along the edge (u,v) per source firing:
$\mathrm{gain}(u,v) = \mathrm{gain}(u) \cdot o(u,v)$.

A graph is well-formed iff all gains are well-defined (every path from s to a vertex yields the same product).

[Figure: the example dag again, with vertex gains annotated, e.g. gains of 1, 1, and 1/2.]
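The two definitions translate directly into code. A sketch using exact rational arithmetic (the graph encoding and function names are mine; well-formedness is assumed, i.e. every path to a vertex yields the same product):

```python
from fractions import Fraction

def vertex_gains(edges, source):
    # edges: {(u, v): (i_rate, o_rate)} -- i(u,v) and o(u,v) per edge
    gain = {source: Fraction(1)}
    changed = True
    while changed:                    # simple relaxation; fine for small dags
        changed = False
        for (u, v), (i_rate, o_rate) in edges.items():
            if u in gain and v not in gain:
                gain[v] = gain[u] * o_rate / i_rate
                changed = True
    return gain

def edge_gain(edges, gain, u, v):
    i_rate, o_rate = edges[(u, v)]
    return gain[u] * o_rate           # gain(u,v) = gain(u) * o(u,v)

edges = {("s", "a"): (1, 2), ("a", "b"): (4, 1)}
g = vertex_gains(edges, "s")          # g["a"] == 2, g["b"] == Fraction(1, 2)
print(g, edge_gain(edges, g, "a", "b"))
```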

Page 9: Scheduling Streaming Computations

OUTLINE

Cache-Conscious Scheduling: Streaming Application Model; The Sources of Cache Misses and the Intuition Behind Partitioning; Proof Intuition; Thoughts.

Deadlock Avoidance: Model and Source of Deadlocks; Deadlock Avoidance Using Dummy Items; Thoughts.

Page 10: Scheduling Streaming Computations

CACHE MISSES DUE TO STATE LOAD

STRATEGY: Push items through.

[Figure: the example pipeline (state sizes 60, 20, 40, 35) between the cache and slow memory; B = 1, M = 100.]

COST PER INPUT ITEM: the sum of the state sizes, $\Omega\left(\sum_u s(u)\right)$.

IDEA: Reuse the state once loaded.

Page 11: Scheduling Streaming Computations

CACHE MISSES DUE TO DATA ITEMS

[Figure: the same pipeline (state sizes 60, 20, 40, 35); B = 1, M = 100.]

STRATEGY: Once a module is loaded, execute it many times by adding large buffers between modules.

COST PER INPUT ITEM: the total number of items produced on all channels per input item, $\Theta\left(\sum_{(u,v)} \mathrm{gain}(u,v)\right)$.

Page 12: Scheduling Streaming Computations

PARTITIONING: REDUCE CACHE MISSES

[Figure: the same pipeline, split into cached segments; B = 1, M = 100.]

STRATEGY: Partition into segments that fit in cache, and only add buffers on the cross edges C, the edges that go between partitions.

COST PER INPUT ITEM: $\Theta\left(\sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$.

Page 13: Scheduling Streaming Computations

WHICH PARTITION?

[Figure: the same pipeline; B = 1, M = 100.]

STRATEGY: Partition into segments that fit in cache, and only add buffers on the cross edges C.

LESSON: Cut small-gain edges.

COST PER INPUT ITEM: $\Theta\left(\sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$.
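Since the cost per item is just the summed gain of the cut edges, comparing candidate partitions is a one-line computation. A sketch, with hypothetical gains:

```python
# A partition of a pipeline is determined by its set of cut edges C;
# the (asymptotic) cost per input item is the sum of the cut edges' gains.

def partition_cost(edge_gains, cut_edges):
    return sum(edge_gains[e] for e in cut_edges)

edge_gains = {"ab": 2.0, "bc": 0.5, "cd": 0.5}
print(partition_cost(edge_gains, {"ab"}))        # 2.0: cutting one large-gain edge
print(partition_cost(edge_gains, {"bc", "cd"}))  # 1.0: two small-gain cuts are cheaper
```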

Page 14: Scheduling Streaming Computations

OUTLINE

Cache-Conscious Scheduling: Streaming Application Model; The Sources of Cache Misses and the Intuition Behind Partitioning; Proof Intuition; Thoughts.

Deadlock Avoidance: Model and Source of Deadlocks; Deadlock Avoidance Using Dummy Items; Thoughts.

Page 15: Scheduling Streaming Computations

IS PARTITIONING GOOD?

Show that the optimal scheduler cannot do much better than the best partitioned scheduler.

THEOREM: On processing T items, if the optimal algorithm given an M-sized cache has X cache misses, then some partitioning algorithm given an O(M) cache has at most O(X) cache misses.

The number of cache misses due to a partitioned scheduler is $\Theta\left(\sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$, so the best partitioned scheduler should minimize $\sum_{(u,v)\in C} \mathrm{gain}(u,v)$. We must prove the matching lower bound on the optimal scheduler's cache misses.

Page 16: Scheduling Streaming Computations

OPTIMAL SCHEDULER WITH CACHE M

S: a segment with state size at least 2M.
e = gm(S): the edge with the minimum gain within S. Let e = (u, v).

Suppose u fires X times.

CASE 1: At least 1 item produced by u is processed by v. Cost $= \Omega(M)$.

CASE 2: All items are buffered within S. The cheapest place to buffer is at e. Cost $= \Omega(X \cdot \mathrm{gain}(e)/\mathrm{gain}(u))$. If $X = 2M \cdot \mathrm{gain}(u)/\mathrm{gain}(e)$, Cost $= \Omega(M)$.

In both cases, the cost per firing of u is $\Omega(\mathrm{gain}(e)/\mathrm{gain}(u))$.

Page 17: Scheduling Streaming Computations

LOWER BOUND

Divide the pipeline into segments $S_1, \dots, S_k$ of size between 2M and 3M, with minimum-gain edges $e_1, \dots, e_k$. The source node fires T times. Consider the optimal scheduler with an M cache.

Number of firings of $u_i$: $T \cdot \mathrm{gain}(u_i)$.
Cost due to $S_i$ per firing of $u_i$: $\Omega(\mathrm{gain}(e_i)/\mathrm{gain}(u_i))$.
Total cost due to $S_i$: $\Omega(T \cdot \mathrm{gain}(e_i))$.
Total cost over all segments: $\Omega\left(T \sum_i \mathrm{gain}(e_i)\right)$.
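Spelling out how the rows combine (firings times per-firing cost, then summed over segments):

```latex
% Per segment S_i: (firings of u_i) x (cost per firing of u_i)
T\,\mathrm{gain}(u_i) \cdot \Omega\!\left(\frac{\mathrm{gain}(e_i)}{\mathrm{gain}(u_i)}\right)
  = \Omega\!\left(T\,\mathrm{gain}(e_i)\right)
% Summing over all k segments:
\sum_{i=1}^{k} \Omega\!\left(T\,\mathrm{gain}(e_i)\right)
  = \Omega\!\left(T \sum_{i=1}^{k} \mathrm{gain}(e_i)\right)
```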

Page 18: Scheduling Streaming Computations

MATCHING UPPER BOUND

Divide the pipeline into segments of size between 2M and 3M, as before. The source node fires T times, so the cost of the optimal scheduler with an M cache is $\Omega\left(T \sum_i \mathrm{gain}(e_i)\right)$.

Consider the partitioned schedule that cuts all $e_i$. Each segment has size at most 6M. The total cost of that schedule is $O\left(T \sum_i \mathrm{gain}(e_i)\right)$.

Therefore, given constant factor memory augmentation, this partitioned schedule is constant-competitive in the number of cache misses.

Page 19: Scheduling Streaming Computations

GENERALIZATION TO DAG

Say we partition a DAG such that:
Each component has size at most O(M).
When contracted, the components form a dag.

Page 20: Scheduling Streaming Computations

GENERALIZATION TO DAG

Say we partition a DAG such that:
Each component has size at most O(M).
When contracted, the components form a dag.
If C is the set of cross edges, $\sum_{(u,v)\in C} \mathrm{gain}(u,v)$ is minimized over all such partitions.

The optimal schedule has cost/item $\Omega\left(\sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$.

Given constant factor memory augmentation, a partitioned schedule has cost/item $O\left(\sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$.

Page 21: Scheduling Streaming Computations

WHEN B ≠ 1

LOWER BOUND: The optimal algorithm has cost $\Omega\left(\frac{1}{B} \sum_{(u,v)\in C} \mathrm{gain}(u,v)\right)$.

UPPER BOUND: With constant factor memory augmentation:
Pipelines: the upper bound matches the lower bound.
DAGs: the upper bound matches the lower bound as long as each component of the partition has O(M/B) incident cross edges.

Page 22: Scheduling Streaming Computations

FINDING A GOOD PARTITION

For pipelines, we can find a good-enough partition greedily and the best partition using dynamic programming.

For general DAGs, finding the best partition is NP-complete.

Our proof is approximation-preserving: an approximation algorithm for the partitioning problem will work for our problem.
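The slides do not spell out the dynamic program, so the recurrence below is one plausible reconstruction for pipelines, not the paper's algorithm: best[j] is the minimum total gain of cut edges over the first j modules, with every segment's state required to fit in a cache of size M.

```python
import math

def best_pipeline_partition(state, edge_gain, M):
    # state[k]: state size of module k; edge_gain[k]: gain of edge (k, k+1)
    n = len(state)
    best = [math.inf] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        seg_state = 0
        # Try making modules i..j-1 the last segment, growing it leftward
        # while its total state still fits in the cache.
        for i in range(j - 1, -1, -1):
            seg_state += state[i]
            if seg_state > M:
                break
            cut = edge_gain[i - 1] if i > 0 else 0.0  # gain of the edge cut before i
            best[j] = min(best[j], best[i] + cut)
    return best[n]

# Hypothetical numbers: states 60, 20, 40, 35; edge gains 2, 1, 0.5; M = 100.
print(best_pipeline_partition([60, 20, 40, 35], [2.0, 1.0, 0.5], 100))  # 1.0
```

On this example the cheapest feasible partition cuts the middle edge of gain 1, grouping the first two and last two modules.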

Page 23: Scheduling Streaming Computations

CONCLUSIONS AND FUTURE WORK

We can reduce the problem of minimizing cache misses to the problem of calculating the best partition.

Solving the partitioning problem: Approximation algorithms. Exact solution for special cases such as SP-DAGs.

Space bounds: Bound the buffer sizes on cross edges.

Cache-conscious scheduling for multicores.

Page 24: Scheduling Streaming Computations

DEADLOCK AVOIDANCE FOR STREAMING COMPUTATIONS WITH FILTERING
with Peng Li, Jeremy Buhler, and Roger D. Chamberlain

GOAL: Devise mechanisms to avoid deadlocks on applications with filtering and finite buffers.

Page 25: Scheduling Streaming Computations

OUTLINE

Cache-Conscious Scheduling: Streaming Application Model; The Sources of Cache Misses and the Intuition Behind Partitioning; Proof Intuition; Thoughts.

Deadlock Avoidance: Model and Source of Deadlocks; Deadlock Avoidance Using Dummy Items; Thoughts.

Page 26: Scheduling Streaming Computations

FILTERING APPLICATIONS MODEL

Data-dependent filtering: the number of items produced depends on the data.

When a node fires, it has a compute index (CI), which monotonically increases; it consumes/produces 0 or 1 items from its input/output channels, and input/output items must have index = CI.

A node cannot proceed until it is sure that it has received all items of its current CI.

Channels can have unbounded delays.

[Figure: nodes passing index-stamped items along channels; the legend marks "an item with index 1" and each node's current compute index.]
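A tiny sketch of a single filtering node under this model (names and the predicate are illustrative): the compute index advances monotonically, and each index yields either one index-stamped output item or none.

```python
def run_filter_node(inputs, keep):
    """inputs: items as (index, value), in increasing index order.
    keep: predicate deciding whether an item survives filtering."""
    outputs = []
    for ci, value in inputs:             # CI advances monotonically
        if keep(value):
            outputs.append((ci, value))  # 1 item produced with index = CI
        # else: 0 items produced at this CI -- the item is filtered
    return outputs

print(run_filter_node([(1, "A"), (2, "B"), (3, "C")], lambda v: v != "B"))
# -> [(1, 'A'), (3, 'C')]: the item at index 2 was filtered out
```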

Page 27: Scheduling Streaming Computations

A DEADLOCK DEMO

Filtering can cause deadlocks due to finite buffers.

[Figure: four nodes u, v, w, x exchanging index-stamped items; two channels are full and two are empty, so no node can fire.]

A deadlock example (channel buffer size is 3).

Page 28: Scheduling Streaming Computations

CONTRIBUTIONS

Deadlock avoidance mechanism using dummy (or heartbeat) messages sent at regular intervals. Provably correct: guarantees deadlock freedom. No global synchronization. No dynamic buffer resizing.

Efficient algorithms to compute dummy intervals for structured DAGs such as series-parallel DAGs and CS4 DAGs.

Page 29: Scheduling Streaming Computations

OUTLINE

Cache-Conscious Scheduling: Streaming Application Model; The Sources of Cache Misses and the Intuition Behind Partitioning; Proof Intuition; Thoughts.

Deadlock Avoidance: Model and Source of Deadlocks; Deadlock Avoidance Using Dummy Items; Thoughts.

Page 30: Scheduling Streaming Computations

THE NAÏVE ALGORITHM

Filtering Theorem: If no node ever filters any token, then the system cannot deadlock.

The Naïve Algorithm sends a dummy on every filtered item, changing a filtering system into a non-filtering system.

[Figure: a node filtering its input; the legend marks "a token with index 1" and "a dummy with index 1".]
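The sending rule is a one-liner; a sketch with illustrative names:

```python
# Naive rule: every filtered index gets a dummy, so each compute index
# yields exactly one message and downstream nodes always learn the index
# has passed.

def naive_send(ci, produced_item, send):
    if produced_item is not None:
        send(("token", ci, produced_item))
    else:
        send(("dummy", ci))           # one dummy per filtered item

log = []
for ci, item in [(1, "A"), (2, None), (3, "C")]:
    naive_send(ci, item, log.append)
print(log)   # [('token', 1, 'A'), ('dummy', 2), ('token', 3, 'C')]
```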

Page 31: Scheduling Streaming Computations

COMMENTS ON THE NAÏVE ALGORITHM

Pros: Easy to schedule dummy items.

Cons: Doesn't utilize channel buffer sizes. Sends many unnecessary dummy items, wasting both computation and bandwidth.

Next step: reduce the number of dummy items.

Page 32: Scheduling Streaming Computations

THE PROPAGATION ALGORITHM

Computes a static dummy schedule. Sends dummies periodically, based on dummy intervals. Dummy items must be propagated to all downstream nodes.

[Figure: the four-node example (u, v, w, x) with each channel annotated by its buffer size and dummy interval, e.g. buffer 3 with interval 8, buffer 4 with interval 6, and intervals of ∞ on the remaining channels.]

Example: the node's compute index is 6 and the index of the last dummy is 0; since 6 − 0 >= 6, it sends a dummy.
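A sketch of the per-channel sending rule (class and field names are mine); a real implementation would also forward received dummies to all downstream nodes, which this sketch only notes in a comment:

```python
# Each output channel has a precomputed dummy interval; whenever the
# node's compute index has advanced by at least that interval since the
# channel's last dummy, a dummy is sent.

class PropChannel:
    def __init__(self, dummy_interval):
        self.dummy_interval = dummy_interval
        self.last_dummy = 0           # index of the last dummy sent

    def maybe_send_dummy(self, ci, send):
        if ci - self.last_dummy >= self.dummy_interval:
            send(("dummy", ci))       # receivers must propagate this downstream
            self.last_dummy = ci

log = []
ch = PropChannel(dummy_interval=6)
for ci in range(1, 13):
    ch.maybe_send_dummy(ci, log.append)
print(log)   # [('dummy', 6), ('dummy', 12)], matching the 6 - 0 >= 6 example
```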

Page 33: Scheduling Streaming Computations

COMMENTS ON THE PROPAGATION ALGORITHM

Pros: Takes advantage of channel buffer sizes. Greatly reduces the number of dummy items compared to the Naïve Algorithm.

Cons: Does not utilize filtering history. Dummy items must be propagated.

Next step: eliminate propagation. Use shorter dummy intervals, and use filtering history for dummy scheduling.

Page 34: Scheduling Streaming Computations

THE NON-PROPAGATION ALGORITHM

Send dummy items based on filtering history; dummy items do not propagate. If (index of filtered item − index of previous token/dummy) >= dummy interval, send a dummy.

[Figure: the same four-node example, now with shorter dummy intervals: buffer 3 with interval 4, and buffer 4 with interval 3.]

Example: data is filtered; the current index is 3 and the index of the last token/dummy is 0; since 3 − 0 >= 3, the node sends a dummy.
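A sketch of this rule (illustrative names), reproducing the slide's example where a dummy interval of 3 triggers a dummy at index 3:

```python
# Non-propagation rule: when an item is filtered on an output channel,
# send a dummy only if the gap since the last token or dummy on that
# channel has reached the channel's dummy interval.

class OutChannel:
    def __init__(self, dummy_interval):
        self.dummy_interval = dummy_interval
        self.last_sent_index = 0      # index of the previous token/dummy
        self.sent = []                # (index, kind) pairs, for illustration

    def emit(self, index):
        # a real data item is produced at this compute index
        self.sent.append((index, "token"))
        self.last_sent_index = index

    def filtered(self, index):
        # the item at this compute index was filtered out
        if index - self.last_sent_index >= self.dummy_interval:
            self.sent.append((index, "dummy"))
            self.last_sent_index = index

ch = OutChannel(dummy_interval=3)
ch.emit(0)
for ci in (1, 2, 3, 4, 5, 6):
    ch.filtered(ci)
print(ch.sent)   # [(0, 'token'), (3, 'dummy'), (6, 'dummy')]
```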

Page 35: Scheduling Streaming Computations

COMPARISON OF THE ALGORITHMS

Performance measure: the number of dummies sent; fewer dummies are better. The Non-Propagation Algorithm is expected to be the best in most cases.

Experimental data: Mercury BLASTN (a biological application), with 787 billion input elements.

[Chart: number of dummies sent by the Naïve, Propagation, and Non-Propagation algorithms for channel buffer sizes 32, 256, and 2048, on a log-scale axis.]

Page 36: Scheduling Streaming Computations

HOW DO WE COMPUTE THESE INTERVALS?

Exponential-time algorithms for general DAGs, since we have to enumerate cycles.

Can we do better for structured DAGs? Yes: polynomial-time algorithms for SP DAGs, and polynomial-time algorithms for CS4 DAGs, a class of DAGs where every undirected cycle has a single source and a single sink.

Page 37: Scheduling Streaming Computations

CONCLUSIONS AND FUTURE WORK

Designed efficient deadlock-avoidance algorithms using dummy messages.

Find polynomial algorithms to compute dummy intervals for general DAGs.

Consider general models: allowing multiple outputs from one input and feedback loops.

The reverse problem: computing efficient buffer sizes from dummy intervals.