Distributed Computing Paradigms: Bag of Tasks, Heartbeat Algorithms, Pipelining


Page 1: Distributed Computing Paradigms

Distributed Computing Paradigms

• Bag of Tasks
• Heartbeat Algorithms
• Pipelining

Page 2: Distributed Computing Paradigms

Bag of Tasks

• Basic idea
  – Server maintains a list (bag) of tasks to be performed
  – Clients send requests to the server, receive a task, and may then send additional tasks back to the server
• Main advantage
  – Automatic load balancing
    • When a client needs work, it gets some from the server

Page 3: Distributed Computing Paradigms

Bag of Tasks Pseudocode

struct task { field1; field2; …; fieldN }

process Manager {
  create initial task
  taskQ.enqueue(task)
  while (1) {
    in getTask(task) st taskQ.size > 0 ->
         task = taskQ.dequeue()              // task is a reference parameter
    [] putTask(task) ->
         taskQ.enqueue(task)
    [] result(val) ->
         incorporate val into final answer
    ni
  }
}

Page 4: Distributed Computing Paradigms

Bag of Tasks Pseudocode

process Worker {
  while (1) {
    call getTask(task)
    work on task
    if (this task does not need to be subdivided further)
      send result(val)                       // val is whatever the "answer" is
    else {                                   // send task or tasks back into the bag
      send putTask(task1)
      send putTask(task2)
      ...
      send putTask(taskN)
    }
  }
}
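
The pseudocode above uses rendezvous-style operations. As a hedged illustration only, here is a minimal shared-memory rendering of the same bag-of-tasks pattern in C with pthreads; the task fields, the splitting rule, and the leaf computation are hypothetical placeholders, and termination is handled by tracking how many workers still hold a task rather than by the deadlock-based termination discussed on the next slide.

    /* Minimal bag-of-tasks sketch with pthreads; task_t's fields and the
       leaf/split logic below are hypothetical placeholders. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct task { int lo, hi; } task_t;            /* placeholder fields        */
    typedef struct node { task_t t; struct node *next; } node_t;

    static node_t *bag = NULL;                              /* the bag, as a linked list */
    static int busy = 0;                                    /* workers holding a task    */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void put_task(task_t t) {                        /* putTask in the pseudocode */
        node_t *n = malloc(sizeof *n);
        n->t = t;
        pthread_mutex_lock(&m);
        n->next = bag; bag = n;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }

    static int get_task(task_t *t) {                        /* getTask; returns 0 when all work is done */
        pthread_mutex_lock(&m);
        while (bag == NULL && busy > 0)                     /* empty now, but a busy worker may add more */
            pthread_cond_wait(&c, &m);
        if (bag == NULL) {                                  /* empty and nobody busy: finished */
            pthread_cond_broadcast(&c);
            pthread_mutex_unlock(&m);
            return 0;
        }
        node_t *n = bag; bag = n->next; busy++;
        pthread_mutex_unlock(&m);
        *t = n->t; free(n);
        return 1;
    }

    static void task_done(void) {
        pthread_mutex_lock(&m);
        busy--;
        pthread_cond_broadcast(&c);                         /* wake waiters so they re-check termination */
        pthread_mutex_unlock(&m);
    }

    static void *worker(void *arg) {                        /* mirrors process Worker above */
        (void)arg;
        task_t t;
        while (get_task(&t)) {
            if (t.hi - t.lo <= 1) {
                /* leaf task: compute and record a result here */
            } else {                                        /* otherwise split it and put the pieces back */
                int mid = (t.lo + t.hi) / 2;
                put_task((task_t){ t.lo, mid });
                put_task((task_t){ mid,  t.hi });
            }
            task_done();
        }
        return NULL;
    }

A driver would enqueue the initial task with put_task(), start the workers with pthread_create(), and wait for them with pthread_join().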

Page 5: Distributed Computing Paradigms

Bag of Tasks

• A worker should keep one task for itself
  – When it finishes its task, it needs another
  – Relatively simple to restructure the code on the last two slides
• Want to avoid creating too many tasks
  – Just as in recursive parallelism
• Termination ends in deadlock
  – Some languages allow this (they detect the deadlock)
  – Otherwise, a more complicated design is needed to avoid it

Page 6: Distributed Computing Paradigms

Heartbeat Algorithms

• Useful for data-parallel, iterative applications
• Basic idea (a C skeleton is sketched below):

  initialize local variables
  while not done
    send values to neighbors
    receive values from neighbors
    perform local computations
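
A minimal C skeleton of this loop, for illustration only; the helper functions are hypothetical stand-ins for the application-specific parts, which the Jacobi example on the following slides fills in concretely.

    /* Heartbeat skeleton; the four helpers are hypothetical placeholders. */
    void init_local_data(void);
    void send_to_neighbors(void);
    void receive_from_neighbors(void);
    int  update_local(void);               /* returns nonzero once converged */

    void heartbeat_worker(void) {
        init_local_data();                 /* initialize local variables      */
        int done = 0;
        while (!done) {
            send_to_neighbors();           /* send boundary values out        */
            receive_from_neighbors();      /* pull neighbors' boundary values */
            done = update_local();         /* local computation + convergence test */
        }
    }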

Page 7: Distributed Computing Paradigms

Parallel Jacobi Iteration---Distributed-Memory Version

process worker [k = 0 to p-1] {
  double A[n][n], B[n][n]
  startrow = k * n / p; endrow = (k+1) * n / p - 1
  modify startrow and endrow to account for boundaries
  if (k == 0) {
    for j = 1 to p-1 {
      sr = j * n / p; er = (j+1) * n / p - 1   // need to adjust these also
      send B[sr:er][0:n-1] to process j
    }
  } else
    receive B[startrow:endrow][0:n-1] from process 0

Page 8: Distributed Computing Paradigms

Parallel Jacobi Iteration---Distributed-Memory Version

  while not done {
    send top row of B to process k-1; send bottom row of B to process k+1
    receive "top-1" row of B from process k-1; receive "bot+1" row from process k+1
    for i = startrow to endrow
      for j = 1 to n-2 {
        A[i][j] = (B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1]) / 4
        diff = fabs(A[i][j] - B[i][j])
        maxdiff = max(maxdiff, diff)
      }

Page 9: Distributed Computing Paradigms

Parallel Jacobi Iteration---Distributed-Memory Version

    if (k != 0) {
      send maxdiff to process 0
      receive done from process 0
    } else {
      for j = 1 to p-1
        receive maxdiff from process j
      done = (max(maxdiffs) < TOLERANCE)
      for j = 1 to p-1
        send done to process j
    }
  }   // end of while loop
}     // end of process statement
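
As a hedged sketch of how the exchange-and-update step above might look in practice, here is one iteration written with MPI in C. The names (A, B, startrow, endrow, k, p) follow the pseudocode; rows are stored contiguously in flat arrays, startrow and endrow are assumed to already reflect the boundary adjustment from two slides back (so startrow >= 1 and endrow <= n-2), MPI_PROC_NULL turns the edge exchanges into no-ops, and the convergence exchange with process 0 is omitted.

    /* One Jacobi iteration: halo exchange plus local update (MPI, C).
       A and B are n*n arrays stored row-major in flat buffers. */
    #include <mpi.h>
    #include <math.h>

    void jacobi_step(double *A, double *B, int n, int startrow, int endrow,
                     int k, int p, double *maxdiff) {
        int up   = (k == 0)     ? MPI_PROC_NULL : k - 1;   /* edge exchanges become no-ops */
        int down = (k == p - 1) ? MPI_PROC_NULL : k + 1;

        /* send my top row up, receive the "top-1" halo row from above */
        MPI_Sendrecv(&B[startrow * n],       n, MPI_DOUBLE, up,   0,
                     &B[(startrow - 1) * n], n, MPI_DOUBLE, up,   0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send my bottom row down, receive the "bot+1" halo row from below */
        MPI_Sendrecv(&B[endrow * n],         n, MPI_DOUBLE, down, 0,
                     &B[(endrow + 1) * n],   n, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        *maxdiff = 0.0;
        for (int i = startrow; i <= endrow; i++)
            for (int j = 1; j <= n - 2; j++) {
                A[i * n + j] = (B[(i - 1) * n + j] + B[(i + 1) * n + j] +
                                B[i * n + j - 1]   + B[i * n + j + 1]) / 4.0;
                double diff = fabs(A[i * n + j] - B[i * n + j]);
                if (diff > *maxdiff) *maxdiff = diff;
            }
    }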

Page 10: Distributed Computing Paradigms

Region Labeling

• Assume we are given an image containing some number of lit pixels

• Goal: find sets of connected lit pixels; give each set a unique label

Page 11: Distributed Computing Paradigms

Region Labeling

process worker [k = 0 to P-1] {
  double pixel[N][N], label[N][N]
  initialize pixel and label arrays in my strip
    (label elements are zero if pixel off and unique if pixel on)
  send/receive top/bottom rows of pixel array to/from neighbors
  while not done {
    send/receive top/bottom rows of label to/from neighbors
    update label array in my strip (see next slide)
    if any label[i][j] changed, send "yes" to coordinator, else send "no"
    receive whether to continue from coordinator
  }
}   // end of worker

Page 12: Distributed Computing Paradigms

Region Labeling

Updating the label array is done as follows:

  for each of my pixels
    if pixel[i][j] == 1
      for each (N/E/W/S) neighbor whose pixel is also 1
        if neighbor's label > my label[i][j]
          replace my label[i][j] with the neighbor's label

Note: this algorithm can take on the order of N^2 iterations if there is a "snake" in the image, since a label may advance by only one pixel per iteration
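
A small C sketch of one sweep of this update over a worker's rows (a hedged illustration, not taken from the slides); it uses int arrays for clarity, assumes label[i][j] == 0 for unlit pixels and a unique positive starting label for each lit pixel, and returns the per-sweep "changed" flag that the worker reports to the coordinator.

    /* One label-propagation sweep over rows startrow..endrow; returns 1 if
       any label changed. Boundary tests keep the sketch self-contained. */
    int update_labels(int n, int startrow, int endrow,
                      int pixel[n][n], int label[n][n]) {
        int changed = 0;
        for (int i = startrow; i <= endrow; i++)
            for (int j = 0; j < n; j++) {
                if (!pixel[i][j]) continue;            /* only lit pixels carry labels */
                int best = label[i][j];
                if (i > 0     && pixel[i-1][j] && label[i-1][j] > best) best = label[i-1][j];
                if (i < n - 1 && pixel[i+1][j] && label[i+1][j] > best) best = label[i+1][j];
                if (j > 0     && pixel[i][j-1] && label[i][j-1] > best) best = label[i][j-1];
                if (j < n - 1 && pixel[i][j+1] && label[i][j+1] > best) best = label[i][j+1];
                if (best != label[i][j]) { label[i][j] = best; changed = 1; }
            }
        return changed;
    }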

Page 13: Distributed Computing Paradigms

Region Labeling

process coordinator {
  for each iteration {
    collect a change flag from each process (yes or no)
    if at least one process says yes
      send "go on" to each process
    else
      send "done" to each process and exit
  }
}   // end of coordinator
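
A hedged MPI sketch of the coordinator's per-iteration exchange, assuming the coordinator runs on rank 0 and the P workers on ranks 1..P; the tag and message types are arbitrary choices, not taken from the slides.

    /* One round of the coordinator protocol: gather flags, broadcast the verdict. */
    #include <mpi.h>

    int coordinate_once(int P) {                 /* returns 1 if workers should go on */
        int go_on = 0;
        for (int j = 1; j <= P; j++) {
            int changed;
            MPI_Recv(&changed, 1, MPI_INT, j, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (changed) go_on = 1;              /* at least one strip still changing */
        }
        for (int j = 1; j <= P; j++)
            MPI_Send(&go_on, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
        return go_on;
    }

In MPI the same effect could be had with a single MPI_Allreduce of the change flags (logical OR) across all ranks, which removes the explicit coordinator; the version above simply mirrors the slide's structure.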

Page 14: Distributed Computing Paradigms

Optimizations

• Strip mining: original code

  for i = 1 to n {
    for j = 1 to n {
      A[i][j] = f(A[i-1][j])
    }
  }

(The j loop is parallel)

Problem: the j loop is likely too fine-grained to parallelize profitably

Page 15: Distributed Computing Paradigms

Optimizations

• Strip mining: could try loop interchange

  for j = 1 to n {
    for i = 1 to n {
      A[i][j] = f(A[i-1][j])
    }
  }

(Interchange the loops)

Problem: the j loop is now coarse-grained, but each iteration walks down a column of A, so parallelizing it has poor cache behavior

Page 16: Distributed Computing Paradigms

Optimizations

• Strip mining---first step

  for i = 1 to n {
    for k = 1 to n by blocksize {
      for j = k to k + blocksize - 1 {
        A[i][j] = f(A[i-1][j])
      }
    }
  }

(Add an extra loop)

Adding extra loops is generally not a good idea! What’s going on?

Page 17: Distributed Computing Paradigms

Optimizations

• Strip mining---second step

  for k = 1 to n by blocksize {
    for i = 1 to n {
      for j = k to k + blocksize - 1 {
        A[i][j] = f(A[i-1][j])
      }
    }
  }

(Interchange first two loops)

Still, we transformed two loops into 3. Again…why is this a good idea?
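
One way to see the payoff: the dependence A[i][j] = f(A[i-1][j]) runs only down a column, so after strip mining and interchange the iterations of the outer k loop (one per block of columns) are independent and coarse-grained. A hedged OpenMP sketch of that in C, with f() as a hypothetical per-element function (compile with -fopenmp):

    /* Strip-mined loop with the block loop parallelized; blocks of columns
       are independent, the i loop carries the dependence. */
    double f(double);                              /* hypothetical */

    void compute(int n, int blocksize, double A[n + 1][n + 1]) {
        #pragma omp parallel for
        for (int k = 1; k <= n; k += blocksize)    /* column blocks are independent    */
            for (int i = 1; i <= n; i++)           /* this loop carries the dependence */
                for (int j = k; j <= k + blocksize - 1 && j <= n; j++)   /* j <= n guards a ragged last block */
                    A[i][j] = f(A[i - 1][j]);
    }

Each parallel task now owns whole blocks of columns, and within a block the i loop still sweeps blocksize contiguous elements per row, which keeps the cache behavior of the original loop order. The same blocks become the unit of communication in the pipelined version on the following slides.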

Page 18: Distributed Computing Paradigms

Pipelining

  for i = 1 to n
    for j = 1 to n
      A[i][j] = f(A[i][j-1])

  for i = 1 to n
    for j = 1 to n
      A[i][j] = f(A[i-1][j])

How do we parallelize this program?

Page 19: Distributed Computing Paradigms

Options to Parallelize

• Transpose the array between the first and second loop nests (and again between the second and first!)
  – Advantage: full parallelization of both loops
  – Disadvantage: the transpose is slow (requires all-to-all communication)
• Sequentialize the second loop nest
  – Bad idea: requires data movement and is sequential
• Pipeline the second phase
  – Avoids data redistribution, but suffers pipeline latency and requires a larger number of messages

Page 20: Distributed Computing Paradigms

Pipelining Step 1: rewrite second loop nest via strip mining

  for i = start to end
    for j = 1 to n
      A[i][j] = f(A[i][j-1])

  for k = 1 to n by blocksize
    for i = 1 to n
      for j = k to k + blocksize - 1
        A[i][j] = f(A[i-1][j])

We transformed two loops into 3. Again, why is this a good idea?

Page 21: Distributed Computing Paradigms

Pipelining Step 2: partition middle loop

  for i = 1 to n
    for j = 1 to n
      A[i][j] = f(A[i][j-1])

  for k = 1 to n by blocksize
    for i = start to end
      for j = k to k + blocksize - 1
        A[i][j] = f(A[i-1][j])

We transformed two loops into 3. Again, why is this a good idea?

Page 22: Distributed Computing Paradigms

Pipelining Step 3: Insert Synchronization/Communication

• Code for process X

  for k = 1 to n by blocksize {
    receive predecessor subrow from process X-1
    for i = start to end {
      for j = k to k + blocksize - 1 {
        A[i][j] = f(A[i-1][j])
      }
    }
    send bottom subrow just computed to process X+1
  }

Assumes the target is a cluster and X is not a boundary process
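
A hedged MPI rendering of this loop in C for one non-boundary process; f() is a hypothetical per-element function, indices are 0-based, the rank owns rows start..end with the predecessor's boundary row stored at start-1, and the message tag simply identifies the column block.

    /* Pipelined second phase for a non-boundary rank (MPI, C). */
    #include <mpi.h>

    double f(double);                                    /* hypothetical */

    void pipelined_phase(int n, int blocksize, int start, int end,
                         int rank, double A[][n]) {
        for (int k = 0; k < n; k += blocksize) {
            int width = (k + blocksize <= n) ? blocksize : n - k;
            /* wait for the predecessor's bottom subrow of this column block */
            MPI_Recv(&A[start - 1][k], width, MPI_DOUBLE, rank - 1, k,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = start; i <= end; i++)
                for (int j = k; j < k + width; j++)
                    A[i][j] = f(A[i - 1][j]);
            /* pass my bottom subrow of this block on to the successor */
            MPI_Send(&A[end][k], width, MPI_DOUBLE, rank + 1, k,
                     MPI_COMM_WORLD);
        }
    }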

Page 23: Distributed Computing Paradigms

Picture of pipelining

Page 24: Distributed Computing Paradigms

Pipelining Step 3: Insert Synchronization/Communication

• Code for process X (initialize semaphore array s to {0, 0, …, 0})

  for k = 1 to n by blocksize {
    P(s[X-1])
    for i = start to end {
      for j = k to k + blocksize - 1 {
        A[i][j] = f(A[i-1][j])
      }
    }
    V(s[X])
  }

Assumes the target is a multicore machine and X is not a boundary process
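
A hedged sketch of this multicore variant using POSIX unnamed semaphores in C; NTHREADS, the row partition, and f() are assumptions, s[] corresponds to the semaphore array on the slide (each element initialized to 0 with sem_init before the threads start), and thread x is again assumed not to be a boundary thread.

    /* Pipelined second phase for one non-boundary thread (POSIX semaphores, C). */
    #include <semaphore.h>

    #define NTHREADS 4                                   /* assumed thread count */
    sem_t s[NTHREADS];                                   /* sem_init(&s[x], 0, 0) for every x */
    double f(double);                                    /* hypothetical */

    void pipelined_phase(int n, int blocksize, int start, int end,
                         int x, double A[][n]) {
        for (int k = 0; k < n; k += blocksize) {
            sem_wait(&s[x - 1]);                         /* P(s[X-1]): predecessor finished this block */
            for (int i = start; i <= end; i++)
                for (int j = k; j < k + blocksize && j < n; j++)
                    A[i][j] = f(A[i - 1][j]);
            sem_post(&s[x]);                             /* V(s[X]): successor may start this block    */
        }
    }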