
Page 1: Introduction to parallel algorithms

Introduction to parallel algorithms

Ashok Srinivasan
www.cs.fsu.edu/~asriniva

Florida State University

COT 5410 – Spring 2004

Page 2: Introduction to parallel algorithms

Outline

• Background
• Primitives
• Algorithms
• Important points

Page 3: Introduction to parallel algorithms

Background

• Terminology
  – Time complexity
  – Speedup
  – Efficiency
  – Scalability
• Communication cost model

Page 4: Introduction to parallel algorithms

Time complexity

• Parallel computation
  – A group of processors work together to solve a problem
  – Time required for the computation is the period from when the first processor starts working until when the last processor stops

[Figure: execution time lines for the sequential, bad parallel, ideal parallel, and realistic parallel cases]

Page 5: Introduction to parallel algorithms

Other terminology

• Speedup: S = T1/TP
• Efficiency: E = S/P
• Work: W = P TP
• Scalability
  – How does TP decrease as we increase P to solve the same problem?
  – How should the problem size increase with P, to keep E constant?

Notation
• P = Number of processors
• T1 = Time on one processor
• TP = Time on P processors
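To make the definitions concrete, here is a tiny worked example in Python; the timing values are invented for illustration, not taken from the slides.

```python
# Invented example timings: 8 processors, 64 s sequential, 10 s parallel.
P, T1, TP = 8, 64.0, 10.0
S = T1 / TP   # speedup S = 6.4
E = S / P     # efficiency E = 0.8
W = P * TP    # work W = 80.0, vs. 64.0 units of sequential work
print(S, E, W)
```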

Page 6: Introduction to parallel algorithms

Communication cost model

• Processes spend some time doing useful work, and some time communicating
• Model communication cost as
  – TC = ts + L tb
  – L = message size
  – Independent of location of processes
  – Any process can communicate with any other process
  – A process can simultaneously send and receive one message

Page 7: Introduction to parallel algorithms

I/O model

• We will ignore I/O issues, for the most part
• We will assume that input and output are distributed across the processors in a manner of our choosing
• Example: Sorting
  – Input: x1, x2, ..., xn
    • Initially, xi is on processor i
  – Output: xp1, xp2, ..., xpn
    • xpi on processor i
    • xpi ≤ xpi+1

Page 8: Introduction to parallel algorithms

Primitives

• Reduction
• Broadcast
• Gather/Scatter
• All gather
• Prefix

Page 9: Introduction to parallel algorithms

Reduction -- 1

Compute x1 + x2 + ... + xn

• One processor collects all the values and adds them sequentially: n-1 additions and n-1 messages
• Tn = n-1 + (n-1)(ts + tb)
• Sn = 1/(1 + ts + tb)

[Figure: x2, x3, x4, ..., xn all sent directly to the processor holding x1]

Page 10: Introduction to parallel algorithms

Reduction -- 2

• Run Reduction-1 in parallel on {x1, ..., xn/2} and on {xn/2+1, ..., xn}, then combine the two partial sums
• Tn = n/2-1 + (n/2-1)(ts + tb) + (ts + tb) + 1 = n/2 + n/2 (ts + tb)
• Sn ~ 2/(1 + ts + tb)

[Figure: Reduction-1 on the two halves, with partial results at x1 and xn/2+1]

Page 11: Introduction to parallel algorithms

Reduction -- 3

• Apply Reduction-2 recursively
  – Divide and conquer
• Tn ~ log2 n + (ts + tb) log2 n
• Sn ~ (n/log2 n) × 1/(1 + ts + tb)
• Note that any associative operator can be used in place of +

[Figure: recursion tree; Reduction-1 on the quarters {x1, ..., xn/4}, {xn/4+1, ..., xn/2}, {xn/2+1, ..., x3n/4}, {x3n/4+1, ..., xn}, combined pairwise at x1, xn/4+1, xn/2+1, x3n/4+1]
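A minimal Python sketch of this divide-and-conquer pattern, simulating the log2 n combining rounds on a list (the function name and structure are illustrative, not from the slides):

```python
import operator

def tree_reduce(values, op=operator.add):
    """Reduce `values` (length assumed a power of 2) in log2(n) rounds."""
    vals = list(values)
    step = 1
    while step < len(vals):
        # All combinations at a given step would run in parallel;
        # here they are simulated sequentially.
        for i in range(0, len(vals), 2 * step):
            vals[i] = op(vals[i], vals[i + step])
        step *= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The `op` parameter reflects the note above: any associative operator can replace +.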

Page 12: Introduction to parallel algorithms

Parallel addition features

• If n >> P
  – Each processor adds n/P distinct numbers
  – Perform parallel reduction on P numbers
  – TP ~ n/P + (1 + ts + tb) log P
  – Optimal P obtained by differentiating with respect to P
    • Popt ~ n/(1 + ts + tb)
    • If communication cost is high, then fewer processors ought to be used
  – E = [1 + (1 + ts + tb) P log P / n]^-1
    • As problem size increases, efficiency increases
    • As number of processors increases, efficiency decreases
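A quick numerical check of this cost model; the ts and tb values below are arbitrary illustrative constants:

```python
import math

def t_parallel(n, p, ts, tb):
    # T_P ~ n/P + (1 + ts + tb) log2 P, the model from the slide
    return n / p + (1 + ts + tb) * math.log2(p)

n, ts, tb = 10_000, 2.0, 1.0
best = min(range(1, n + 1), key=lambda p: t_parallel(n, p, ts, tb))
# The sweep minimum agrees with Popt ~ n/(1 + ts + tb) up to a constant factor.
print(best, n / (1 + ts + tb))
```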

Page 13: Introduction to parallel algorithms

Some common collective operations

• Broadcast: one process starts with A; afterwards, every process has A
• Gather: four processes hold A, B, C, D respectively; afterwards, the root holds A, B, C, D
• Scatter: the root holds A, B, C, D; afterwards, the four processes hold A, B, C, D respectively
• All gather: four processes hold A, B, C, D respectively; afterwards, every process holds A, B, C, D

Page 14: Introduction to parallel algorithms

Broadcast

• T ~ (ts + L tb) log P
  – L: length of data

[Figure: broadcast of x1 among processors x1, ..., x8; in each of the log P rounds, every processor that already holds the data forwards it to one more processor]
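A sketch of this doubling pattern, simulating which processors hold the data after each round (partner numbering via XOR is one common choice, not mandated by the slides):

```python
P = 8
has_data = [True] + [False] * (P - 1)   # processor 0 starts with the data
d = 1
while d < P:                            # log2(P) rounds
    for p in range(P):
        if has_data[p] and not has_data[p ^ d]:
            has_data[p ^ d] = True      # each holder sends to one partner
    d *= 2
print(all(has_data))                    # True after 3 rounds for P = 8
```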

Page 15: Introduction to parallel algorithms

Gather/Scatter

• Gather: data move towards the root
• Scatter: review question
• T ~ ts log P + P L tb

[Figure: gather tree on 8 processors; leaf messages of size L grow to 2L and then 4L towards the root, which ends with x18 = (x1, ..., x8)]

Note: Σ i=0..log P-1 of 2^i = (2^log P - 1)/(2 - 1) = P - 1 ~ P
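A small simulation of the gather tree; blocks accumulate towards processor 0 and message sizes double each round, matching the L, 2L, 4L annotations in the figure:

```python
P = 8
data = {p: [p] for p in range(P)}       # processor p owns block p
d = 1
while d < P:
    for p in list(data):
        if p % (2 * d) == d:            # this round's senders
            data[p - d] += data.pop(p)  # message of d blocks (size d*L)
    d *= 2
print(data)                             # {0: [0, 1, 2, 3, 4, 5, 6, 7]}
```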

Page 16: Introduction to parallel algorithms

All gather

• Equivalent to each processor broadcasting to all the processors

[Figure: 8 processors holding x1, ..., x8; adjacent pairs exchange messages of size L]

Page 17: Introduction to parallel algorithms

All gather

[Figure: after round 1, pairs hold x12, x34, x56, x78; pairs of pairs now exchange messages of size 2L]

Page 18: Introduction to parallel algorithms

All gather

[Figure: after round 2, groups of four hold x14 or x58; the two halves exchange messages of size 4L]

Page 19: Introduction to parallel algorithms

All gather

• Tn ~ ts log P + P L tb

[Figure: after the final round, every processor holds x18 = (x1, ..., x8)]
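A sketch of the recursive-doubling exchange drawn on the last four slides; in round d each processor swaps everything it has with partner p XOR d, so message sizes grow L, 2L, 4L:

```python
P = 8
data = [[p] for p in range(P)]          # processor p starts with one block
d = 1
while d < P:
    # All P exchanges in a round happen simultaneously in parallel.
    data = [data[p] + data[p ^ d] for p in range(P)]
    d *= 2
print(all(sorted(blk) == list(range(P)) for blk in data))  # True
```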

Page 20: Introduction to parallel algorithms

Review question: Pipelining

• Useful when repeatedly and regularly performing a large number of primitive operations
  – Optimal time for a broadcast = log P
    • But doing this n times takes n log P time
  – Pipelining the broadcasts takes n + P time
    • Almost constant amortized time per broadcast, if n >> P
    • n + P << n log P when n >> P
• Review question: How can you accomplish this time complexity? (A sketch of one possibility follows below.)
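One possible construction, sketched under the assumption of a chain of processors that forward each message one hop per step (message k leaves the root at step k):

```python
n, P = 100, 8
# Step at which message k reaches processor p in the pipeline.
arrival = [[k + p for p in range(P)] for k in range(n)]
last = max(max(row) for row in arrival)
print(last)  # (n - 1) + (P - 1) = 106 ~ n + P, vs. n*log2(P) = 300
```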

Page 21: Introduction to parallel algorithms

Sequential prefix

• Input
  – Values xi, 1 ≤ i ≤ n
• Output
  – Xi = x1 * x2 * ... * xi, 1 ≤ i ≤ n
  – * is an associative operator
• Algorithm
  – X1 = x1
  – for i = 2 to n
    • Xi = Xi-1 * xi
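The loop above, rendered directly in Python with + taken as the associative operator *:

```python
def seq_prefix(xs):
    X = [xs[0]]                      # X1 = x1
    for i in range(1, len(xs)):
        X.append(X[i - 1] + xs[i])   # Xi = Xi-1 * xi
    return X

print(seq_prefix([1, 2, 3, 4]))      # [1, 3, 6, 10]
```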

Page 22: Introduction to parallel algorithms

Parallel prefix

• Input
  – Processor i has xi
• Output
  – Processor i has x1 * x2 * ... * xi
• Divide and conquer
  – f(a,b) yields the following, for a ≤ i ≤ b
    • Xi = xa * ... * xi, on Proc Pi (the prefix)
    • Bi = xa * ... * xb, on Proc Pi (the block total)
  – f(1,n) solves the problem
• Define f(a,b) as follows
  – if a == b
    • Xi = xi, on Proc Pi
    • Bi = xi, on Proc Pi
  – else
    • compute in parallel
      – f(a, (a+b)/2)
      – f((a+b)/2+1, b)
    • Pi and Pj exchange their block totals, for a ≤ i ≤ (a+b)/2 and j = i + (b-a+1)/2
      – Xj = Bi * Xj on Pj
      – Bi = Bj = Bi * Bj on both Pi and Pj
  – T(n) = T(n/2) + 2 + (ts + tb)  =>  T(n) = O(log n)
  – An iterative implementation improves the constant

Page 23: Introduction to parallel algorithms

Iterative parallel prefix example

Round 0:  x0  x1   x2   x3   x4   x5   x6   x7
Round 1:  x0  x01  x12  x23  x34  x45  x56  x67
Round 2:  x0  x01  x02  x03  x14  x25  x36  x47
Round 3:  x0  x01  x02  x03  x04  x05  x06  x07

(xij denotes x_i * ... * x_j; in round r, every position i ≥ 2^(r-1) combines with position i - 2^(r-1))
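A Python sketch of this iterative schedule; each pass simulates one communication round, with + standing in for the associative operator:

```python
import operator

def parallel_prefix(values, op=operator.add):
    vals = list(values)
    dist = 1
    while dist < len(vals):              # log2(n) rounds
        # Every position i >= dist combines with position i - dist;
        # on a real machine all updates in a round are simultaneous.
        vals = [op(vals[i - dist], v) if i >= dist else v
                for i, v in enumerate(vals)]
        dist *= 2
    return vals

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```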

Page 24: Introduction to parallel algorithms

Algorithms

• Linear recurrence
• Matrix vector multiplication

Page 25: Introduction to parallel algorithms

Linear recurrence

• Determine each xi, 2 ≤ i ≤ n
  – xi = ai xi-1 + bi xi-2
  – x0 and x1 are given
• Sequential solution
  – for i = 2 to n
    • xi = ai xi-1 + bi xi-2
  – Follows directly from the recurrence
  – This approach is not easily parallelized

Page 26: Introduction to parallel algorithms

Linear recurrence in parallel

• Given xi = ai xi-1 + bi xi-2
  – x2i = a2i x2i-1 + b2i x2i-2
  – x2i+1 = a2i+1 x2i + b2i+1 x2i-1
• Rewrite this in matrix form

  [ x2i   ]   [ b2i         a2i               ] [ x2i-2 ]
  [ x2i+1 ] = [ a2i+1 b2i   b2i+1 + a2i+1 a2i ] [ x2i-1 ]

      Xi    =              Ai                     Xi-1

• Xi = Ai Ai-1 ... A1 X0
• This is a parallel prefix computation, since matrix multiplication is associative
• Solved in O(log n) time
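A quick check of the matrix formulation (the coefficients below are arbitrary test values); the loop multiplies the Ai sequentially here, but because matrix multiplication is associative, the product could be evaluated with parallel prefix:

```python
import numpy as np

# Arbitrary test coefficients a[i], b[i] for i = 2..7, with x0 = x1 = 1.
a = [0, 0, 2.0, 1.0, 3.0, 0.5, 1.0, 2.0]
b = [0, 0, 1.0, 2.0, 0.5, 1.0, 2.0, 3.0]
x = [1.0, 1.0]
for i in range(2, 8):                       # direct recurrence
    x.append(a[i] * x[i - 1] + b[i] * x[i - 2])

X = np.array([x[0], x[1]])                  # X0 = (x0, x1)
for i in range(1, 4):                       # Xi = Ai Xi-1
    a0, b0, a1, b1 = a[2 * i], b[2 * i], a[2 * i + 1], b[2 * i + 1]
    A = np.array([[b0, a0],
                  [a1 * b0, b1 + a1 * a0]])
    X = A @ X
print(np.allclose(X, [x[6], x[7]]))         # True
```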

Page 27: Introduction to parallel algorithms

Matrix-vector multiplication

• c = A b
  – Often performed repeatedly
    • bi = A bi-1
  – We need the same data distribution for c and b
• One dimensional decomposition
  – Example: row-wise block striped for A
    • b and c replicated
  – Each process computes its components of c independently
  – Then all-gather the components of c

Page 28: Introduction to parallel algorithms

1-D matrix-vector multiplication

• Each process computes its components of c independently
  – Time = Θ(n²/P)
• Then all-gather the components of c
  – Time = ts log P + tb n
• Note: P ≤ n

[Figure: c replicated, A row-wise block striped, b replicated]
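A sketch of the row-wise scheme, simulating the P processes with a loop (numpy for the local products; the all-gather becomes a concatenation):

```python
import numpy as np

n, P = 8, 4
rng = np.random.default_rng(0)
A = rng.random((n, n))
b = rng.random(n)                  # b is replicated on every process

# Each "process" p owns n/P rows of A and computes its block of c.
blocks = [A[p * (n // P):(p + 1) * (n // P)] @ b for p in range(P)]
c = np.concatenate(blocks)         # the all-gather step
print(np.allclose(c, A @ b))       # True
```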

Page 29: Introduction to parallel algorithms

2-D matrix-vector multiplication

• Processes Pi0 send Bi to P0i
  – Time: ts + tb n/√P
• Processes P0j broadcast Bj to all Pij
  – Time = ts log √P + tb n log √P / √P
• Processes Pij compute Cij = Aij Bj
  – Time = Θ(n²/P)
• Processes Pij reduce Cij onto Pi0, 0 ≤ i < √P
  – Time = ts log √P + tb n log √P / √P
• Total time = Θ(n²/P + ts log P + tb n log P / √P)
  – P ≤ n²
  – More scalable than the 1-dimensional decomposition

[Figure: 4×4 block layout]
A00 A01 A02 A03     B0     C0
A10 A11 A12 A13     B1     C1
A20 A21 A22 A23     B2     C2
A30 A31 A32 A33     B3     C3
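A sketch of the 2-D scheme on a √P × √P grid, with the sends, broadcasts, and reductions collapsed into block bookkeeping:

```python
import numpy as np

n, q = 8, 2                            # q = sqrt(P), so P = 4 processes
rng = np.random.default_rng(1)
A = rng.random((n, n))
b = rng.random(n)
s = n // q                             # block size

Ablk = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
Bblk = [b[j*s:(j+1)*s] for j in range(q)]   # Bj after the send + broadcast

# Cell (i, j) computes Aij @ Bj; summing over j is the row reduction.
c = np.concatenate([sum(Ablk[i][j] @ Bblk[j] for j in range(q))
                    for i in range(q)])
print(np.allclose(c, A @ b))           # True
```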

Page 30: Introduction to parallel algorithms

Important points

• Efficiency
  – Increases with increase in problem size
  – Decreases with increase in number of processors
• Aggregation of tasks to increase granularity
  – Reduces communication overhead
• Data distribution
  – 2-dimensional may be more scalable than 1-dimensional
  – Has an effect on load balance too
• General techniques
  – Divide and conquer
  – Pipelining