Introduction to parallel algorithms
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
COT 5410 – Spring 2004
Outline
• Background
• Primitives
• Algorithms
• Important points
Background
• Terminology
  – Time complexity
  – Speedup
  – Efficiency
  – Scalability
• Communication cost model
Time complexity
• Parallel computation
  – A group of processors work together to solve a problem
  – Time required for the computation is the period from when the first processor starts working until the last processor stops
[Figure: execution timelines for four cases: sequential, parallel (bad), parallel (ideal), and parallel (realistic)]
Other terminology
• Speedup: $S = T_1 / T_P$
• Efficiency: $E = S / P$
• Work: $W = P \, T_P$
• Scalability
  – How does $T_P$ decrease as we increase $P$ to solve the same problem?
  – How should the problem size increase with $P$ to keep $E$ constant?
Notation
• $P$ = number of processors
• $T_1$ = time on one processor
• $T_P$ = time on $P$ processors
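The three definitions above can be checked numerically. The following is a minimal sketch; the function names and the timings in the usage lines are hypothetical, chosen only for illustration.

```python
def speedup(t1, tp):
    """Speedup S = T_1 / T_P."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency E = S / P."""
    return speedup(t1, tp) / p


def work(tp, p):
    """Work W = P * T_P."""
    return p * tp


# Hypothetical timings: 100 s on one processor, 30 s on 4 processors.
t1, tp, p = 100.0, 30.0, 4
print("speedup:", speedup(t1, tp))
print("efficiency:", efficiency(t1, tp, p))
print("work:", work(tp, p))
```

An efficiency below 1 means the work $P \, T_P$ exceeds the sequential time $T_1$.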
Communication cost model
• Processes spend some time doing useful work, and some time communicating
• Model communication cost as
  – $T_C = t_s + L \, t_b$
  – $t_s$ = start-up cost per message, $t_b$ = transfer time per unit of data
  – $L$ = message size
  – Independent of location of processes
  – Any process can communicate with any other process
  – A process can simultaneously send and receive one message
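As a small illustration of the cost model, the sketch below compares one large message with many small ones. The function name `comm_time` and the parameter values are assumptions made for this example, not measurements.

```python
def comm_time(msg_len, ts, tb):
    """Cost of one message of length msg_len: T_C = t_s + L * t_b."""
    return ts + msg_len * tb


# Hypothetical parameters: start-up cost 10, per-unit transfer cost 0.01.
ts, tb = 10.0, 0.01
one_big = comm_time(1000, ts, tb)           # one message of 1000 units
many_small = 1000 * comm_time(1, ts, tb)    # 1000 messages of 1 unit each
print(one_big, many_small)
```

Under this model the start-up term dominates when messages are short, which is why aggregating data into fewer messages reduces communication overhead.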
I/O model
• We will ignore I/O issues, for the most part
• We will assume that input and output are distributed across the processors in a manner of our choosing
• Example: Sorting
  – Input: $x_1, x_2, \ldots, x_n$
    • Initially, $x_i$ is on processor $i$
  – Output: $x_{p_1}, x_{p_2}, \ldots, x_{p_n}$
    • $x_{p_i}$ on processor $i$
    • $x_{p_i} \le x_{p_{i+1}}$
Primitives
• Reduction
• Broadcast
• Gather/Scatter
• All gather
• Prefix
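Reduction -- 1
• Compute $x_1 + x_2 + \cdots + x_n$
• $T_n = (n-1) + (n-1)(t_s + t_b)$
• $S_n = 1 / (1 + t_s + t_b)$
[Figure: processes holding $x_2, x_3, x_4, \ldots, x_n$ arranged around the process holding $x_1$]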
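Reduction -- 2
• $T_n = (n/2 - 1) + (n/2 - 1)(t_s + t_b) + (t_s + t_b) + 1 = n/2 + (n/2)(t_s + t_b)$
• $S_n \sim 2 / (1 + t_s + t_b)$
[Figure: Reduction-1 for $\{x_1, \ldots, x_{n/2}\}$ and Reduction-1 for $\{x_{n/2+1}, \ldots, x_n\}$ side by side, with results at $x_1$ and $x_{n/2+1}$]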
Reduction -- 3
• Apply reduction-2 recursively
  – Divide and conquer
• $T_n \sim \log_2 n + (t_s + t_b) \log_2 n$
• $S_n \sim (n / \log_2 n) \times 1 / (1 + t_s + t_b)$
• Note that any associative operator can be used in place of +
[Figure: recursion tree. Top level: Reduction-1 for $\{x_1, \ldots, x_{n/2}\}$ and for $\{x_{n/2+1}, \ldots, x_n\}$, with results at $x_1$ and $x_{n/2+1}$. Next level: Reduction-1 for each quarter $\{x_1, \ldots, x_{n/4}\}$, $\{x_{n/4+1}, \ldots, x_{n/2}\}$, $\{x_{n/2+1}, \ldots, x_{3n/4}\}$, $\{x_{3n/4+1}, \ldots, x_n\}$, with results at $x_1$, $x_{n/4+1}$, $x_{n/2+1}$, $x_{3n/4+1}$]
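The recursive scheme can be simulated sequentially to see why it needs about $\log_2 n$ rounds. This is a minimal sketch, not the course's implementation: the list `vals` stands in for one value per process, and the name `tree_reduce` is hypothetical.

```python
import operator


def tree_reduce(values, op=operator.add):
    """Simulate a tree reduction over a non-empty sequence.

    Returns (result, rounds). In each round, the process at index i
    (a multiple of 2*stride) combines its value with the value held
    at index i + stride. All combinations within a round are
    independent, so they could run in parallel.
    """
    vals = list(values)
    n = len(vals)
    stride, rounds = 1, 0
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


total, rounds = tree_reduce(range(1, 9))
print(total, rounds)
```

Operands are always combined in left-to-right order, so the sketch also works for associative operators that are not commutative, such as string concatenation.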
Parallel addition features
• If $n \gg P$
  – Each processor adds $n/P$ distinct numbers
  – Perform parallel reduction on $P$ numbers
  – $T_P \sim n/P + (1 + t_s + t_b) \log P$
  – Optimal $P$ obtained by differentiating with respect to $P$
    • $P_{opt} \sim n / (1 + t_s + t_b)$
    • If communication cost is high, then fewer processors ought to be used
  – $E = [1 + (1 + t_s + t_b) P \log P / n]^{-1}$
    • As problem size increases, efficiency increases
    • As number of processors increases, efficiency decreases
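The trade-off can be explored numerically. The sketch below evaluates the time and efficiency models with a base-2 logarithm (an assumption; the slide does not fix the base) and finds the best $P$ by brute force. With base 2 the continuous optimum is $n \ln 2 / (1 + t_s + t_b)$, which agrees with the estimate above up to the constant $\ln 2$. All names and parameter values are hypothetical.

```python
import math


def t_parallel(n, p, ts, tb):
    """Model time T_P ~ n/P + (1 + t_s + t_b) log2 P."""
    return n / p + (1 + ts + tb) * math.log2(p)


def model_efficiency(n, p, ts, tb):
    """Model efficiency E = 1 / (1 + (1 + t_s + t_b) P log2 P / n)."""
    return 1.0 / (1.0 + (1 + ts + tb) * p * math.log2(p) / n)


def best_p(n, ts, tb, p_max):
    """Processor count in 1..p_max that minimizes the model time."""
    return min(range(1, p_max + 1), key=lambda p: t_parallel(n, p, ts, tb))


# Hypothetical parameters: n = 10000 numbers, t_s = 10, t_b = 1.
n, ts, tb = 10_000, 10.0, 1.0
p_star = best_p(n, ts, tb, 4096)
print(p_star, n * math.log(2) / (1 + ts + tb))
print(model_efficiency(n, 64, ts, tb), model_efficiency(100 * n, 64, ts, tb))
```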
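Some common collective operations
• Broadcast: the root holds A; afterwards every process holds A
• Gather: the processes hold A, B, C, D, one each; afterwards the root holds A, B, C, D
• Scatter: the root holds A, B, C, D; afterwards the processes hold A, B, C, D, one each
• All gather: the processes hold A, B, C, D, one each; afterwards every process holds A, B, C, D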
Broadcast
• $T \sim (t_s + L \, t_b) \log P$
  – $L$: length of data
[Figure: broadcast from $x_1$ among 8 processes $x_1, \ldots, x_8$, shown in successive steps]
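One schedule that meets the $\log P$ bound doubles the set of processes holding the data in every round. The sketch below builds such a schedule; the 0-based process numbering, the function names, and the choice of partner are assumptions for illustration and need not match the figure.

```python
def broadcast_schedule(p):
    """Rounds of (sender, receiver) pairs for a broadcast from process 0.

    Before each round, processes 0..reach-1 hold the data. Each of them
    sends to the process `reach` positions away, so the number of
    holders doubles and every process sends at most one message per round.
    """
    rounds = []
    reach = 1
    while reach < p:
        rounds.append([(i, i + reach) for i in range(reach) if i + reach < p])
        reach *= 2
    return rounds


def broadcast_time(p, msg_len, ts, tb):
    """Model cost: one message of length msg_len per round."""
    return len(broadcast_schedule(p)) * (ts + msg_len * tb)


for k, sends in enumerate(broadcast_schedule(8), start=1):
    print("round", k, sends)
```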
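Gather/Scatter
• Gather: Data move towards the root
• Scatter: Review question
• $T \sim t_s \log P + P L \, t_b$
[Figure: gather tree over 8 processes. Blocks combine as $x_1, \ldots, x_8$, then $x_{12}, x_{34}, x_{56}, x_{78}$, then $x_{14}, x_{58}$, then $x_{18}$ at the root, with message sizes $L$, $2L$, and $4L$ at successive levels]
Note: $\sum_{i=0}^{\log P - 1} 2^i = (2^{\log P} - 1)/(2 - 1) = P - 1 \sim P$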
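The gather cost can be accumulated round by round to confirm the geometric sum in the note. This is a small sketch with hypothetical names and parameter values; it assumes $P$ is a power of two.

```python
def gather_time(p, msg_len, ts, tb):
    """Sum the per-round cost of a tree gather over p processes.

    In round i the messages carry 2**i blocks of length msg_len,
    so the round costs t_s + 2**i * L * t_b.
    """
    total, blocks = 0.0, 1
    while blocks < p:
        total += ts + blocks * msg_len * tb
        blocks *= 2
    return total


# For P = 8: three rounds, carrying 1 + 2 + 4 = 7 = P - 1 blocks in total.
print(gather_time(8, 1, 10.0, 1.0))
```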
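All gather
• Equivalent to each processor broadcasting to all the processors
• [Figure, shown in four steps for 8 processes; $x_{ab}$ denotes blocks $a$ through $b$]
  – Start: each process holds its own block, $x_1, \ldots, x_8$
  – After exchanging messages of size $L$: each process holds $x_{12}$, $x_{34}$, $x_{56}$, or $x_{78}$
  – After exchanging messages of size $2L$: each process holds $x_{14}$ or $x_{58}$
  – After exchanging messages of size $4L$: every process holds $x_{18}$
• $T \sim t_s \log P + P L \, t_b$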
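The doubling pattern in the figure can be simulated as follows. This is a minimal sketch assuming the number of processes is a power of two and that partners are chosen by flipping one bit of the 0-based process index; the name `all_gather` is hypothetical.

```python
def all_gather(items):
    """Simulate an all-gather by recursive doubling.

    items[i] is the block initially held by process i. Returns
    (held, rounds), where held[i] is the list of blocks on process i
    at the end. The message size doubles every round.
    """
    p = len(items)
    if p < 1 or p & (p - 1):
        raise ValueError("number of processes must be a power of two")
    held = [[x] for x in items]
    dist, rounds = 1, 0
    while dist < p:
        new_held = []
        for i in range(p):
            partner = i ^ dist
            lo, hi = min(i, partner), max(i, partner)
            # Both partners end up with the same combined, ordered blocks.
            new_held.append(held[lo] + held[hi])
        held = new_held
        dist *= 2
        rounds += 1
    return held, rounds


held, rounds = all_gather(list("abcdefgh"))
print(rounds, held[0], held[5])
```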
Review question: Pipelining
• Useful when repeatedly and regularly performing a large number of primitive operations
  – Optimal time for a broadcast = $\log P$
    • But doing this $n$ times takes $n \log P$ time
  – Pipelining the broadcasts takes $n + P$ time
    • Almost constant amortized time per broadcast if $n \gg P$
    • $n + P \ll n \log P$ when $n \gg P$
• Review question: How can you accomplish this time complexity?
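One possible scheme, offered only as a sketch and not necessarily the intended answer, forwards the messages along a chain of processes: each process receives one message and sends one message per step, which the communication model allows. The function name and the unit cost per step are assumptions.

```python
def pipelined_broadcast_steps(n, p):
    """Count steps to deliver n messages along the chain 0 -> 1 -> ... -> p-1.

    received[j] is the number of messages process j holds so far.
    In every step, each process forwards its next undelivered message
    to its successor; all sends in a step happen simultaneously.
    """
    received = [n] + [0] * (p - 1)
    steps = 0
    while received[-1] < n:
        snapshot = list(received)  # state at the start of the step
        for j in range(1, p):
            if snapshot[j - 1] > snapshot[j]:
                received[j] += 1
        steps += 1
    return steps


# Compare with n separate tree broadcasts, which need n * log2(P) steps.
print(pipelined_broadcast_steps(1000, 16), 1000 * 4)
```

In this simulation the last process receives the final message after $n + P - 2$ steps for $P \ge 2$, in line with the $n + P$ estimate.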
Sequential prefix
• Input
  – Values $x_i$, $1 \le i \le n$
• Output
  – $X_i = x_1 * x_2 * \cdots * x_i$, $1 \le i \le n$
  – $*$ is an associative operator
• Algorithm
  – $X_1 = x_1$
  – for $i = 2$ to $n$
    • $X_i = X_{i-1} * x_i$
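The sequential algorithm translates directly into a short loop. This sketch uses 0-based Python lists and a hypothetical function name; the standard library's `itertools.accumulate` computes the same result.

```python
import operator


def sequential_prefix(xs, op=operator.add):
    """Return [x_1, x_1*x_2, ..., x_1*...*x_n] for an associative op."""
    if not xs:
        return []
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))
    return out


print(sequential_prefix([1, 2, 3, 4]))
print(sequential_prefix(["a", "b", "c"]))
```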
Parallel prefix
• Input
  – Processor $i$ has $x_i$
• Output
  – Processor $i$ has $x_1 * x_2 * \cdots * x_i$
• Divide and conquer
  – $f(a,b)$ yields the following, for $a \le i \le b$
    • $X_i = x_a * \cdots * x_i$, on Proc $P_i$
    • $Y_i = x_a * \cdots * x_b$, on Proc $P_i$
  – $f(1,n)$ solves the problem
• Define $f(a,b)$ as follows
  – if $a == b$
    • $X_i = x_i$, on Proc $P_i$
    • $Y_i = x_i$, on Proc $P_i$
  – else
    • compute in parallel
      – $f(a, (a+b)/2)$
      – $f((a+b)/2 + 1, b)$
    • $P_i$ and $P_j$ send $Y_i$ and $Y_j$ to each other, respectively
      – $a \le i \le (a+b)/2$
      – $j = i + (b-a+1)/2$
    • $Y_i = Y_i * Y_j$ on $P_i$
    • $X_j = Y_i * X_j$ on $P_j$
    • $Y_j = Y_i * Y_j$ on $P_j$
  – $T(n) = T(n/2) + 2 + (t_s + t_b)$ => $T(n) = O(\log n)$
  – An iterative implementation improves the constant
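The divide-and-conquer definition can be simulated on one machine. In this sketch each process keeps two values, here called `prefix[i]` (the product from the start of its current block up to $i$) and `total[i]` (the product of the whole current block); these names, the 0-based indexing, and the power-of-two restriction are assumptions of the example.

```python
import operator


def parallel_prefix(xs, op=operator.add):
    """Simulate the recursive parallel prefix; len(xs) must be a power of two."""
    n = len(xs)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    prefix = list(xs)  # prefix[i] = x_a * ... * x_i within the current block
    total = list(xs)   # total[i]  = x_a * ... * x_b for the current block

    def f(a, b):
        if a == b:
            return
        mid = (a + b) // 2
        f(a, mid)          # the two halves are independent and
        f(mid + 1, b)      # would run in parallel on real processes
        half = mid - a + 1
        for i in range(a, mid + 1):
            j = i + half
            left, right = total[i], total[j]   # the exchanged block totals
            prefix[j] = op(left, prefix[j])    # right half absorbs the left total
            total[i] = op(left, right)
            total[j] = op(left, right)

    f(0, n - 1)
    return prefix


print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
```

Each level of the recursion involves one exchange and a constant number of operations per process, which is where the $O(\log n)$ bound comes from.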
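Iterative parallel prefix example
($x_{ab}$ denotes $x_a * \cdots * x_b$; each column is one processor, and a blank entry means the value is unchanged)
Initial: x0  x1   x2   x3   x4   x5   x6   x7
Step 1:      x01  x12  x23  x34  x45  x56  x67
Step 2:           x02  x03  x14  x25  x36  x47
Step 3:                     x04  x05  x06  x07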
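The example follows a doubling pattern: in step $k$, processor $i$ combines its value with the value held $2^{k-1}$ positions to its left. A minimal simulation, with a hypothetical function name, is:

```python
import operator


def iterative_prefix(xs, op=operator.add):
    """Simulate the iterative parallel prefix in about log2(n) steps."""
    vals = list(xs)
    n = len(vals)
    dist = 1
    while dist < n:
        prev = list(vals)  # all processors read values from before this step
        for i in range(dist, n):
            vals[i] = op(prev[i - dist], prev[i])
        dist *= 2
    return vals


print(iterative_prefix(list(range(8))))
```

The snapshot `prev` models the fact that all exchanges within a step happen simultaneously.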
Algorithms
• Linear recurrence
• Matrix vector multiplication
Linear recurrence
• Determine each $x_i$, $2 \le i \le n$
  – $x_i = a_i x_{i-1} + b_i x_{i-2}$
  – $x_0$ and $x_1$ are given
• Sequential solution
  – for $i = 2$ to $n$
    • $x_i = a_i x_{i-1} + b_i x_{i-2}$
  – Follows directly from the recurrence
  – This approach is not easily parallelized
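Linear recurrence in parallel
• Given $x_i = a_i x_{i-1} + b_i x_{i-2}$
  – $x_{2i} = a_{2i} x_{2i-1} + b_{2i} x_{2i-2}$
  – $x_{2i+1} = a_{2i+1} x_{2i} + b_{2i+1} x_{2i-1}$
• Rewrite this in matrix form
  $$\underbrace{\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}}_{X_i} = \underbrace{\begin{pmatrix} b_{2i} & a_{2i} \\ a_{2i+1} b_{2i} & b_{2i+1} + a_{2i+1} a_{2i} \end{pmatrix}}_{A_i} \underbrace{\begin{pmatrix} x_{2i-2} \\ x_{2i-1} \end{pmatrix}}_{X_{i-1}}$$
• $X_i = A_i A_{i-1} \cdots A_1 X_0$
• This is a parallel prefix computation, since matrix multiplication is associative
• Solved in $O(\log n)$ time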
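The matrix formulation can be checked against the direct recurrence. In the sketch below the running products $A_i A_{i-1} \cdots A_1$ are formed by a plain loop for brevity; on a parallel machine that loop is the part a prefix computation would replace. The function names are hypothetical, the coefficient lists are indexed so that `a[i]` and `b[i]` hold $a_i$ and $b_i$ (entries 0 and 1 unused), and $n$ is assumed odd so the values split into pairs.

```python
def mat_mul(m, k):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [m[0][0] * k[0][0] + m[0][1] * k[1][0], m[0][0] * k[0][1] + m[0][1] * k[1][1]],
        [m[1][0] * k[0][0] + m[1][1] * k[1][0], m[1][0] * k[0][1] + m[1][1] * k[1][1]],
    ]


def recurrence_direct(a, b, x0, x1, n):
    """x_i = a_i x_{i-1} + b_i x_{i-2}, computed one term at a time."""
    x = [x0, x1]
    for i in range(2, n + 1):
        x.append(a[i] * x[i - 1] + b[i] * x[i - 2])
    return x


def recurrence_by_prefix(a, b, x0, x1, n):
    """Same values via prefix products of the 2x2 matrices A_i (n odd)."""
    x = [x0, x1]
    prod = [[1, 0], [0, 1]]  # identity; becomes A_i ... A_1
    for i in range(1, (n - 1) // 2 + 1):
        e, o = 2 * i, 2 * i + 1
        A = [[b[e], a[e]], [a[o] * b[e], b[o] + a[o] * a[e]]]
        prod = mat_mul(A, prod)
        x.append(prod[0][0] * x0 + prod[0][1] * x1)  # x_{2i}
        x.append(prod[1][0] * x0 + prod[1][1] * x1)  # x_{2i+1}
    return x


ones = [1] * 8
print(recurrence_by_prefix(ones, ones, 1, 1, 7))
```

With all coefficients equal to 1 and $x_0 = x_1 = 1$, the recurrence produces the Fibonacci numbers, which makes a convenient sanity check.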
Matrix-vector multiplication
• $c = A b$
  – Often performed repeatedly
    • $b_i = A b_{i-1}$
  – We need same data distribution for $c$ and $b$
• One dimensional decomposition
  – Example: row-wise block striped for $A$
    • $b$ and $c$ replicated
  – Each process computes its components of $c$ independently
  – Then all-gather the components of $c$
1-D matrix-vector multiplication
• Each process computes its components of $c$ independently
  – Time = $\Theta(n^2/P)$
• Then all-gather the components of $c$
  – Time = $t_s \log P + t_b n$
• Note: $P \le n$
[Figure: $c$ replicated, $A$ row-wise block striped, $b$ replicated]
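The two phases of the 1-D scheme can be sketched sequentially. The function name `matvec_1d` is hypothetical, plain nested lists stand in for the distributed data, and $n$ is assumed divisible by $P$.

```python
def matvec_1d(A, b, p):
    """Simulate c = A b with A split into p row blocks and b replicated."""
    n = len(A)
    if n % p:
        raise ValueError("n must be divisible by p")
    rows = n // p
    # Phase 1: process q computes its own block of c from its rows of A.
    local_blocks = [
        [sum(A[r][k] * b[k] for k in range(n)) for r in range(q * rows, (q + 1) * rows)]
        for q in range(p)
    ]
    # Phase 2: all-gather, so every process ends up with the full vector c.
    return [value for block in local_blocks for value in block]


A = [[1, 2], [3, 4]]
print(matvec_1d(A, [1, 1], 2))
```

Because $c$ ends up replicated like $b$, the result can be fed straight into the next multiplication when the product is applied repeatedly.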
2-D matrix-vector multiplication
• Processes $P_{i0}$ send $B_i$ to $P_{0i}$
  – Time = $t_s + t_b n / \sqrt{P}$
• Processes $P_{0j}$ broadcast $B_j$ to all $P_{ij}$
  – Time = $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
• Processes $P_{ij}$ compute $C_{ij} = A_{ij} B_j$
  – Time = $\Theta(n^2/P)$
• Processes $P_{ij}$ reduce $C_{ij}$ onto $P_{i0}$, $0 \le i < \sqrt{P}$
  – Time = $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
• Total time = $\Theta(n^2/P + t_s \log P + t_b n \log P / \sqrt{P})$
  – $P \le n^2$
  – More scalable than 1-dimensional decomposition
[Figure: $A$ partitioned into a 4 x 4 grid of blocks $A_{00}, \ldots, A_{33}$, with vector blocks $B_0, \ldots, B_3$ and result blocks $C_0, \ldots, C_3$]
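The block structure of the 2-D scheme can be sketched sequentially as well. Here `q` plays the role of $\sqrt{P}$, the function name `matvec_2d` is hypothetical, and $n$ is assumed divisible by `q`; the communication steps are represented only by how the data is sliced and summed.

```python
def matvec_2d(A, b, q):
    """Simulate c = A b on a q x q process grid (P = q * q)."""
    n = len(A)
    if n % q:
        raise ValueError("n must be divisible by q")
    m = n // q
    # After the send and broadcast steps, grid column j holds block B_j of b.
    B = [b[j * m:(j + 1) * m] for j in range(q)]

    def local_product(i, j):
        """C_ij = A_ij B_j, computed by process P_ij."""
        return [
            sum(A[i * m + r][j * m + k] * B[j][k] for k in range(m))
            for r in range(m)
        ]

    c = []
    for i in range(q):
        partial = [local_product(i, j) for j in range(q)]
        # Sum-reduction along grid row i yields block C_i on P_i0.
        c.extend(sum(terms) for terms in zip(*partial))
    return c


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(matvec_2d(A, [1, 0, 2, 1], 2))
```

Each process works on an $(n/\sqrt{P}) \times (n/\sqrt{P})$ block and exchanges vectors of length $n/\sqrt{P}$, which is where the smaller communication term comes from.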
Important points
• Efficiency
  – Increases with increase in problem size
  – Decreases with increase in number of processors
• Aggregation of tasks to increase granularity
  – Reduces communication overhead
• Data distribution
  – 2-dimensional may be more scalable than 1-dimensional
  – Has an effect on load balance too
• General techniques
  – Divide and conquer
  – Pipelining