Parallel programming using MPI
Analysis and optimization
Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Outline
- Parallel programming: basic definitions
- Choosing the right algorithms: optimal serial and parallel
- Load balancing: rank ordering, domain decomposition
- Blocking vs. non-blocking: overlap computation and communication
- MPI-IO and avoiding I/O bottlenecks
- Hybrid programming models: MPI + OpenMP, MPI + accelerators for GPU clusters
Choosing right algorithms: How does your serial algorithm scale?
| Notation | Name | Example |
|---|---|---|
| O(1) | constant | Determining if a number is even or odd; using a hash table |
| O(log log n) | double log | Finding an item using interpolation search in a sorted array |
| O(log n) | log | Finding an item in a sorted array with a binary search |
| O(n^c), c<1 | fractional power | Searching in a kd-tree |
| O(n) | linear | Finding an item in an unsorted list or in an unsorted array |
| O(n log n) | loglinear | Performing a fast Fourier transform; heapsort, quicksort |
| O(n^2) | quadratic | Naive bubble sort |
| O(n^c), c>1 | polynomial | Matrix multiplication, inversion |
| O(c^n), c>1 | exponential | Finding the (exact) solution to the travelling salesman problem |
| O(n!) | factorial | Generating all unrestricted permutations of a poset |
Parallel programming concepts: Basic definitions
Parallel programming concepts: Performance metrics

Speedup: ratio of serial to parallel execution time
S = ts / tp

Efficiency: ratio of speedup to the number of processors
E = S / p

Work: product of parallel time and number of processors
W = p · tp

Parallel overhead: idle time wasted in parallel execution
To = p · tp - ts
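The four metrics above can be written as tiny helper functions; a minimal sketch in Python (function names are ours, for illustration only):

```python
# t_s: serial time, t_p: parallel time, p: number of processors.

def speedup(t_s, t_p):
    # S = t_s / t_p
    return t_s / t_p

def efficiency(t_s, t_p, p):
    # E = S / p
    return speedup(t_s, t_p) / p

def work(t_p, p):
    # W = p * t_p
    return p * t_p

def overhead(t_s, t_p, p):
    # To = p * t_p - t_s
    return p * t_p - t_s
```

For example, a 100 s serial job that runs in 10 s on 20 processors has speedup 10, efficiency 0.5, and 100 s of parallel overhead.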
Parallel programming concepts Analyzing serial vs parallel algorithms
Ask yourself:
- What fraction f of the code can you completely parallelize?
- How does the problem size n scale as the number of processors p grows? n(p)?
- How does the parallel overhead grow? To(p)?
- Does the memory requirement scale? M(p)?
Serial drawbacks: serial percentage, single core
Parallel drawbacks: contention (memory, network), communication/idling, extra computation, latency
Parallel programming concepts Analyzing serial vs parallel algorithms
Amdahl’s law
What fraction of the code can you completely parallelize ?
f ?
Serial time: ts Parallel time: fts + (1-f) (ts/p)
S = ts / tp = ts / ( f·ts + (1 - f)·ts/p ) = 1 / ( f + (1 - f)/p )
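The derived formula is easy to check numerically; a small sketch (the function name amdahl_speedup is ours):

```python
def amdahl_speedup(f, p):
    # Amdahl's law: S = 1 / (f + (1 - f)/p), serial fraction f, p processors.
    return 1.0 / (f + (1.0 - f) / p)
```

With f = 0.05 the speedup approaches 20 as p grows, no matter how many processors are added.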
Parallel programming concepts Analyzing serial vs parallel algorithms
Quiz
If the serial fraction is 5%, what is the maximum speedup you can achieve?
Serial time = 100 secs
Serial percentage = 5 %
Maximum speedup ?
Parallel programming concepts Analyzing serial vs parallel algorithms
Amdahl’s law
What fraction of the code can you completely parallelize ?
f ?
Serial time = 100 secs
Serial percentage = 5 %
S = 1 / ( 0.05 + (1 - f)/p ) → 1 / ( 0.05 + 0 ) = 20 as p → ∞
Parallel programming concepts Analyzing serial vs parallel algorithms
Amdahl's Law approximately suggests: "Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city."

Gustafson's Law approximately states: "Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled."
Source: http://disney.go.com/cars/ http://en.wikipedia.org/wiki/Gustafson's_law
Parallel programming concepts Analyzing serial vs parallel algorithms
Communication overhead. Simplest model:

Transfer time = startup time + hop time (node latency) + (message length)/bandwidth
             = ts + th·l + tw·m   (l hops, message of m words)
- Send one big message instead of several small messages
- Reduce the total number of bytes transferred
- Bandwidth depends on the protocol
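The model above can be sketched as a function to compare one large message against many small ones (all parameter values below are made up for illustration):

```python
def transfer_time(t_s, t_h, hops, t_w, m):
    # Simplest model: startup + per-hop latency * hops + per-word time * words.
    return t_s + t_h * hops + t_w * m

# One 1000-word message pays the startup cost once; ten 100-word
# messages pay it ten times.
one_big = transfer_time(10.0, 1.0, 2, 0.1, 1000)
ten_small = 10 * transfer_time(10.0, 1.0, 2, 0.1, 100)
```

Under these made-up numbers the aggregated message costs roughly half as much, which is why batching small messages pays off.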
Parallel programming concepts Analyzing serial vs parallel algorithms
Point-to-point (MPI_Send): ts + tw·m

Collective overhead:
All-to-all broadcast (MPI_Allgather): ts·log2 p + (p - 1)·tw·m
All-reduce (MPI_Allreduce): (ts + tw·m)·log2 p
Scatter and gather (MPI_Scatter): (ts + tw·m)·log2 p
All-to-all (personalized): (p - 1)·(ts + tw·m)
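These cost models can be written down directly; a sketch with hypothetical helper names, using the slide's ts (startup), tw (per-word time), m (message words), p (processors):

```python
from math import log2

def p2p_cost(ts, tw, m):
    # point-to-point: ts + tw*m
    return ts + tw * m

def allgather_cost(ts, tw, m, p):
    # all-to-all broadcast: ts*log2(p) + (p-1)*tw*m
    return ts * log2(p) + (p - 1) * tw * m

def allreduce_cost(ts, tw, m, p):
    # all-reduce: (ts + tw*m)*log2(p)
    return (ts + tw * m) * log2(p)

def alltoall_personalized_cost(ts, tw, m, p):
    # all-to-all personalized: (p-1)*(ts + tw*m)
    return (p - 1) * (ts + tw * m)
```

With ts = tw = m = 1 and p = 8, for instance, the personalized all-to-all is already more than twice as expensive as the all-reduce.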
Isoefficiency: Can we maintain efficiency/speedup of the algorithm?
How should the problem size scale with p to keep efficiency constant?
Parallel programming: Basic definitions
E = 1 / ( 1 + To/W )

Maintain the ratio To(W, p)/W of overhead to useful work constant.
Isoefficiency relation: To keep efficiency constant you must increase problem size such that
Procedure:
1. Get the sequential time T(n,1)
2. Get the parallel time T(n,p)
3. Calculate the overhead To = p·T(n,p) - T(n,1)
Parallel programming: Basic definitions
T(n,1) ≥ c·To(n, p)
How does the overhead compare to the useful work being done?
Isoefficiency relation: To keep efficiency constant you must increase problem size such that
Scalability: do you have enough resources (memory) to scale to that size?
Parallel programming: Basic definitions
T(n,1) ≥ c·To(n, p)

Maintain the ratio To(W, p)/W of overhead to useful work constant.
Scalability: do you have enough resources (memory) to scale to that size?
Parallel programming: Basic definitions
[Figure: memory needed per processor vs. number of processors. Requirements growing as Cp·log p or Cp eventually exceed the available memory size (efficiency cannot be maintained); a constant requirement C stays below it (efficiency can be maintained).]
Adding numbers
Each processor has n/p numbers.
Serial time: n
Communicate and add: log p + log p steps
4, 3, 9 7,5,8 5,2,9 1, 3, 7
Parallel time: n/p + 2 log p
Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p
Speedup:
S = n / ( n/p + 2 log p )

E = n / ( n + 2p log p )
Isoefficiency
If you increase problem size n as O(p log p)
then efficiency can remain constant !
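This isoefficiency claim can be checked numerically: with n = p·log2 p, the efficiency E = n / (n + 2p·log2 p) works out to the same constant (1/3) for every p. A sketch (function name is ours):

```python
from math import log2

def sum_efficiency(n, p):
    # E = n / (n + 2*p*log2(p)) for the tree-based parallel sum.
    return n / (n + 2 * p * log2(p))
```

Holding n fixed while p grows, by contrast, drives the efficiency down.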
Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p
Speedup:
E = n / ( n + 2p log p )

Scalability: memory per processor M(n)/p = n/p = (p log p)/p = log p
Sorting
Each processor has n/p numbers.
4, 3, 9 7,5,8 5,2,9 1, 3, 7
Our plan
1. Split list into parts 2. Sort parts individually 3. Merge lists to get sorted list
Sorting
Each processor has n/p numbers.
4, 3, 9 7,5,8 5,2,9 1, 3, 7
Background
Bubble sort: O(n^2), stable, exchanging
Selection sort: O(n^2), unstable, selection
Insertion sort: O(n^2), stable, insertion
Merge sort: O(n log n), stable, merging
Quick sort: O(n log n), unstable, partitioning
Choosing right algorithms: Optimal serial and parallel
Case Study: Bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
[ 5 1 4 2 ]
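The pseudocode above can be sketched directly (note it is really an exchange/selection-style sort, though the slides call it bubble sort); the function also counts the N(N-1)/2 comparisons:

```python
def exchange_sort(a):
    # The slide's compare-and-swap loop: after pass i, a[i] holds the
    # smallest remaining element. Returns (sorted copy, comparison count).
    a = list(a)
    comparisons = 0
    for i in range(len(a) - 1):
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[i]:
                a[i], a[j] = a[j], a[i]
    return a, comparisons
```

On the slide's input [5, 1, 4, 2] this performs 4·3/2 = 6 comparisons.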
Choosing right algorithms: Optimal serial and parallel
Case Study: Bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (Y = swapped, N = not swapped):
(i=1, j=2) [ 5 1 4 2 ] Y
(i=1, j=3) [ 1 5 4 2 ] N
(i=1, j=4) [ 1 5 4 2 ] N
(i=2, j=3) [ 1 5 4 2 ] Y
(i=2, j=4) [ 1 4 5 2 ] Y
(i=3, j=4) [ 1 2 5 4 ] Y
Choosing right algorithms: Optimal serial and parallel
Case Study: Bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (Y = swapped, N = not swapped):
(i=1, j=2) [ 5 1 4 2 ] Y
(i=1, j=3) [ 1 5 4 2 ] N
(i=1, j=4) [ 1 5 4 2 ] N
(i=2, j=3) [ 1 5 4 2 ] Y
(i=2, j=4) [ 1 4 5 2 ] Y
(i=3, j=4) [ 1 2 5 4 ] Y

Comparisons: N(N - 1)/2 = O(N^2)
Choosing right algorithms: Optimal serial and parallel
Case Study: Merge sort Recursively merge lists having one element each
Split: [ 5 1 4 2 ] → [ 5 1 ] [ 4 2 ] → [ 5 ] [ 1 ] [ 4 ] [ 2 ]
Merge: [ 5 ] [ 1 ] [ 4 ] [ 2 ] → [ 1 5 ] [ 2 4 ] → [ 1 2 4 5 ]
Choosing right algorithms: Optimal serial and parallel
Split: [ 5 1 4 2 ] → [ 5 1 ] [ 4 2 ] → [ 5 ] [ 1 ] [ 4 ] [ 2 ]
Merge: [ 5 ] [ 1 ] [ 4 ] [ 2 ] → [ 1 5 ] [ 2 4 ] → [ 1 2 4 5 ]
Best sorting algorithms need
O( N log N)
Case Study: Merge sort Recursively merge lists having one element each
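A minimal merge sort following the split/merge scheme above (names are ours):

```python
def merge(left, right):
    # Merge two sorted lists into one sorted list.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(a):
    # Recursively split, then merge: O(n log n).
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))
```

Running it on the slide's [5, 1, 4, 2] reproduces the tree shown above.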
Choosing right algorithms: Parallel sorting technique
Merging : Merge p lists having n/p elements each
Sub-optimal: pop-push merge on 1 processor: O(n·p)

[ 1 3 5 6 ] [ 2 4 6 8 ] => [ 1 ]
[ 3 5 6 ] [ 2 4 6 8 ] => [ 1 2 ]
[ 3 5 6 ] [ 4 6 8 ] => [ 1 2 3 ]
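The pop-push merge above can be sketched as a naive k-way merge: every output element scans all remaining list heads, giving O(n·p) work for p lists of n/p elements (function name is ours):

```python
def naive_kway_merge(lists):
    # Repeatedly pop the smallest head among the remaining lists.
    lists = [list(l) for l in lists]
    out = []
    while any(lists):
        # Scan all non-empty heads: this scan is what makes it O(n*p).
        k = min((i for i, l in enumerate(lists) if l),
                key=lambda i: lists[i][0])
        out.append(lists[k].pop(0))
    return out
```

A heap over the p heads would cut the per-element cost from O(p) to O(log p), which is the tree-based improvement on the next slide.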
Choosing right algorithms: Parallel sorting technique
Merging : Merge p lists having n/p elements each
Optimal: recursively, tree based: O(n log p)

Merge 8 lists on 4 processors
Merge 4 lists on 2 processors
Merge 2 lists on 1 processor

Best merge algorithms need O(n log p).
Sorting
Each processor has n/p numbers.
Serial time: n log n
Merge time: (n - 1) ≈ n
4, 3, 9 7,5,8 5,2,9 1, 3, 7
Parallel time: (n/p) log(n/p) + n/p + n log p
Sorting
Each processor has n/p numbers.
Overhead: To = p·tp - n log n ≈ n log p
4, 3, 9 7,5,8 5,2,9 1, 3, 7

Isoefficiency: n log n ≥ c·(n log p) ⇒ n ≥ p^c
Scalability: M/p = n/p = p^(c-1), low for c > 2
Data decomposition
[Figure: cells 1-9 of a 3 x 3 grid assigned row-wise to ranks R1, R2, R3.]
1D : Row-wise
Data decomposition
[Figure: cells 1-9 of a 3 x 3 grid assigned column-wise to ranks C1, C2, C3.]
1D : Column-wise
Data decomposition
2D : Block-wise

[Figure: cells 1-9 of a 3 x 3 grid assigned block-wise to blocks B1-B9.]
Data decomposition
a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)

Time to communicate 1 cell: tcomm_cell = ts + tw
Time to evaluate stencil once: tcomp_cell = 5 · tfloat
Laplace solver: (n x n) mesh with p processors
Data decomposition
Laplace solver: 1D Row-wise (n x n) with p processors
Proc 0
Proc 1
a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)
Data decomposition
a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)

Serial time: tseq = n^2 · tcomp_cell
Parallel computation time per process: (n^2 / p) · tcomp_cell
Ghost communication per process: 2n · tcomm_cell

Laplace solver: 1D Row-wise (n x n) with p processors
Data decomposition
a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)

Laplace solver: 1D Row-wise (n x n) with p processors

Overhead: To = p·tp - tseq ≈ p·(2n·tcomm_cell) ≈ p·n
Isoefficiency: n^2 ≥ c·n·p ⇒ n ≥ c·p
Scalability: M/p = n^2 / p = (c·p)^2 / p = c^2·p, poor scaling
Data decomposition
Laplace solver: 2D Block-wise
a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)
Proc 0 Proc 2
Proc 1 Proc 3
Data decomposition
Serial time: tseq = n^2 · tcomp_cell
Parallel computation time per process: (n^2 / p) · tcomp_cell
Ghost communication per process: 4 · (n/√p) · tcomm_cell

a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)

Laplace solver: 2D Block-wise (n x n) with p processors
Data decomposition
Overhead: To = p · 4(n/√p) · tcomm_cell ≈ n·√p
Isoefficiency: n^2 ≥ c·n·√p ⇒ n ≥ c·√p
Scalability: M/p = n^2 / p = (c·√p)^2 / p = c^2 = C, perfect scaling

a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1)

Laplace solver: 2D Block-wise (n x n) with p processors
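The 1D-vs-2D comparison can be made concrete by totalling the ghost-cell traffic under the two decompositions (a model sketch with names of our choosing, not a measurement):

```python
from math import sqrt

def ghost_traffic_1d(n, p):
    # 1D row-wise: each of p ranks exchanges 2 ghost rows of n cells.
    return p * 2 * n

def ghost_traffic_2d(n, p):
    # 2D block-wise: each rank exchanges 4 ghost edges of n/sqrt(p) cells.
    return p * 4 * n / sqrt(p)
```

For n = 1024 and p = 64, the 1D total is 4x the 2D total, and the gap widens as p grows: this is the surface-to-volume advantage of block decomposition.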
Data decomposition Matrix vector multiplication: 1D row-wise decomposition
Proc 0: [ 11 12 13 ]       [ 1 ]
Proc 1: [ 14 15 16 ]   x   [ 2 ]
Proc 2: [ 19 18 17 ]       [ 3 ]
Computation: each processor computes n/p elements, n multiplies + (n - 1) adds for each: O(n^2 / p)

Communication: all-gather at the end so each processor has a full copy of the output vector:
log p + Σ_{i=1..log p} 2^(i-1)·(n/p) = log p + n(p - 1)/p
Data decomposition Matrix vector multiplication: 1D row-wise decomposition
Algorithm:
1. Collect the full vector using MPI_Allgather
2. Local matrix multiplication to get the output vector

Wastes much memory.
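A serial sketch of this algorithm's data flow (no MPI; the final concatenation stands in for the MPI_Allgather step, and the example matrix is the one on the slide):

```python
def rowwise_matvec(A, v, p):
    # Each "rank" owns n/p rows and computes its slice of the result;
    # concatenating the slices plays the role of MPI_Allgather.
    n = len(A)
    rows = n // p
    slices = []
    for rank in range(p):
        block = A[rank * rows:(rank + 1) * rows]
        slices.append([sum(aij * vj for aij, vj in zip(row, v))
                       for row in block])
    return [y for s in slices for y in s]  # "all-gather"
```

With p = 1 this degenerates to an ordinary matrix-vector product, which makes the partitioning easy to sanity-check.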
Data decomposition Matrix vector multiplication: 1D row-wise decomposition
Computation: each processor computes n/p elements, n multiplies + (n - 1) adds for each: O(n^2 / p)

Communication: tw·n + ts·log p
Overhead: ts·p·log p + tw·n·p
Data decomposition Matrix vector multiplication: 1D row-wise decomposition
Speedup:
S = p / ( 1 + p·(ts·log p + tw·n) / (tc·n^2) )

Isoefficiency:
n^2 ≥ c·(p log p + n·p) ⇒ n ≥ c·p

Scalability:
M(p) = n^2 / p = c^2·p

Not scalable!
Data decomposition Matrix vector multiplication: 1D column-wise decomposition
Proc 0, Proc 1, Proc 2 own columns 1, 2, 3 of A:
[ 11 12 13 ]       [ 1 ]
[ 14 15 16 ]   x   [ 2 ]
[ 19 18 17 ]       [ 3 ]
Serial Computation?
Parallel Computation?
Overhead?
Isoefficiency?
Scalability?
Data decomposition Matrix vector multiplication: 2D decomposition
3 x 3 processor grid (procs 0-8) with blocks Aij and vector parts vj:
[ A00 A01 A02 ]       [ v0 ]
[ A10 A11 A12 ]   x   [ v1 ]
[ A20 A21 A22 ]       [ v2 ]

Algorithm: uses p processors on a √p x √p grid
Data decomposition Matrix vector multiplication: 2D decomposition
Algorithm Step 0: Copy part-vector to diagonal
Data decomposition Matrix vector multiplication: 2D decomposition
Algorithm Step 1: Broadcast vector along columns
Data decomposition Matrix vector multiplication: 2D decomposition
Each processor (i, j) now holds the local product Aij · vj.
Algorithm Step 2: Local computation on each processor
Data decomposition Matrix vector multiplication: 2D decomposition
Algorithm Step 3: Reduce across rows
[Figure: the row-wise reduction of the Aij · vj products yields the output parts u0, u1, u2.]
Data decomposition Matrix vector multiplication: 2D decomposition
Computation: each processor computes an n^2/p block: ≈ c·n^2/p

Communication: 2·(n/√p)·log(√p) + n/√p ≈ (n/√p)·log p

Overhead: To ≈ p·(n/√p)·log p = n·√p·log p
Data decomposition Matrix vector multiplication: 2D decomposition
Isoefficiency:
n^2 ≥ c·n·√p·log p ⇒ n ≥ c·√p·log p

Scalability:
M(p) = n^2 / p = (c·√p·log p)^2 / p = c^2·(log p)^2
Scales better than 1D !
Let's look at the code
Summary
- Know your algorithm!
- Don't expect the unexpected!
- Pay attention to parallel design and implementation right from the outset. It will save you a lot of labor.