TRANSCRIPT
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 7
Performance Analysis
Learning Objectives
Predict performance of parallel programs
Understand barriers to higher performance
Speedup Formula
Speedup = Sequential execution time / Parallel execution time
Execution Time Components
Inherently sequential computations: σ(n)
Potentially parallel computations: φ(n)
Parallel overhead: κ(n,p)
Communication
Redundant operations
Speedup Expression
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
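Below is a minimal sketch (not from the slides) of how this bound can be evaluated numerically; the function name speedup_bound and the timing values in main are illustrative assumptions.

```c
/* Speedup bound: psi(n,p) <= (sigma + phi) / (sigma + phi/p + kappa).
 * sigma = inherently sequential time, phi = parallelizable time,
 * kappa = parallel overhead (communication, redundant operations). */
#include <stdio.h>

double speedup_bound(double sigma, double phi, double kappa, int p)
{
    return (sigma + phi) / (sigma + phi / p + kappa);
}

int main(void)
{
    /* Hypothetical timings: 5 s sequential, 95 s parallelizable,
     * 2 s of overhead when run on 8 processors. */
    printf("Speedup bound: %.2f\n", speedup_bound(5.0, 95.0, 2.0, 8));
    return 0;
}
```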
Computation Component: φ(n)/p
[Figure: φ(n)/p as a function of the number of processors]
Communication Component: κ(n,p)
[Figure: κ(n,p) as a function of the number of processors]
Combined: φ(n)/p + κ(n,p)
[Figure: φ(n)/p + κ(n,p) as a function of the number of processors]
Speedup Plot: “elbowing out”
[Figure: speedup as a function of the number of processors]
Efficiency

Efficiency = Sequential execution time / (Processors used × Parallel execution time)

Efficiency = Speedup / Processors used
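A small illustrative helper (not from the slides) showing the two equivalent forms of the definition; the timings in main are made up.

```c
#include <stdio.h>

/* Efficiency = sequential time / (processors * parallel time)
 *            = speedup / processors. */
double efficiency(double seq_time, double par_time, int p)
{
    return seq_time / (p * par_time);
}

int main(void)
{
    double seq = 100.0, par = 20.0;   /* hypothetical timings */
    int p = 8;
    double speedup = seq / par;
    printf("speedup = %.2f, efficiency = %.3f (= %.3f)\n",
           speedup, efficiency(seq, par, p), speedup / p);
    return 0;
}
```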
Efficiency: 0 ≤ ε(n,p) ≤ 1

ε(n,p) ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p))

All terms > 0 ⇒ ε(n,p) > 0
Denominator > numerator ⇒ ε(n,p) < 1
Amdahl’s Law
Provides an upper bound on the speedup achievable by applying a certain number of processors to solve the problem in parallel
Can be used to determine the asymptotic speedup achievable as the number of processors increases
Amdahl’s Law
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20×, no matter how many processors are used.
Amdahl’s Law
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
       ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

Let f = σ(n) / (σ(n) + φ(n))

ψ ≤ 1 / (f + (1 − f)/p)
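A hedged one-function sketch (not part of the original slides) of Amdahl’s bound; the sample call reproduces the 95%-parallel, 8-CPU case worked in Example 1 below.

```c
#include <stdio.h>

/* Amdahl's Law: psi <= 1 / (f + (1 - f)/p),
 * where f is the inherently sequential fraction. */
double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    printf("f = 0.05, p = 8: %.1f\n", amdahl_bound(0.05, 8));  /* ~5.9 */
    return 0;
}
```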
Example 1
95% of a program’s execution time occurs inside a loop that can be executed in parallel.
What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9
Example 2
20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?

ψ ≤ lim (p→∞) 1 / (0.2 + (1 − 0.2)/p) = 1/0.2 = 5
Pop Quiz
An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
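A worked answer is not given on the slide; applying Amdahl’s Law with f = 0.2 and p = 8 (assuming the remaining 80% parallelizes perfectly) gives:

ψ ≤ 1 / (0.2 + (1 − 0.2)/8) = 1/0.3 ≈ 3.3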
Pop Quiz
A computer animation program generates a feature movie frame-by-frame. Each frame can be generated independently and is output to its own file. If it takes 99 seconds to render a frame and 1 second to output it, how much speedup can be achieved by rendering the movie on 100 processors?
Limitations of Amdahl’s Law
Ignores κ(n,p)
Overestimates speedup achievable
Amdahl Effect
Typically κ(n,p) has lower complexity than φ(n)/p
As n increases, φ(n)/p dominates κ(n,p)
As n increases, speedup increases
For a fixed number of processors, speedup is usually an increasing function of the problem size
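As a hypothetical illustration (not from the slides), assume σ(n) = n, φ(n) = n², and κ(n,p) = n·log₂ p, so the overhead has lower complexity than φ(n)/p; a quick computation then shows speedup at a fixed p rising with n:

```c
#include <math.h>
#include <stdio.h>

/* Amdahl effect under an assumed cost model: because kappa grows more
 * slowly than phi(n)/p, speedup at a fixed p improves as n increases. */
static double speedup(double n, double p)
{
    double sigma = n, phi = n * n, kappa = n * log2(p);
    return (sigma + phi) / (sigma + phi / p + kappa);
}

int main(void)
{
    const double p = 16.0;                 /* fixed processor count */
    for (double n = 100; n <= 10000; n *= 10)
        printf("n = %5.0f  speedup = %5.2f\n", n, speedup(n, p));
    return 0;   /* speedup rises toward p as n grows */
}
```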
Illustration of Amdahl Effect
[Figure: speedup as a function of the number of processors for n = 100, n = 1,000, and n = 10,000]
Review of Amdahl’s Law
Treats problem size as a constant
Shows how execution time decreases as number of processors increases
Another Perspective
We often use faster computers to solve larger problem instances
Let’s treat time as a constant and allow problem size to increase with number of processors
Gustafson-Barsis’s Law
Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup achievable by this program is

ψ ≤ p + (1 − p)s
Gustafson-Barsis’s Law
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
       ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

Let s = σ(n) / (σ(n) + φ(n)/p)

ψ ≤ p + (1 − p)s
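A minimal sketch (not in the slides) of the scaled-speedup formula; the sample call anticipates Example 1 below (s = 0.03 on 10 processors).

```c
#include <stdio.h>

/* Gustafson-Barsis: psi <= p + (1 - p) * s,
 * where s is the serial fraction of the parallel execution time. */
double scaled_speedup(int p, double s)
{
    return p + (1 - p) * s;
}

int main(void)
{
    printf("p = 10, s = 0.03: %.2f\n", scaled_speedup(10, 0.03));  /* 9.73 */
    return 0;
}
```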
Gustafson-Barsis’s Law
Begin with parallel execution time
Estimate sequential execution time to solve same problem
Problem size is an increasing function of p
Predicts scaled speedup
Example 1
An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?

ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73
Execution on 1 CPU takes 10 times as long…
…except 9 do not have to execute serial code
Example 2
What is the maximum fraction of a program’s parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?

7 = 8 + (1 − 8)s  ⇒  s ≈ 0.14
Pop Quiz
A parallel program executing on 32 processors spends 5% of its time in sequential code. What is the scaled speedup of this program?
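The slide leaves the answer open; a worked check using Gustafson-Barsis’s Law (taking the 5% figure as the serial fraction of the parallel run) gives:

ψ = 32 + (1 − 32)(0.05) = 32 − 1.55 = 30.45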
The Karp-Flatt Metric
Amdahl’s Law and Gustafson-Barsis’s Law ignore κ(n,p)
They can overestimate speedup or scaled speedup
Karp and Flatt proposed another metric
Experimentally Determined Serial Fraction
e = (σ(n) + κ(n,p)) / (σ(n) + φ(n))

Numerator: inherently serial component of the parallel computation + processor communication and synchronization overhead
Denominator: single-processor execution time

Expressed in terms of the measured speedup ψ and processor count p:

e = (1/ψ − 1/p) / (1 − 1/p)
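A small helper (not from the slides) that computes e from a measured speedup and processor count; the sample call uses the p = 8 entry of Example 1 below.

```c
#include <stdio.h>

/* Karp-Flatt experimentally determined serial fraction:
 * e = (1/psi - 1/p) / (1 - 1/p). */
double karp_flatt(double psi, int p)
{
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
}

int main(void)
{
    printf("psi = 4.7 on p = 8: e = %.2f\n", karp_flatt(4.7, 8));  /* ~0.10 */
    return 0;
}
```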
Experimentally Determined Serial Fraction
Takes into account parallel overhead
Detects other sources of overhead or inefficiency ignored in speedup model
Process startup time
Process synchronization time
Imbalanced workload
Architectural overhead
Example 1
p   2     3     4     5     6     7     8
ψ   1.8   2.5   3.1   3.6   4.0   4.4   4.7
e   0.1   0.1   0.1   0.1   0.1   0.1   0.1

What is the primary reason for speedup of only 4.7 on 8 CPUs?
Since e is constant, large serial fraction is the primary reason.
Example 2
p   2      3      4      5      6      7      8
ψ   1.9    2.6    3.2    3.7    4.1    4.5    4.7
e   0.070  0.075  0.080  0.085  0.090  0.095  0.100

What is the primary reason for speedup of only 4.7 on 8 CPUs?
Since e is steadily increasing, overhead is the primary reason.
Pop Quiz
p   4     8     12
ψ   3.9   6.5   ?

Is this program likely to achieve a speedup of 10 on 12 processors?
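The slide poses this as an open question; one hedged way to answer it is to compute e from the measured speedups: e = (1/3.9 − 1/4)/(1 − 1/4) ≈ 0.009 at p = 4 and e = (1/6.5 − 1/8)/(1 − 1/8) ≈ 0.033 at p = 8. A speedup of 10 on 12 processors would require e = (1/10 − 1/12)/(1 − 1/12) ≈ 0.018, which is smaller than the value already observed at p = 8; since e is rising, a speedup of 10 is not likely.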
Isoefficiency Metric
Parallel system: parallel program executing on a parallel computer
Scalability of a parallel system: measure of its ability to increase performance as number of processors increases
A scalable system maintains efficiency as processors are added
Isoefficiency: way to measure scalability
Isoefficiency Derivation Steps
Begin with speedup formula
Compute total amount of overhead
Assume efficiency remains constant
Determine relation between sequential execution time and overhead
Deriving Isoefficiency Relation

Determine overhead:
T₀(n,p) = (p − 1)σ(n) + p·κ(n,p)

Substitute overhead into speedup equation:
ψ(n,p) ≤ p(σ(n) + φ(n)) / (σ(n) + φ(n) + T₀(n,p))

Substitute T(n,1) = σ(n) + φ(n). Assume efficiency is constant, so that C = ε(n,p)/(1 − ε(n,p)) is a constant. Then

T(n,1) ≥ C·T₀(n,p)   (Isoefficiency Relation)
Isoefficiency Relation
In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied, where T(n,1) and CT₀(n,p) are as defined in the book:

T(n,1) ≥ C·T₀(n,p)
Scalability Function
Suppose the isoefficiency relation is n ≥ f(p)
Let M(n) denote memory required for problem of size n
M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency
We call M(f(p))/p the scalability function
Meaning of Scalability Function
To maintain efficiency when increasing p, we must increase n
Maximum problem size limited by available memory, which is linear in p
Scalability function shows how memory usage per processor must grow to maintain efficiency
A constant scalability function means the parallel system is perfectly scalable
Interpreting Scalability Function
[Figure: memory needed per processor as a function of the number of processors, for scalability functions C, C log p, Cp log p, and Cp, showing where efficiency can and cannot be maintained within the available memory size]
Example 1: Reduction
Sequential algorithm complexity: T(n,1) = Θ(n)
Parallel algorithm
Computational complexity = Θ(n/p)
Communication complexity = Θ(log p)
Parallel overhead: T₀(n,p) = Θ(p log p)
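A hedged MPI sketch (not part of the slides) of a global sum whose communication pattern is the Θ(log p) reduction assumed above; the problem size n and the quantity being summed are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Each process sums its own share of n values: Theta(n/p) computation. */
    long n = 1000000;                       /* assumed problem size */
    double local = 0.0;
    for (long i = id; i < n; i += p)
        local += 1.0 / (i + 1);

    /* MPI_Reduce combines the p partial sums in Theta(log p) steps. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (id == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```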
Reduction (continued)
Isoefficiency relation: n ≥ C·p·log p
We ask: to maintain the same level of efficiency, how must n increase when p increases?

M(n) = n

M(C·p·log p)/p = C·p·log p / p = C·log p

The system has good scalability
Example 2: Floyd’s Algorithm
Sequential time complexity: Θ(n³)
Parallel computation time: Θ(n³/p)
Parallel communication time: Θ(n² log p)
Parallel overhead: T₀(n,p) = Θ(p·n² log p)
Floyd’s Algorithm (continued)
Isoefficiency relation:
n³ ≥ C(p·n² log p)  ⇒  n ≥ C·p·log p

M(n) = n²

M(C·p·log p)/p = C²·p²·log²p / p = C²·p·log²p

The parallel system has poor scalability
Example 3: Finite Difference
Sequential time complexity per iteration: Θ(n²)
Parallel communication complexity per iteration: Θ(n/√p)
Parallel overhead: Θ(n·√p)
Finite Difference (continued)
Isoefficiency relation:
n² ≥ C·n·√p  ⇒  n ≥ C·√p

M(n) = n²

M(C·√p)/p = C²·p/p = C²

This algorithm is perfectly scalable
Summary
Performance terms
Speedup
Efficiency
Model of speedup
Serial component
Parallel component
Communication component
Summary
What prevents linear speedup?
Serial operations
Communication operations
Process start-up
Imbalanced workloads
Architectural limitations
Summary
Analyzing parallel performance
Amdahl’s Law
Gustafson-Barsis’s Law
Karp-Flatt metric
Isoefficiency metric
Amdahl’s Law
Based on the assumption that we are trying to solve a problem of fixed size as quickly as possible
Provides an upper bound on the speedup achievable by applying a certain number of processors to solve the problem in parallel
Can be used to determine the asymptotic speedup achievable as the number of processors increases
Gustafson-Barsis’s Law
Time is treated as a constant and we let the problem size increase with the number of processors
It begins with a parallel computation and estimates how much faster the parallel computation is than the same computation executing on a single processor
Called scaled speedup
Karp-Flatt Metric
Takes into account parallel overhead
Helps detect other sources of overhead or inefficiency
For a fixed problem size, the efficiency of a parallel computation typically decreases as the number of processors increases
By using the experimentally determined serial fraction, we can determine whether the efficiency decrease is due to:
Limited opportunities for parallelism
Increases in algorithmic or architectural overhead