
Performance Metrics, Prediction, and Measurement

CPS343

Parallel and High Performance Computing

Spring 2020

CPS343 (Parallel and HPC) Performance Metrics, Prediction, and Measurement Spring 2020 1 / 33


Outline

1. Analyzing Parallel Programs
   - Speedup and Efficiency
   - Amdahl’s Law and Gustafson-Barsis’s Law
   - Evaluating Parallel Algorithms



Speedup

Speedup is defined as

Speedup on N processors = (sequential execution time) / (execution time on N processors) = t_seq / t_par

Generally we want to use execution times obtained using the best available algorithm.

The best algorithm for a sequential program may be different from the best algorithm for a parallel program.


Efficiency

Efficiency is defined as

Efficiency = speedup / N = t_seq / (N · t_par)

Given this definition, we expect

0 ≤ efficiency ≤ 1
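These two definitions are simple enough to capture directly; a minimal sketch in Python (the function names and timings are mine, purely for illustration):

```python
def speedup(t_seq, t_par):
    """Speedup: sequential execution time over parallel execution time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, n_procs):
    """Efficiency: speedup divided by the number of processors used."""
    return speedup(t_seq, t_par) / n_procs

# e.g. a 100 s sequential run that takes 30 s on 4 processors
print(speedup(100.0, 30.0))        # ~3.33
print(efficiency(100.0, 30.0, 4))  # ~0.83, within [0, 1]
```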


Speedup

We can write ψ(n,N) for speedup to show it is a function of both problem size n and number of processors N.

Definitions:

σ(n): time required for computation that is not parallelizable
ϕ(n): time required for computation that is parallelizable
κ(n,N): time for parallel overhead (communication, barriers, etc.)

Note that

t_seq = σ(n) + ϕ(n)
t_par = σ(n) + ϕ(n)/N + κ(n,N)

Using these parameters, the speedup ψ(n,N) is given by:

ψ(n,N) = (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/N + κ(n,N))
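As a sketch, this σ/ϕ/κ decomposition can be expressed directly in code; the toy timing functions below are made-up illustrations, not measurements:

```python
import math

def model_speedup(sigma, phi, kappa, n, N):
    """psi(n, N) from t_seq = sigma(n) + phi(n) and
    t_par = sigma(n) + phi(n)/N + kappa(n, N)."""
    t_seq = sigma(n) + phi(n)
    t_par = sigma(n) + phi(n) / N + kappa(n, N)
    return t_seq / t_par

# toy model: fixed serial part, parallel part linear in n,
# overhead growing slowly with the processor count
sigma = lambda n: 1.0
phi = lambda n: n / 1000.0
kappa = lambda n, N: 0.01 * math.log2(N)

print(model_speedup(sigma, phi, kappa, n=100_000, N=8))
```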


Speedup

Since κ(n,N) ≥ 0, if it is dropped (i.e., assume there is no parallel overhead), we obtain an upper bound on speedup

ψ(n,N) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/N)

Note: we will often not explicitly write n, so it is common to see ψ(N) for speedup for a given problem size (i.e., fixed n).


Outline

1. Analyzing Parallel Programs
   - Speedup and Efficiency
   - Amdahl’s Law and Gustafson-Barsis’s Law
   - Evaluating Parallel Algorithms


Amdahl’s Law

First appearing in a paper by Gene Amdahl in 1967, this provides an upper bound on achievable speedup based on the fraction of computation that can be done in parallel.

T is the time needed for an application to execute on a single CPU.

α is the fraction of the computation that can be done in parallel...

...so 1− α is the fraction that must be carried out on a single CPU.

If we ignore parallel overhead we have

ψ(N) = t_seq / t_par ≤ T / ((1 − α)T + α(T/N))


Amdahl’s Law

Simplifying we have

ψ(N) = t_seq / t_par ≤ 1 / ((1 − α) + α/N)

This is Amdahl’s Law. Note that it says something rather discouraging: even as N → ∞ (i.e., the number of processors increases without bound), speedup is bounded by

ψ ≤ 1 / (1 − α).

For example, if 90% of the computation can be parallelized so α = 0.9, the speedup cannot be larger than 1/0.1 = 10, regardless of the number of processors!
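A one-line sketch of the bound (the function name is mine):

```python
def amdahl_bound(alpha, N):
    """Amdahl's upper bound on speedup for parallel fraction alpha on N processors."""
    return 1.0 / ((1.0 - alpha) + alpha / N)

# with alpha = 0.9 the bound only creeps toward 1/(1 - 0.9) = 10
for N in (8, 64, 1024):
    print(N, round(amdahl_bound(0.9, N), 2))
```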


Amdahl’s Law speedup prediction


Amdahl’s Law efficiency prediction


Amdahl’s Law example

Suppose a serial program reads n data from a file, performs some computation, and then writes n data back out to another file. The I/O time is measured and found to be 4500 + n µsec. If the computation portion takes n²/200 µsec, what is the maximum speedup we can expect when n = 10,000 and N processors are used?

Assume that the I/O must be done serially but that the computation can be parallelized. Computing α we find

α = (n²/200) / ((4500 + n) + n²/200) = 500000 / (4500 + 10000 + 500000) = 5000/5145 ≈ 0.97182

so, by Amdahl’s Law,

ψ ≤ 1 / ((1 − 5000/5145) + (5000/5145)/N) = 5145 / (145 + 5000/N)

This gives a maximum speedup of 6.68 on 8 processors and 11.25 on 16 processors.
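Plugging the example’s α into the bound confirms these numbers (a sketch; `example_bound` is my name for it):

```python
def example_bound(N):
    """Amdahl bound for the I/O example, where alpha = 5000/5145."""
    alpha = 5000 / 5145
    return 1.0 / ((1.0 - alpha) + alpha / N)

print(round(example_bound(8), 2))   # ~6.68
print(round(example_bound(16), 2))  # ~11.25
```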



The Gustafson-Barsis Law

Amdahl’s Law reflects a particular perspective:

“We have a sequential program and want to figure out what speedup is attainable by parallelizing as much of it as possible.”

It turns out that focusing on parallelizing a sequential program is not a particularly good way to estimate how much speedup is possible.

This is because parallel implementations often look very different from a sequential implementation of the same program.

A more useful characterization comes from considering the problem the other way around:

“We have a parallel program and want to figure out how much faster it is than a sequential program doing the same work.”


The Gustafson-Barsis Law

Amdahl’s law focuses on speedup as a function of increasing the number of processors; i.e., “how much faster can we get a fixed amount of work done using N processors?” Sometimes a better question is “how much more work can we get done in a fixed amount of time using N processors?”

Let T be the total time a parallel program requires when using N processors. As before, let 0 ≤ α ≤ 1 be the fraction of execution time the program spends executing code in parallel. Then

t_seq = (1 − α)T + α·T·N

so

ψ ≤ t_seq / t_par = ((1 − α)T + α·T·N) / T = (1 − α) + αN

This is the Gustafson-Barsis Law (1988).

The speedup estimate it produces is sometimes called scaled speedup.
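A sketch of the scaled-speedup formula (the function name is mine):

```python
def scaled_speedup(alpha, N):
    """Gustafson-Barsis scaled speedup: (1 - alpha) + alpha * N."""
    return (1.0 - alpha) + alpha * N

# unlike Amdahl's bound, this grows without limit as N increases
for N in (8, 64, 1024):
    print(N, scaled_speedup(0.9, N))
```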


Gustafson-Barsis Law speedup prediction

This is much more encouraging than what Amdahl’s Law showed us.


Gustafson-Barsis Law efficiency prediction

Again, this is much more encouraging!


Gustafson-Barsis’s Law example

A parallel program takes 134 seconds to run on 32 processors. The total time spent in the sequential part of the program was 12 seconds. What is the scaled speedup?

Here α = (134 − 12)/134 = 122/134, so the scaled speedup is

(1 − α) + αN = (1 − 122/134) + (122/134)·32 ≈ 29.224

This means that the program is running approximately 29 times faster than it would run on one processor..., assuming it could run on one processor.
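Checking the arithmetic in this example (a sketch; α comes from the 12 s serial portion of the 134 s run):

```python
T, t_serial = 134.0, 12.0    # total parallel run time and its serial part
alpha = (T - t_serial) / T   # fraction of time spent in parallel code
N = 32
scaled = (1.0 - alpha) + alpha * N
print(round(scaled, 3))      # ~29.224
```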


The laws compared

Amdahl’s Law approximately suggests: Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city. Since it has already taken you 1 hour and you only have a distance of 60 miles total, going infinitely fast you would only achieve 60 mph.

Gustafson-Barsis’s Law approximately suggests: Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car’s average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour.

(These analogies were taken from a past version of the Wikipedia page for Gustafson’s Law but no longer appear there as of Jan 2018.)


Outline

1. Analyzing Parallel Programs
   - Speedup and Efficiency
   - Amdahl’s Law and Gustafson-Barsis’s Law
   - Evaluating Parallel Algorithms


Speedup revisited

The above metrics ignore communication and other parallel overhead. When evaluating an approach to parallelizing a task, we should include estimates of communication costs, since these can dominate parallel program time.

As before, we take t_seq to be the time for a sequential version of the program and t_par to be the time for the parallel algorithm on N processors. Then speedup is

ψ = t_seq / t_par.


Parallel Execution Time

Parallel execution time t_par can be broken down into two parts, computation time t_comp and communication time t_comm.

t_par = t_comp + t_comm

Speedup is then

ψ = t_seq / t_par = t_seq / (t_comp + t_comm)

The computation/communication ratio is t_comp / t_comm.


Communication Time

Typically the time for communication can be broken down into two parts: the time t_startup necessary for building the message and initiating the transfer, and the time t_data required per data item in the message. As a first approximation this looks like

t_comm = t_startup + m · t_data

where m is the number of data items sent.
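This linear message-cost model is easy to sketch in code; the default constants below are the Fall 2010 measurements reported later in these slides, used here purely as illustrative values:

```python
def t_comm(m, t_startup=88.25, t_data=0.0415):
    """Estimated message time in usec for m data items,
    using the linear model t_comm = t_startup + m * t_data."""
    return t_startup + m * t_data

print(t_comm(1))       # tiny message: startup (latency) cost dominates
print(t_comm(100000))  # large message: per-item (bandwidth) cost dominates
```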


A communication timing experiment

The values of t_startup and t_data for a parallel cluster can be determined empirically.

A test program sends messages ranging in length from 100 to 10,000 integers between two nodes A and B.

Each message is sent from node A and received by node B, and then sent back and received by node A.

This is repeated 100 times.

Linear regression is used to fit a line to the timing data.

Using the workstation cluster we had in Fall 2010 (just before we moved into KOSC), we determined t_startup = 88.25 µsec and t_data = 0.0415 µsec per integer, which corresponds to 96.43 MB/s (assuming 4 bytes per integer).
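The fitting step can be sketched with an ordinary least-squares line; the timing samples below are fabricated to mimic the Fall 2010 constants, so only the procedure (not the data) reflects the real experiment:

```python
import random

rng = random.Random(0)
sizes = [100, 1000, 5000, 10000]  # message lengths in integers
# fabricated times (usec): true model plus a little noise
times = [88.25 + 0.0415 * m + rng.gauss(0, 1) for m in sizes]

# least-squares fit of time vs. size: slope ~ t_data, intercept ~ t_startup
n = len(sizes)
mx = sum(sizes) / n
my = sum(times) / n
t_data = sum((x - mx) * (y - my) for x, y in zip(sizes, times)) \
         / sum((x - mx) ** 2 for x in sizes)
t_startup = my - t_data * mx
bandwidth = 4 / t_data  # 4-byte integers; usec per byte inverts to MB/s
print(t_startup, t_data, bandwidth)
```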


Workstation cluster timing data (Fall 2010)


A communication timing experiment

Repeating the experiment in Spring 2016, we found t_startup = 121.425 µsec and t_data = 0.03685 µsec per integer, which corresponds to 108.544 MB/s (assuming 4 bytes per integer).

However, the performance curve was surprising...


Workstation cluster timing data (Spring 2016)


Minor Prophets cluster latency and bandwidth (2016)


Canaan cluster latency and bandwidth (2016)


LittleFe cluster latency and bandwidth (2013)


Cluster latency and bandwidth comparison

(Plots for the Minor Prophets and Canaan clusters)


Error bars show 1 standard deviation in the data values averaged to produce the plots.


Acknowledgements

Some material used in creating these slides comes from

Multicore and GPU Programming: An Integrated Approach, Gerassimos Barlas, Morgan Kaufmann/Elsevier, 2015.

Introduction to High Performance Scientific Computing, Victor Eijkhout, 2015.

Parallel Programming in C with MPI and OpenMP, Michael Quinn, McGraw-Hill, 2004.
