Performance and Scalability Class Ppt

Principles of Scalable Performance


Page 1: Performance and Scalability Class Ppt

Principles of Scalable Performance

Page 2: Performance and Scalability Class Ppt

EENG-630 Chapter 3

Principles of Scalable Performance

- Performance measures
- Speedup laws
- Scalability principles
- Scaling up vs. scaling down

Page 3: Performance and Scalability Class Ppt

Performance metrics and measures

- Parallelism profiles
- Asymptotic speedup factor
- System efficiency, utilization, and quality
- Standard performance measures

Page 4: Performance and Scalability Class Ppt

Degree of parallelism

- Reflects the matching of software and hardware parallelism
- Discrete time function: measures, for each time period, the number of processors used
- Parallelism profile is a plot of the DOP as a function of time
- Ideally, unlimited resources are assumed

Page 5: Performance and Scalability Class Ppt

Degree of Parallelism

The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.

DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.

A plot of DOP vs. time is called a parallelism profile.

Page 6: Performance and Scalability Class Ppt

Factors affecting parallelism profiles

- Algorithm structure
- Program optimization
- Resource utilization
- Run-time conditions
- Realistically limited by the number of available processors, memory, and other non-processor resources

Page 7: Performance and Scalability Class Ppt

Average Parallelism - 1

Assume the following:

- n homogeneous processors
- maximum parallelism in a profile is m
- Ideally, n >> m
- Δ, the computing capacity of a single processor, is something like MIPS or Mflops without regard for memory latency, etc.
- i is the number of processors busy in an observation period (e.g., DOP = i)
- W is the total work (instructions or computations) performed by a program
- A is the average parallelism in the program

Page 8: Performance and Scalability Class Ppt

Average Parallelism - 2

W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t) \, dt

W = \Delta \sum_{i=1}^{m} i \cdot t_i

where t_i = total time that DOP = i, and \sum_{i=1}^{m} t_i = t_2 - t_1.

Total amount of work performed is proportional to the area under the profile curve.

Page 9: Performance and Scalability Class Ppt

Average Parallelism - 3

A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t) \, dt

A = \left( \sum_{i=1}^{m} i \cdot t_i \right) \bigg/ \left( \sum_{i=1}^{m} t_i \right)
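To make the definitions above concrete, here is a minimal Python sketch (not from the slides) that computes the total work W and average parallelism A from a discrete parallelism profile; the profile values and Δ = 1 are invented for illustration.

```python
# Minimal sketch: total work W and average parallelism A from a
# discrete parallelism profile. Profile values are invented; Delta = 1.

# profile[i] = t_i, the total time during which DOP == i
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}  # hypothetical t_i values
delta = 1.0  # computing capacity of a single processor

# W = Delta * sum_i (i * t_i)
W = delta * sum(i * t_i for i, t_i in profile.items())

# A = sum_i (i * t_i) / sum_i (t_i)
A = sum(i * t_i for i, t_i in profile.items()) / sum(profile.values())

print(W, A)  # 26.0 2.6
```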

Page 10: Performance and Scalability Class Ppt

Example: parallelism profile and average parallelism

Page 11: Performance and Scalability Class Ppt

Available Parallelism

Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g., hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g., 10 or 20).

Page 12: Performance and Scalability Class Ppt

Basic Blocks

A basic block is a sequence or block of instructions with one entry and one exit. Basic blocks are frequently used as the focus of optimizers in compilers (since it is easier to manage the use of registers utilized in the block). Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).

Page 13: Performance and Scalability Class Ppt

Asymptotic Speedup - 1

W_i = \Delta \, i \, t_i \quad (work done when DOP = i)

W = \sum_{i=1}^{m} W_i \quad (relates sum of W_i terms to W)

t_i(k) = \frac{W_i}{k \Delta} \quad (execution time of W_i with k processors)

t_i(\infty) = \frac{W_i}{i \Delta} \quad (for 1 \le i \le m)

Page 14: Performance and Scalability Class Ppt

Asymptotic Speedup - 2

T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta} \quad (response time with 1 processor)

T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i \Delta} \quad (response time with \infty processors)

S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i} = A \quad (in the ideal case)
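Continuing the same invented profile, a short sketch confirming that the asymptotic speedup equals the average parallelism:

```python
# Sketch (invented profile): asymptotic speedup S_inf = T(1) / T(inf).
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}  # hypothetical t_i values
delta = 1.0

W = {i: delta * i * t_i for i, t_i in profile.items()}  # W_i per DOP level

T1 = sum(W_i / delta for W_i in W.values())             # T(1)
T_inf = sum(W_i / (i * delta) for i, W_i in W.items())  # T(infinity)

print(T1 / T_inf)  # 2.6, identical to the average parallelism A
```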

Page 15: Performance and Scalability Class Ppt

Performance measures

Consider n processors executing m programs in various modes. We want to define the mean performance of these multimode computers:

- Arithmetic mean performance
- Geometric mean performance
- Harmonic mean performance

Page 16: Performance and Scalability Class Ppt

Mean Performance Calculation

We seek to obtain a measure that characterizes the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g., scalar, vector, sequential, parallel). We may also wish to associate weights with these programs to emphasize these different modes and yield a more meaningful performance measure.

Page 17: Performance and Scalability Class Ppt

Arithmetic mean performance

R_a = \frac{1}{m} \sum_{i=1}^{m} R_i \quad (arithmetic mean execution rate; assumes equal weighting)

R_a^* = \sum_{i=1}^{m} f_i R_i \quad (weighted arithmetic mean execution rate)

- proportional to the sum of the inverses of execution times

Page 18: Performance and Scalability Class Ppt

Geometric mean performance

R_g = \prod_{i=1}^{m} R_i^{1/m} \quad (geometric mean execution rate)

R_g^* = \prod_{i=1}^{m} R_i^{f_i} \quad (weighted geometric mean execution rate)

- does not summarize the real performance since it does not have the inverse relation with the total time

Page 19: Performance and Scalability Class Ppt

Harmonic mean performance

T_i = 1/R_i \quad (mean execution time per instruction for program i)

T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i} \quad (arithmetic mean execution time per instruction)

Page 20: Performance and Scalability Class Ppt

Harmonic mean performance

R_h = \frac{1}{T_a} = \frac{m}{\sum_{i=1}^{m} (1/R_i)} \quad (harmonic mean execution rate)

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)} \quad (weighted harmonic mean execution rate)

- corresponds to the total number of operations divided by the total time (closest to the real performance)

Page 21: Performance and Scalability Class Ppt

Geometric Mean

A geometric mean of n terms is the nth root of the product of the n terms. Like the arithmetic mean, the geometric mean of a set of execution rates does not have an inverse relationship with the total execution time of the programs. (Geometric mean has been advocated for use with normalized performance numbers for comparison with a reference machine.)

Page 22: Performance and Scalability Class Ppt

Harmonic Mean

Instead of using arithmetic or geometric mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing the inverse relation not exhibited by the other means).

R_h = \frac{m}{\sum_{i=1}^{m} (1/R_i)}

Page 23: Performance and Scalability Class Ppt

Weighted Harmonic Mean

If we associate weights fi with the benchmarks, then we can compute the weighted harmonic mean:

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)}
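To tie the three means together, here is a small sketch; the rates R_i and weights f_i are invented for illustration.

```python
import math

# Sketch: mean execution rates for hypothetical per-program rates R_i
# (e.g., Mflops) and weights f_i that sum to 1.
R = [10.0, 100.0, 1000.0]
f = [0.5, 0.3, 0.2]

R_a = sum(R) / len(R)                                # arithmetic mean rate
R_g = math.prod(R) ** (1 / len(R))                   # geometric mean rate
R_h = len(R) / sum(1 / r for r in R)                 # harmonic mean rate
R_h_star = 1 / sum(fi / ri for fi, ri in zip(f, R))  # weighted harmonic

print(R_a, R_g, R_h, R_h_star)
# Only the harmonic means preserve the inverse relation with total time.
```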

Page 24: Performance and Scalability Class Ppt

Weighted Harmonic Mean Speedup

T_1 = 1/R_1 = 1 is the sequential execution time on a single processor with rate R_1 = 1.

T_i = 1/R_i = 1/i is the execution time using i processors with a combined execution rate of R_i = i.

Now suppose a program has n execution modes with associated weights f_1 … f_n. The weighted harmonic mean speedup is defined as:

S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}

where T^* = 1/R_h^* is the weighted arithmetic mean execution time.

Page 25: Performance and Scalability Class Ppt

Harmonic Mean Speedup Performance

Page 26: Performance and Scalability Class Ppt

Amdahl’s Law

Assume R_i = i, and that the weights are w = (α, 0, …, 0, 1 − α).

Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 − α). This yields the speedup equation known as Amdahl’s law:

S_n = \frac{n}{1 + (n - 1)\alpha}

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
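In code, Amdahl’s law is a one-liner; this sketch also evaluates the two examples that follow.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's law: S_n = n / (1 + (n - 1) * alpha), where alpha is the
    sequential fraction; equivalently 1 / (alpha + (1 - alpha) / n)."""
    return n / (1 + (n - 1) * alpha)

print(amdahl_speedup(0.05, 8))      # ~5.9 (Example 1 below)
print(amdahl_speedup(0.05, 10**6))  # ~20, approaching 1/alpha (Example 2)
```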

Page 27: Performance and Scalability Class Ppt

Illustration of Amdahl Effect

[Figure: speedup vs. number of processors for problem sizes n = 100, n = 1,000, and n = 10,000, illustrating the Amdahl effect.]

Page 28: Performance and Scalability Class Ppt

Example 1

95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

S \le \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9

Page 29: Performance and Scalability Class Ppt

System Efficiency – 1

Assume the following definitions:

- O(n) = total number of “unit operations” performed by an n-processor system in completing a program P.
- T(n) = execution time required to execute the program P on an n-processor system.

O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor. If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).

Page 30: Performance and Scalability Class Ppt

Example 2

5% of a parallel program’s execution time is spent within inherently sequential code. The maximum speedup achievable by this program, regardless of how many PEs are used, is

\lim_{p \to \infty} \frac{1}{0.05 + (1 - 0.05)/p} = \frac{1}{0.05} = 20

Page 31: Performance and Scalability Class Ppt

Pop Quiz

An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?

Answer: 1/(0.2 + (1 − 0.2)/8) ≈ 3.3

Page 32: Performance and Scalability Class Ppt

System Efficiency – 2

Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as

S(n) = T(1) / T(n)

Recall that we expect T(n) < T(1), so S(n) ≥ 1. System efficiency is defined as

E(n) = S(n) / n = T(1) / (n · T(n))

It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

Page 33: Performance and Scalability Class Ppt

Redundancy

The redundancy in a parallel computation is defined as

R(n) = O(n) / O(1)

What values can R(n) obtain?

- R(n) = 1 when O(n) = O(1), or when the number of operations performed is independent of the number of processors, n. This is the ideal case.
- R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without having extra operations performed.

Page 34: Performance and Scalability Class Ppt

System Utilization

System utilization is defined as

U(n) = R(n) · E(n) = O(n) / (n · T(n))

It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n.

1/n ≤ E(n) ≤ U(n) ≤ 1
1 ≤ R(n) ≤ 1/E(n) ≤ n

Page 35: Performance and Scalability Class Ppt

Quality of Parallelism

The quality of a parallel computation is defined as

Q(n) = \frac{S(n) \, E(n)}{R(n)} = \frac{T^3(1)}{n \, T^2(n) \, O(n)}

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R). The quality measure is bounded by the speedup (that is, Q(n) ≤ S(n)).
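A sketch tying the five metrics together from hypothetical measurements, assuming O(1) = T(1) as defined earlier:

```python
# Sketch: S, E, R, U, Q from hypothetical measured values.
def parallel_metrics(T1: float, Tn: float, O1: float, On: float, n: int):
    S = T1 / Tn    # speedup      S(n) = T(1) / T(n)
    E = S / n      # efficiency   E(n) = S(n) / n
    R = On / O1    # redundancy   R(n) = O(n) / O(1)
    U = R * E      # utilization  U(n) = R(n) * E(n)
    Q = S * E / R  # quality      Q(n) = S(n) * E(n) / R(n)
    return S, E, R, U, Q

# Invented numbers: 4 processors, 20% extra operations in the parallel run.
print(parallel_metrics(T1=100.0, Tn=30.0, O1=100.0, On=120.0, n=4))
```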

Page 36: Performance and Scalability Class Ppt

Standard Industry Performance Measures

MIPS and Mflops, while easily understood, are poor measures of system performance, since their interpretation depends on machine clock cycles and instruction sets. For example, which of these machines is faster?

- a 10 MIPS CISC computer
- a 20 MIPS RISC computer

It is impossible to tell without knowing more details about the instruction sets on the machines. Even the question, “which machine is faster,” is suspect, since we really need to say “faster at doing what?”

Page 37: Performance and Scalability Class Ppt

Doing What?

To answer the “doing what?” question, several standard programs are frequently used.

The Dhrystone benchmark uses no floating point instructions, system calls, or library functions. It uses exclusively integer data items. Each execution of the entire set of high-level language statements is a Dhrystone, and a machine is rated as having a performance of some number of Dhrystones per second (sometimes reported as KDhrystones/sec).

The Whetstone benchmark uses a more complex program involving floating point and integer data, arrays, subroutines with parameters, conditional branching, and library functions. It does not, however, contain any obviously vectorizable code.

The performance of a machine on these benchmarks depends in large measure on the compiler used to generate the machine language. [Some companies have, in the past, actually “tweaked” their compilers to specifically deal with the benchmark programs!]

Page 38: Performance and Scalability Class Ppt

What’s VAX Got To Do With It?

The Digital Equipment VAX-11/780 computer has for many years been commonly agreed to be a 1-MIPS machine (whatever that means). Since the VAX-11/780 also has a rating of about 1.7 KDhrystones, this gives a method whereby a relative MIPS rating for any other machine can be derived: just run the Dhrystone benchmark on the other machine, divide by 1.7K, and you then obtain the relative MIPS rating for that machine (sometimes also called VUPs, or VAX units of performance).

Page 39: Performance and Scalability Class Ppt

Other Measures

Transactions per second (TPS) is a measure that is appropriate for online systems like those used to support ATMs, reservation systems, and point-of-sale terminals. The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

KLIPS is the measure of the number of logical inferences per second that can be performed by a system, presumably to relate how well that system will perform at certain AI applications. Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.

Page 40: Performance and Scalability Class Ppt

Parallel Processing Applications

- Drug design
- High-speed civil transport
- Ocean modeling
- Ozone depletion research
- Air pollution
- Digital anatomy

Page 41: Performance and Scalability Class Ppt

Application Models for Parallel Computers

- Fixed-load model: constant workload
- Fixed-time model: demands constant program execution time
- Fixed-memory model: limited by the memory bound

Page 42: Performance and Scalability Class Ppt

Algorithm Characteristics

- Deterministic vs. nondeterministic
- Computational granularity
- Parallelism profile
- Communication patterns and synchronization requirements
- Uniformity of operations
- Memory requirement and data structures

Page 43: Performance and Scalability Class Ppt

Isoefficiency Concept

Relates the workload to the machine size n needed to maintain a fixed efficiency:

E = \frac{w(s)}{w(s) + h(s, n)}

where w(s) is the workload and h(s, n) is the overhead.

The smaller the power of n in the resulting workload-growth function, the more scalable the system.

Page 44: Performance and Scalability Class Ppt

Isoefficiency Function

To maintain a constant E, w(s) should grow in proportion to h(s, n):

w(s) = \frac{E}{1 - E} \, h(s, n)

C = E/(1 − E) is constant for a fixed E, giving the isoefficiency function:

f_E(n) = C \cdot h(s, n)
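As an illustration, here is a sketch of the isoefficiency function for a hypothetical overhead h(s, n) = n log₂ n; the overhead form is an assumption, not taken from the slides.

```python
import math

def isoefficiency_workload(E: float, n: int) -> float:
    """Workload w(s) needed to hold efficiency at E on n processors,
    assuming a hypothetical overhead h(s, n) = n * log2(n)."""
    C = E / (1 - E)  # constant for fixed E
    return C * n * math.log2(n)

for n in (2, 4, 8, 16):
    print(n, isoefficiency_workload(E=0.8, n=n))
# The workload must grow like n log n to keep E = 0.8 constant here.
```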

Page 45: Performance and Scalability Class Ppt

The Isoefficiency Metric (Terminology)

- Parallel system: a parallel program executing on a parallel computer
- Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Isoefficiency: a way to measure scalability

Page 46: Performance and Scalability Class Ppt

Notation Needed for the Isoefficiency Relation

- n : data size
- p : number of processors
- T(n, p) : execution time, using p processors
- ψ(n, p) : speedup
- σ(n) : inherently sequential computations
- φ(n) : potentially parallel computations
- κ(n, p) : communication operations
- ε(n, p) : efficiency

Note: At least in some printings of Quinn’s textbook, there appears to be a misprint on page 170, where one of these Greek symbols is sometimes replaced with another; substitute the intended symbol from the notation above.

Page 47: Performance and Scalability Class Ppt

Isoefficiency Concepts

T_0(n, p) is the total time spent by processes doing work not done by the sequential algorithm:

T_0(n, p) = (p - 1) \, \sigma(n) + p \, \kappa(n, p)

We want the algorithm to maintain a constant level of efficiency as the data size n increases; hence, ε(n, p) is required to be a constant. Recall that T(n, 1) represents the sequential execution time.

Page 48: Performance and Scalability Class Ppt

The Isoefficiency Relation

Suppose a parallel system exhibits efficiency ε(n, p). Define

T_0(n, p) = (p - 1) \, \sigma(n) + p \, \kappa(n, p)

C = \frac{\varepsilon(n, p)}{1 - \varepsilon(n, p)}

In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

T(n, 1) \ge C \, T_0(n, p)

Page 49: Performance and Scalability Class Ppt

Speedup Performance Laws

- Amdahl’s law: for fixed workload or fixed problem size
- Gustafson’s law: for scaled problems (problem size increases with increased machine size)
- Speedup model: for scaled problems bounded by memory capacity

Page 50: Performance and Scalability Class Ppt

Amdahl’s Law

- As the number of processors increases, the fixed load is distributed to more processors
- Minimal turnaround time is the primary goal
- Speedup factor is upper-bounded by a sequential bottleneck
- Two cases: DOP ≥ n and DOP < n

Page 51: Performance and Scalability Class Ppt

Fixed Load Speedup Factor

Case 1: DOP ≥ n. Case 2: DOP < n.

t_i(n) = \frac{W_i}{i \Delta} \left\lceil \frac{i}{n} \right\rceil

T(n) = \sum_{i=1}^{m} \frac{W_i}{i \Delta} \left\lceil \frac{i}{n} \right\rceil

S_n = \frac{T(1)}{T(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{n} \right\rceil}
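A sketch of the fixed-load speedup over an invented work profile (Δ cancels out of the ratio):

```python
import math

# Sketch: fixed-load speedup S_n = sum(W_i) / sum((W_i / i) * ceil(i / n)).
def fixed_load_speedup(W: dict, n: int) -> float:
    T1 = sum(W.values())
    Tn = sum((W_i / i) * math.ceil(i / n) for i, W_i in W.items())
    return T1 / Tn

W = {1: 4.0, 2: 6.0, 4: 8.0, 8: 8.0}  # invented W_i values per DOP level
for n in (1, 2, 4, 8):
    print(n, fixed_load_speedup(W, n))  # rises toward A = 2.6 as n grows
```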

Page 52: Performance and Scalability Class Ppt

Gustafson’s Law

- With Amdahl’s Law, the workload cannot scale to match the available computing power as n increases
- Gustafson’s Law fixes the time, allowing the problem size to increase with higher n
- Not saving time, but increasing accuracy

Page 53: Performance and Scalability Class Ppt

Fixed-time Speedup

As the machine size increases, we have an increased workload and a new profile. In general, W_i' > W_i for 2 ≤ i ≤ m' and W_1' = W_1.

Assume T(1) = T’(n)

Page 54: Performance and Scalability Class Ppt

Gustafson’s Scaled Speedup

The fixed-time condition (assuming T(1) = T'(n) and negligible overhead Q(n)) requires W_1' = W_1 and W_n' = n W_n, giving the scaled speedup:

S_n' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i} = \frac{W_1 + n W_n}{W_1 + W_n}
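A sketch of the two-mode case; the workload split is invented and the overhead Q(n) is ignored.

```python
def gustafson_speedup(W1: float, Wn: float, n: int) -> float:
    """Fixed-time scaled speedup S'_n = (W1 + n * Wn) / (W1 + Wn),
    ignoring overhead Q(n)."""
    return (W1 + n * Wn) / (W1 + Wn)

# With a 5% sequential component, speedup grows nearly linearly in n.
print(gustafson_speedup(W1=0.05, Wn=0.95, n=8))  # 7.65
```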

Page 55: Performance and Scalability Class Ppt

Memory Bounded Speedup Model

- Idea is to solve the largest problem, limited by memory space
- Results in a scaled workload and higher accuracy
- Each node can handle only a small subproblem for distributed memory
- Using a large number of nodes collectively increases the memory capacity proportionally

Page 56: Performance and Scalability Class Ppt

Fixed-Memory Speedup

Let M be the memory requirement per node and W the computational workload: W = g(M), or M = g^{-1}(W).

The scaled workload is W_n^* = g^*(nM) = G(n) \, g(M) = G(n) \, W_n, where the factor G(n) reflects the workload growth as the memory capacity increases n times.

Assuming W_1^* = W_1 and negligible overhead Q(n), the memory-bounded speedup is

S_n^* = \frac{W_1^* + W_n^*}{W_1^* + W_n^* / n} = \frac{W_1 + G(n) W_n}{W_1 + G(n) W_n / n}
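A sketch of the memory-bounded model; varying G(n) reproduces the cases related on the next slide. The two-mode workload values are invented.

```python
def memory_bounded_speedup(W1: float, Wn: float, n: int, G: float) -> float:
    """Memory-bounded speedup S*_n = (W1 + G*Wn) / (W1 + G*Wn/n),
    ignoring overhead."""
    return (W1 + G * Wn) / (W1 + G * Wn / n)

W1, Wn, n = 0.05, 0.95, 8  # invented two-mode workload
print(memory_bounded_speedup(W1, Wn, n, G=1))       # G(n) = 1: Amdahl (~5.9)
print(memory_bounded_speedup(W1, Wn, n, G=n))       # G(n) = n: Gustafson (7.65)
print(memory_bounded_speedup(W1, Wn, n, G=n**1.5))  # G(n) > n: higher still
```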

Page 57: Performance and Scalability Class Ppt

Relating Speedup Models

G(n) reflects the increase in workload as memory increases n times.

- G(n) = 1 : fixed problem size (Amdahl)
- G(n) = n : workload increases n times when memory is increased n times (Gustafson)
- G(n) > n : workload increases faster than the memory requirement

Page 58: Performance and Scalability Class Ppt

Scalability Metrics

- Machine size (n) : number of processors
- Clock rate (f) : determines the basic machine cycle
- Problem size (s) : amount of computational workload; directly proportional to T(s, 1)
- CPU time (T(s, n)) : actual CPU time for execution
- I/O demand (d) : demand in moving the program, data, and results for a given run

Page 59: Performance and Scalability Class Ppt

Interpreting Scalability Function

[Figure: memory needed per processor vs. number of processors, for memory requirements growing as C, C log p, Cp, and Cp log p, compared against the per-processor memory size; requirements that stay within the memory bound can maintain efficiency, those that grow beyond it cannot.]

Page 60: Performance and Scalability Class Ppt

Scalability Metrics

- Memory capacity (m) : maximum number of memory words demanded
- Communication overhead (h(s, n)) : amount of time for interprocessor communication, synchronization, etc.
- Computer cost (c) : total cost of hardware and software resources required
- Programming overhead (p) : development overhead associated with an application program

Page 61: Performance and Scalability Class Ppt

Speedup and Efficiency

The problem size s is the independent parameter.

S(s, n) = \frac{T(s, 1)}{T(s, n) + h(s, n)}

E(s, n) = \frac{S(s, n)}{n}

Page 62: Performance and Scalability Class Ppt

Scalable Systems

Ideally, if E(s, n) = 1 for all algorithms and any s and n, the system is scalable. Practically, we consider the scalability of a machine relative to an ideal machine:

\Phi(s, n) = \frac{S(s, n)}{S_I(s, n)} = \frac{T_I(s, n)}{T(s, n)}

where S_I(s, n) and T_I(s, n) denote the speedup and execution time on the ideal machine.

Page 63: Performance and Scalability Class Ppt

Summary (2)

Some factors preventing linear speedup:
- Serial operations
- Communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations