Performance and Scalability Class Ppt

Principles of Scalable Performance


Page 1: Performance and Scalability Class Ppt

Principles of Scalable Performance

Page 2: Performance and Scalability Class Ppt

EENG-630 Chapter 3

Principles of Scalable Performance

- Performance measures
- Speedup laws
- Scalability principles
- Scaling up vs. scaling down

Page 3: Performance and Scalability Class Ppt

Performance metrics and measures

- Parallelism profiles
- Asymptotic speedup factor
- System efficiency, utilization, and quality
- Standard performance measures

Page 4: Performance and Scalability Class Ppt

Degree of parallelism

- Reflects the matching of software and hardware parallelism
- Discrete time function: measures, for each time period, the number of processors used
- Parallelism profile is a plot of the DOP as a function of time
- Ideally, unlimited resources are assumed

Page 5: Performance and Scalability Class Ppt

Degree of Parallelism

The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.

DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.

A plot of DOP vs. time is called a parallelism profile.

Page 6: Performance and Scalability Class Ppt

Factors affecting parallelism profiles

- Algorithm structure
- Program optimization
- Resource utilization
- Run-time conditions
- Realistically limited by the number of available processors, memory, and other non-processor resources

Page 7: Performance and Scalability Class Ppt

Average Parallelism - 1

Assume the following:

- n homogeneous processors
- maximum parallelism in a profile is m
- Ideally, n >> m
- Δ, the computing capacity of a single processor, is something like MIPS or Mflops without regard for memory latency, etc.
- i is the number of processors busy in an observation period (e.g., DOP = i)
- W is the total work (instructions or computations) performed by a program
- A is the average parallelism in the program

Page 8: Performance and Scalability Class Ppt

Average Parallelism - 2

W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t) \, dt

W = \Delta \sum_{i=1}^{m} i \cdot t_i

where t_i = total time that DOP = i, and \sum_{i=1}^{m} t_i = t_2 - t_1.

Total amount of work performed is proportional to the area under the profile curve.

Page 9: Performance and Scalability Class Ppt

Average Parallelism - 3

A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t) \, dt

A = \left( \sum_{i=1}^{m} i \cdot t_i \right) \bigg/ \left( \sum_{i=1}^{m} t_i \right)
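To make the definitions above concrete, here is a minimal Python sketch (not from the slides) that computes the total work W and average parallelism A from a discrete parallelism profile; the profile values and Δ = 1 are invented for illustration.

```python
# Minimal sketch: total work W and average parallelism A from a
# discrete parallelism profile. Profile values are invented; Delta = 1.

# profile[i] = t_i, the total time during which DOP == i
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}  # hypothetical t_i values
delta = 1.0  # computing capacity of a single processor

# W = Delta * sum_i (i * t_i)
W = delta * sum(i * t_i for i, t_i in profile.items())

# A = sum_i (i * t_i) / sum_i (t_i)
A = sum(i * t_i for i, t_i in profile.items()) / sum(profile.values())

print(W, A)  # 26.0 2.6
```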

Page 10: Performance and Scalability Class Ppt

Example: parallelism profile and average parallelism

Page 11: Performance and Scalability Class Ppt

Available Parallelism

Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g., hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g., 10 or 20).

Page 12: Performance and Scalability Class Ppt

Basic Blocks

A basic block is a sequence or block of instructions with one entry and one exit. Basic blocks are frequently used as the focus of optimizers in compilers (since it is easier to manage the use of registers utilized in the block). Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).

Page 13: Performance and Scalability Class Ppt

Asymptotic Speedup - 1

W_i = \Delta \, i \, t_i \quad (work done when DOP = i)

W = \sum_{i=1}^{m} W_i \quad (relates sum of W_i terms to W)

t_i(k) = \frac{W_i}{k \Delta} \quad (execution time of W_i with k processors)

t_i(\infty) = \frac{W_i}{i \Delta} \quad (for 1 \le i \le m)

Page 14: Performance and Scalability Class Ppt

Asymptotic Speedup - 2

T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta} \quad (response time with 1 processor)

T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i \Delta} \quad (response time with \infty processors)

S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i} = A \quad (in the ideal case)
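Continuing the same invented profile, a short sketch confirming that the asymptotic speedup equals the average parallelism:

```python
# Sketch (invented profile): asymptotic speedup S_inf = T(1) / T(inf).
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}  # hypothetical t_i values
delta = 1.0

W = {i: delta * i * t_i for i, t_i in profile.items()}  # W_i per DOP level

T1 = sum(W_i / delta for W_i in W.values())             # T(1)
T_inf = sum(W_i / (i * delta) for i, W_i in W.items())  # T(infinity)

print(T1 / T_inf)  # 2.6, identical to the average parallelism A
```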

Page 15: Performance and Scalability Class Ppt

Performance measures

Consider n processors executing m programs in various modes. We want to define the mean performance of these multimode computers:

- Arithmetic mean performance
- Geometric mean performance
- Harmonic mean performance

Page 16: Performance and Scalability Class Ppt

Mean Performance Calculation

We seek to obtain a measure that characterizes the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g., scalar, vector, sequential, parallel). We may also wish to associate weights with these programs to emphasize these different modes and yield a more meaningful performance measure.

Page 17: Performance and Scalability Class Ppt

Arithmetic mean performance

R_a = \frac{1}{m} \sum_{i=1}^{m} R_i \quad (arithmetic mean execution rate; assumes equal weighting)

R_a^* = \sum_{i=1}^{m} f_i R_i \quad (weighted arithmetic mean execution rate)

- proportional to the sum of the inverses of execution times

Page 18: Performance and Scalability Class Ppt

Geometric mean performance

R_g = \prod_{i=1}^{m} R_i^{1/m} \quad (geometric mean execution rate)

R_g^* = \prod_{i=1}^{m} R_i^{f_i} \quad (weighted geometric mean execution rate)

- does not summarize the real performance since it does not have the inverse relation with the total time

Page 19: Performance and Scalability Class Ppt

Harmonic mean performance

T_i = 1/R_i \quad (mean execution time per instruction for program i)

T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i} \quad (arithmetic mean execution time per instruction)

Page 20: Performance and Scalability Class Ppt

Harmonic mean performance

R_h = \frac{1}{T_a} = \frac{m}{\sum_{i=1}^{m} (1/R_i)} \quad (harmonic mean execution rate)

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)} \quad (weighted harmonic mean execution rate)

- corresponds to the total number of operations divided by the total time (closest to the real performance)

Page 21: Performance and Scalability Class Ppt

Geometric Mean

A geometric mean of n terms is the nth root of the product of the n terms. Like the arithmetic mean, the geometric mean of a set of execution rates does not have an inverse relationship with the total execution time of the programs. (Geometric mean has been advocated for use with normalized performance numbers for comparison with a reference machine.)

Page 22: Performance and Scalability Class Ppt

Harmonic Mean

Instead of using arithmetic or geometric mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing the inverse relation not exhibited by the other means).

R_h = \frac{m}{\sum_{i=1}^{m} (1/R_i)}

Page 23: Performance and Scalability Class Ppt

Weighted Harmonic Mean

If we associate weights fi with the benchmarks, then we can compute the weighted harmonic mean:

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)}
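To tie the three means together, here is a small sketch; the rates R_i and weights f_i are invented for illustration.

```python
import math

# Sketch: mean execution rates for hypothetical per-program rates R_i
# (e.g., Mflops) and weights f_i that sum to 1.
R = [10.0, 100.0, 1000.0]
f = [0.5, 0.3, 0.2]

R_a = sum(R) / len(R)                                # arithmetic mean rate
R_g = math.prod(R) ** (1 / len(R))                   # geometric mean rate
R_h = len(R) / sum(1 / r for r in R)                 # harmonic mean rate
R_h_star = 1 / sum(fi / ri for fi, ri in zip(f, R))  # weighted harmonic

print(R_a, R_g, R_h, R_h_star)
# Only the harmonic means preserve the inverse relation with total time.
```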

Page 24: Performance and Scalability Class Ppt

Weighted Harmonic Mean Speedup

T_1 = 1/R_1 = 1 is the sequential execution time on a single processor with rate R_1 = 1.

T_i = 1/R_i = 1/i is the execution time using i processors with a combined execution rate of R_i = i.

Now suppose a program has n execution modes with associated weights f_1 … f_n. The weighted harmonic mean speedup is defined as:

S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}

where T^* = 1/R_h^* is the weighted arithmetic mean execution time.

Page 25: Performance and Scalability Class Ppt

Harmonic Mean Speedup Performance

Page 26: Performance and Scalability Class Ppt

Amdahl’s Law

Assume R_i = i, and that the weights are w = (α, 0, …, 0, 1 − α).

Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 − α). This yields the speedup equation known as Amdahl’s law:

S_n = \frac{n}{1 + (n - 1)\alpha}

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
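In code, Amdahl’s law is a one-liner; this sketch also evaluates the two examples that follow.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's law: S_n = n / (1 + (n - 1) * alpha), where alpha is the
    sequential fraction; equivalently 1 / (alpha + (1 - alpha) / n)."""
    return n / (1 + (n - 1) * alpha)

print(amdahl_speedup(0.05, 8))      # ~5.9 (Example 1 below)
print(amdahl_speedup(0.05, 10**6))  # ~20, approaching 1/alpha (Example 2)
```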

Page 27: Performance and Scalability Class Ppt

Illustration of Amdahl Effect

[Figure: speedup vs. number of processors for problem sizes n = 100, n = 1,000, and n = 10,000, illustrating the Amdahl effect.]

Page 28: Performance and Scalability Class Ppt

Example 1

95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

S \le \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9

Page 29: Performance and Scalability Class Ppt

System Efficiency – 1

Assume the following definitions:

- O(n) = total number of “unit operations” performed by an n-processor system in completing a program P.
- T(n) = execution time required to execute the program P on an n-processor system.

O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor. If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).

Page 30: Performance and Scalability Class Ppt

Example 2

5% of a parallel program’s execution time is spent within inherently sequential code. The maximum speedup achievable by this program, regardless of how many PEs are used, is

\lim_{p \to \infty} \frac{1}{0.05 + (1 - 0.05)/p} = \frac{1}{0.05} = 20

Page 31: Performance and Scalability Class Ppt

Pop Quiz

An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?

Answer: 1/(0.2 + (1 − 0.2)/8) ≈ 3.3

Page 32: Performance and Scalability Class Ppt

System Efficiency – 2

Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as

S(n) = T(1) / T(n)

Recall that we expect T(n) < T(1), so S(n) ≥ 1. System efficiency is defined as

E(n) = S(n) / n = T(1) / (n · T(n))

It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

Page 33: Performance and Scalability Class Ppt

Redundancy

The redundancy in a parallel computation is defined as

R(n) = O(n) / O(1)

What values can R(n) obtain?

- R(n) = 1 when O(n) = O(1), or when the number of operations performed is independent of the number of processors, n. This is the ideal case.
- R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without having extra operations performed.

Page 34: Performance and Scalability Class Ppt

System Utilization

System utilization is defined as

U(n) = R(n) · E(n) = O(n) / (n · T(n))

It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n.

1/n ≤ E(n) ≤ U(n) ≤ 1
1 ≤ R(n) ≤ 1/E(n) ≤ n

Page 35: Performance and Scalability Class Ppt

Quality of Parallelism

The quality of a parallel computation is defined as

Q(n) = \frac{S(n) \, E(n)}{R(n)} = \frac{T^3(1)}{n \, T^2(n) \, O(n)}

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R). The quality measure is bounded by the speedup (that is, Q(n) ≤ S(n)).
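A sketch tying the five metrics together from hypothetical measurements, assuming O(1) = T(1) as defined earlier:

```python
# Sketch: S, E, R, U, Q from hypothetical measured values.
def parallel_metrics(T1: float, Tn: float, O1: float, On: float, n: int):
    S = T1 / Tn    # speedup      S(n) = T(1) / T(n)
    E = S / n      # efficiency   E(n) = S(n) / n
    R = On / O1    # redundancy   R(n) = O(n) / O(1)
    U = R * E      # utilization  U(n) = R(n) * E(n)
    Q = S * E / R  # quality      Q(n) = S(n) * E(n) / R(n)
    return S, E, R, U, Q

# Invented numbers: 4 processors, 20% extra operations in the parallel run.
print(parallel_metrics(T1=100.0, Tn=30.0, O1=100.0, On=120.0, n=4))
```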

Page 36: Performance and Scalability Class Ppt

Standard Industry Performance Measures

MIPS and Mflops, while easily understood, are poor measures of system performance, since their interpretation depends on machine clock cycles and instruction sets. For example, which of these machines is faster?

- a 10 MIPS CISC computer
- a 20 MIPS RISC computer

It is impossible to tell without knowing more details about the instruction sets on the machines. Even the question, “which machine is faster,” is suspect, since we really need to say “faster at doing what?”

Page 37: Performance and Scalability Class Ppt

Doing What?

To answer the “doing what?” question, several standard programs are frequently used.

The Dhrystone benchmark uses no floating point instructions, system calls, or library functions. It uses exclusively integer data items. Each execution of the entire set of high-level language statements is a Dhrystone, and a machine is rated as having a performance of some number of Dhrystones per second (sometimes reported as KDhrystones/sec).

The Whetstone benchmark uses a more complex program involving floating point and integer data, arrays, subroutines with parameters, conditional branching, and library functions. It does not, however, contain any obviously vectorizable code.

The performance of a machine on these benchmarks depends in large measure on the compiler used to generate the machine language. [Some companies have, in the past, actually “tweaked” their compilers to specifically deal with the benchmark programs!]

Page 38: Performance and Scalability Class Ppt

What’s VAX Got To Do With It?

The Digital Equipment VAX-11/780 computer has for many years been commonly agreed to be a 1-MIPS machine (whatever that means). Since the VAX-11/780 also has a rating of about 1.7 KDhrystones, this gives a method whereby a relative MIPS rating for any other machine can be derived: just run the Dhrystone benchmark on the other machine, divide by 1.7K, and you then obtain the relative MIPS rating for that machine (sometimes also called VUPs, or VAX units of performance).

Page 39: Performance and Scalability Class Ppt

Other Measures

Transactions per second (TPS) is a measure that is appropriate for online systems like those used to support ATMs, reservation systems, and point-of-sale terminals. The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

KLIPS is the measure of the number of logical inferences per second that can be performed by a system, presumably to relate how well that system will perform at certain AI applications. Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.

Page 40: Performance and Scalability Class Ppt

Parallel Processing Applications

- Drug design
- High-speed civil transport
- Ocean modeling
- Ozone depletion research
- Air pollution
- Digital anatomy

Page 41: Performance and Scalability Class Ppt

Application Models for Parallel Computers

- Fixed-load model: constant workload
- Fixed-time model: demands constant program execution time
- Fixed-memory model: limited by the memory bound

Page 42: Performance and Scalability Class Ppt

Algorithm Characteristics

- Deterministic vs. nondeterministic
- Computational granularity
- Parallelism profile
- Communication patterns and synchronization requirements
- Uniformity of operations
- Memory requirement and data structures

Page 43: Performance and Scalability Class Ppt

Isoefficiency Concept

Relates the workload to the machine size n needed to maintain a fixed efficiency:

E = \frac{w(s)}{w(s) + h(s, n)}

where w(s) is the workload and h(s, n) is the overhead.

The smaller the power of n in the resulting workload-growth function, the more scalable the system.

Page 44: Performance and Scalability Class Ppt

Isoefficiency Function

To maintain a constant E, w(s) should grow in proportion to h(s, n):

w(s) = \frac{E}{1 - E} \, h(s, n)

C = E/(1 − E) is constant for a fixed E, giving the isoefficiency function:

f_E(n) = C \cdot h(s, n)
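As an illustration, here is a sketch of the isoefficiency function for a hypothetical overhead h(s, n) = n log₂ n; the overhead form is an assumption, not taken from the slides.

```python
import math

def isoefficiency_workload(E: float, n: int) -> float:
    """Workload w(s) needed to hold efficiency at E on n processors,
    assuming a hypothetical overhead h(s, n) = n * log2(n)."""
    C = E / (1 - E)  # constant for fixed E
    return C * n * math.log2(n)

for n in (2, 4, 8, 16):
    print(n, isoefficiency_workload(E=0.8, n=n))
# The workload must grow like n log n to keep E = 0.8 constant here.
```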

Page 45: Performance and Scalability Class Ppt

The Isoefficiency Metric (Terminology)

- Parallel system: a parallel program executing on a parallel computer
- Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Isoefficiency: a way to measure scalability

Page 46: Performance and Scalability Class Ppt

Notation Needed for the Isoefficiency Relation

- n : data size
- p : number of processors
- T(n, p) : execution time, using p processors
- ψ(n, p) : speedup
- σ(n) : inherently sequential computations
- φ(n) : potentially parallel computations
- κ(n, p) : communication operations
- ε(n, p) : efficiency

Note: At least in some printings of Quinn’s textbook, there appears to be a misprint on page 170, where one of these Greek symbols is sometimes replaced with another; substitute the intended symbol from the notation above.

Page 47: Performance and Scalability Class Ppt

Isoefficiency Concepts

T_0(n, p) is the total time spent by processes doing work not done by the sequential algorithm:

T_0(n, p) = (p - 1) \, \sigma(n) + p \, \kappa(n, p)

We want the algorithm to maintain a constant level of efficiency as the data size n increases; hence, ε(n, p) is required to be a constant. Recall that T(n, 1) represents the sequential execution time.

Page 48: Performance and Scalability Class Ppt

The Isoefficiency Relation

Suppose a parallel system exhibits efficiency ε(n, p). Define

T_0(n, p) = (p - 1) \, \sigma(n) + p \, \kappa(n, p)

C = \frac{\varepsilon(n, p)}{1 - \varepsilon(n, p)}

In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

T(n, 1) \ge C \, T_0(n, p)

Page 49: Performance and Scalability Class Ppt

Speedup Performance Laws

- Amdahl’s law: for fixed workload or fixed problem size
- Gustafson’s law: for scaled problems (problem size increases with increased machine size)
- Speedup model: for scaled problems bounded by memory capacity

Page 50: Performance and Scalability Class Ppt

Amdahl’s Law

- As the number of processors increases, the fixed load is distributed to more processors
- Minimal turnaround time is the primary goal
- Speedup factor is upper-bounded by a sequential bottleneck
- Two cases: DOP ≥ n and DOP < n

Page 51: Performance and Scalability Class Ppt

Fixed Load Speedup Factor

Case 1: DOP ≥ n. Case 2: DOP < n.

t_i(n) = \frac{W_i}{i \Delta} \left\lceil \frac{i}{n} \right\rceil

T(n) = \sum_{i=1}^{m} \frac{W_i}{i \Delta} \left\lceil \frac{i}{n} \right\rceil

S_n = \frac{T(1)}{T(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{n} \right\rceil}
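A sketch of the fixed-load speedup over an invented work profile (Δ cancels out of the ratio):

```python
import math

# Sketch: fixed-load speedup S_n = sum(W_i) / sum((W_i / i) * ceil(i / n)).
def fixed_load_speedup(W: dict, n: int) -> float:
    T1 = sum(W.values())
    Tn = sum((W_i / i) * math.ceil(i / n) for i, W_i in W.items())
    return T1 / Tn

W = {1: 4.0, 2: 6.0, 4: 8.0, 8: 8.0}  # invented W_i values per DOP level
for n in (1, 2, 4, 8):
    print(n, fixed_load_speedup(W, n))  # rises toward A = 2.6 as n grows
```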

Page 52: Performance and Scalability Class Ppt

Gustafson’s Law

- With Amdahl’s Law, the workload cannot scale to match the available computing power as n increases
- Gustafson’s Law fixes the time, allowing the problem size to increase with higher n
- Not saving time, but increasing accuracy

Page 53: Performance and Scalability Class Ppt

Fixed-time Speedup

As the machine size increases, we have an increased workload and a new profile. In general, W_i' > W_i for 2 ≤ i ≤ m' and W_1' = W_1.

Assume T(1) = T’(n)

Page 54: Performance and Scalability Class Ppt

Gustafson’s Scaled Speedup

The fixed-time condition (assuming T(1) = T'(n) and negligible overhead Q(n)) requires W_1' = W_1 and W_n' = n W_n, giving the scaled speedup:

S_n' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i} = \frac{W_1 + n W_n}{W_1 + W_n}
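A sketch of the two-mode case; the workload split is invented and the overhead Q(n) is ignored.

```python
def gustafson_speedup(W1: float, Wn: float, n: int) -> float:
    """Fixed-time scaled speedup S'_n = (W1 + n * Wn) / (W1 + Wn),
    ignoring overhead Q(n)."""
    return (W1 + n * Wn) / (W1 + Wn)

# With a 5% sequential component, speedup grows nearly linearly in n.
print(gustafson_speedup(W1=0.05, Wn=0.95, n=8))  # 7.65
```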

Page 55: Performance and Scalability Class Ppt

Memory Bounded Speedup Model

- Idea is to solve the largest problem, limited by memory space
- Results in a scaled workload and higher accuracy
- Each node can handle only a small subproblem for distributed memory
- Using a large number of nodes collectively increases the memory capacity proportionally

Page 56: Performance and Scalability Class Ppt

Fixed-Memory Speedup

Let M be the memory requirement per node and W the computational workload: W = g(M), or M = g^{-1}(W).

The scaled workload is W_n^* = g^*(nM) = G(n) \, g(M) = G(n) \, W_n, where the factor G(n) reflects the workload growth as the memory capacity increases n times.

Assuming W_1^* = W_1 and negligible overhead Q(n), the memory-bounded speedup is

S_n^* = \frac{W_1^* + W_n^*}{W_1^* + W_n^* / n} = \frac{W_1 + G(n) W_n}{W_1 + G(n) W_n / n}
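A sketch of the memory-bounded model; varying G(n) reproduces the cases related on the next slide. The two-mode workload values are invented.

```python
def memory_bounded_speedup(W1: float, Wn: float, n: int, G: float) -> float:
    """Memory-bounded speedup S*_n = (W1 + G*Wn) / (W1 + G*Wn/n),
    ignoring overhead."""
    return (W1 + G * Wn) / (W1 + G * Wn / n)

W1, Wn, n = 0.05, 0.95, 8  # invented two-mode workload
print(memory_bounded_speedup(W1, Wn, n, G=1))       # G(n) = 1: Amdahl (~5.9)
print(memory_bounded_speedup(W1, Wn, n, G=n))       # G(n) = n: Gustafson (7.65)
print(memory_bounded_speedup(W1, Wn, n, G=n**1.5))  # G(n) > n: higher still
```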

Page 57: Performance and Scalability Class Ppt

Relating Speedup Models

G(n) reflects the increase in workload as memory increases n times.

- G(n) = 1 : fixed problem size (Amdahl)
- G(n) = n : workload increases n times when memory is increased n times (Gustafson)
- G(n) > n : workload increases faster than the memory requirement

Page 58: Performance and Scalability Class Ppt

Scalability Metrics

- Machine size (n) : number of processors
- Clock rate (f) : determines the basic machine cycle
- Problem size (s) : amount of computational workload; directly proportional to T(s, 1)
- CPU time (T(s, n)) : actual CPU time for execution
- I/O demand (d) : demand in moving the program, data, and results for a given run

Page 59: Performance and Scalability Class Ppt

Interpreting Scalability Function

[Figure: memory needed per processor vs. number of processors, for memory requirements growing as C, C log p, Cp, and Cp log p, compared against the per-processor memory size; requirements that stay within the memory bound can maintain efficiency, those that grow beyond it cannot.]

Page 60: Performance and Scalability Class Ppt

Scalability Metrics

- Memory capacity (m) : maximum number of memory words demanded
- Communication overhead (h(s, n)) : amount of time for interprocessor communication, synchronization, etc.
- Computer cost (c) : total cost of hardware and software resources required
- Programming overhead (p) : development overhead associated with an application program

Page 61: Performance and Scalability Class Ppt

Speedup and Efficiency

The problem size s is the independent parameter.

S(s, n) = \frac{T(s, 1)}{T(s, n) + h(s, n)}

E(s, n) = \frac{S(s, n)}{n}

Page 62: Performance and Scalability Class Ppt

Scalable Systems

Ideally, if E(s, n) = 1 for all algorithms and any s and n, the system is scalable. Practically, we consider the scalability of a machine relative to an ideal machine:

\Phi(s, n) = \frac{S(s, n)}{S_I(s, n)} = \frac{T_I(s, n)}{T(s, n)}

where S_I(s, n) and T_I(s, n) denote the speedup and execution time on the ideal machine.

Page 63: Performance and Scalability Class Ppt

Summary (2)

Some factors preventing linear speedup:
- Serial operations
- Communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations