lecture4 performance evaluation 2011 (2)

8/3/2019 Lecture4 Performance Evaluation 2011 (2)


ELEC2300 Computer Organization

Lecture 4: Performance Evaluation

Professor George Yuan

Office: Rm. 2527
Email: [email protected]

Note: some of the slides are adapted from Computer Organization and Design, Copyright 1998 Morgan Kaufmann Publishers, and from the notes of Prof. Patterson's CS152 class, Copyright 1997 UCB.



ELEC152 Computer Organization Fall 2011 Page 2

OUTLINE

What is computer performance?

How do we evaluate performance?




Which of these airplanes has the best performance?

Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100             101          630           598
Boeing 747                 470         4150           610
BAC/Sud Concorde           132         4000          1350
Douglas DC-8-50            146         8720           544

Time to perform the task (execution time): execution time, response time, latency

Tasks per day, hour, week, sec, ns, ...: throughput, bandwidth

Latency and throughput are often in opposition

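4 types of airplanes fly between Hong Kong & Shanghai (distance: D mi.). With speed S and passenger capacity C, the latency L and throughput T are

\[ L = \frac{D}{S}, \qquad T = C \times \frac{1}{L} = \frac{C \times S}{D} \]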




Example

Execution time of Concorde vs. 747:
  Concorde is 1350 mph / 610 mph = 2.2 times faster

Throughput of Concorde vs. 747:
  Boeing 747 is 286700 pmph / 178200 pmph = 1.6 times faster
  (470 * 610 = 286700, 132 * 1350 = 178200; pmph = passenger-miles per hour)

Conclusions:
  Concorde is 2.2 times faster in terms of flying time.
  747 is 1.6 times faster in terms of throughput.
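The comparison above can be reproduced with a short Python sketch. The names `flying_time_hours` and `passenger_mph` and the 1000-mile distance are illustrative assumptions, not part of the slides; the ratios do not depend on the distance chosen.

```python
# (passengers, speed in mph) taken from the airplane table.
PLANES = {
    "Boeing 747": (470, 610),
    "BAC/Sud Concorde": (132, 1350),
}

DISTANCE_MI = 1000  # placeholder distance (assumption); it cancels in the ratio


def flying_time_hours(speed_mph, distance_mi):
    """Latency of one trip: distance divided by speed."""
    return distance_mi / speed_mph


def passenger_mph(passengers, speed_mph):
    """Throughput in passenger-miles per hour: passengers times speed."""
    return passengers * speed_mph


p747, s747 = PLANES["Boeing 747"]
pcon, scon = PLANES["BAC/Sud Concorde"]

# How many times faster the Concorde completes one trip.
latency_ratio = flying_time_hours(s747, DISTANCE_MI) / flying_time_hours(scon, DISTANCE_MI)
# How many times more passenger-miles per hour the 747 delivers.
throughput_ratio = passenger_mph(p747, s747) / passenger_mph(pcon, scon)

print(f"Concorde is {latency_ratio:.1f} times faster in flying time")
print(f"747 is {throughput_ratio:.1f} times faster in throughput")
```

Which plane "wins" depends on the metric: the same data gives opposite rankings for latency and throughput.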




Execution Time vs. Throughput

Execution time:
  How long does it take for my job to run?
  How long does it take to execute a job?
  How long must I wait for the database query?

Throughput:
  How many tasks can the machine run at once?
  What is the average execution rate?
  How much work is getting done?

Computer upgrade:
  1. P3 -> P4
  2. 1 P3 -> 2 P3

We will focus primarily on execution time for a single job.



Definitions

For computer study,

\[ \text{performance}_X = \frac{1}{\text{execution time}_X} \]

"X is n times faster than Y" means

\[ \frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X} = n \]

Problem:
  machine A runs a program in 20 seconds (1 program / 20 sec)
  machine B runs the same program in 25 seconds (1 program / 25 sec)
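As a quick check of the definition, here is a minimal Python sketch applied to the machine A / machine B problem. The function names `performance` and `times_faster` are hypothetical helpers chosen for illustration.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """Return n such that X is n times faster than Y."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A: 20 seconds, machine B: 25 seconds for the same program.
n = times_faster(20.0, 25.0)
print(f"A is {n:.2f} times faster than B")
```

Under these definitions machine A is 25 / 20 = 1.25 times faster than machine B.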



Execution Time

Elapsed time or response time
  counts everything (disk and memory accesses, I/O, etc.)
  a useful number, but often not good for comparison purposes

CPU time
  does not count I/O or time spent running other programs
  can be broken up into system time and user time

Our focus: user CPU time
  time spent executing the lines of code that are "in" our program

System CPU time: time the CPU spends executing system (kernel) code in order to run your program, such as reading files, moving information into and out of virtual memory, etc.

\[ \text{performance}_X = \frac{1}{\text{user CPU time}_X} \]
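The difference between elapsed time and CPU time can be observed directly. The sketch below is one possible way to do it in Python, using the standard-library timers `time.perf_counter` (wall-clock time) and `time.process_time` (user plus system CPU time of the current process); the helper names `measure`, `busy_work` and `waiting` are made up for this illustration.

```python
import time


def measure(fn):
    """Run fn once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def busy_work():
    """CPU-bound job: elapsed time and CPU time should be close."""
    return sum(i * i for i in range(200_000))


def waiting():
    """Stands in for I/O: the process sleeps, so little CPU time is used."""
    time.sleep(0.1)


for job in (busy_work, waiting):
    elapsed, cpu = measure(job)
    print(f"{job.__name__}: elapsed={elapsed:.3f}s cpu={cpu:.3f}s")
```

For the sleeping job the elapsed time is at least 0.1 s while the CPU time stays near zero, which is why elapsed time alone is a poor basis for comparing processors.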



CPU Time Measurement: Clock Cycles

Instead of reporting execution time in seconds, we often use cycles:

\[ \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \]

The processor runs machine instructions based on the clock
  clock cycle time = seconds per cycle
  clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

A 200 MHz clock has a cycle time of 1 / (200 * 10^6) s = 5 ns
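The relation between clock rate, cycle time and CPU time can be sketched in a few lines of Python. The helper names are hypothetical, and the 10^9-cycle program is an arbitrary value chosen only to exercise the formula.

```python
import math


def cycle_time_ns(clock_rate_hz):
    """Clock cycle time in nanoseconds for a clock rate given in Hz."""
    return 1e9 / clock_rate_hz


def cpu_time_s(clock_cycles, clock_rate_hz):
    """seconds/program = cycles/program * seconds/cycle."""
    return clock_cycles / clock_rate_hz


# A 200 MHz clock ticks every 5 ns.
print(f"cycle time at 200 MHz: {cycle_time_ns(200e6):.1f} ns")

# Arbitrary example (assumption): a program needing 1e9 cycles on that clock.
print(f"CPU time: {cpu_time_s(1e9, 200e6):.1f} s")
```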



Relating the Metrics

CPU time for a program
  CPU time = CPU clock cycles * clock cycle time
           = CPU clock cycles / clock rate

Common ways to improve performance (i.e. shorten CPU execution time):
  Reduce the number of CPU clock cycles required for a program
  Shorten the clock cycle time (i.e. increase the clock rate)



Example - Problem

Description:
  A program takes 10 seconds to run on a 400 MHz machine (computer A). We want to design a faster machine (computer B) that can run the same program in 6 seconds.
  The increase in clock rate affects the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the program.

Problem to solve:
  What clock rate should machine B have?



Example - Answer
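The slide's own answer is not preserved in this copy. One way to work the problem, sketched in Python under the relation CPU time = CPU clock cycles / clock rate, is shown below; the function name `required_clock_rate_hz` is a hypothetical helper.

```python
import math


def required_clock_rate_hz(time_a_s, rate_a_hz, cycle_factor, target_time_s):
    """Clock rate machine B needs to finish in target_time_s."""
    cycles_a = time_a_s * rate_a_hz       # CPU clock cycles used on machine A
    cycles_b = cycle_factor * cycles_a    # machine B needs 1.2 times as many
    return cycles_b / target_time_s       # clock rate = cycles / CPU time


rate_b = required_clock_rate_hz(
    time_a_s=10, rate_a_hz=400e6, cycle_factor=1.2, target_time_s=6
)
print(f"machine B clock rate: {rate_b / 1e6:.0f} MHz")
```

Machine A uses 10 s * 400 MHz = 4 * 10^9 cycles, so machine B needs 4.8 * 10^9 cycles in 6 s, i.e. a clock rate of 800 MHz, twice that of machine A.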




Cycle Number Calculation

CPU time for a program
  CPU time = CPU clock cycles * clock cycle time
           = CPU clock cycles / clock rate

[Diagram: program -> compiler -> assembly program -> assembler -> machine instructions -> processor. The ISA and the compiler determine the instruction count (Instruction #); the processor determines the clock cycles per instruction (CPI).]

Cycle # = Instruction # * CPI




Cycles Per Instruction

Wrong assumption:
  # of CPU clock cycles in a program = # of instructions in the program

Actual situation:
  For some processors, some instructions may take more cycles than others, e.g.
    Multiplication takes more cycles than addition
    Floating-point operations take more cycles than integer operations
    Memory access takes more cycles than accessing registers

Conclusion: not all instructions require the same # of cycles to execute.

Cycles per instruction (CPI): the average number of clock cycles that each instruction in a program takes to execute.




Cycles Per Instruction (CPI)

Definition (for a given program):

CPI = (CPU clock cycles)/(instruction count)

A program has the same instruction count on two different implementations of the same instruction set architecture, but it may have different CPIs (because an instruction may require different numbers of clock cycles on different implementations).

If the number of clock cycles for a program is known, knowing either the instruction count or the CPI can determine the other.

CPI provides a measure for comparing implementations.

Instruction count can be measured using software tools or simulators.




Cycles Per Instruction

Let there be n different instruction classes (with different CPIs). For a given program, suppose we know:

  CPI_i = CPI for instruction class i
  C_i = # of instructions of class i

CPU clock cycles = CPI * instruction count. It can be generalized to

\[ \text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i) \quad \text{and} \quad \text{CPI} = \frac{\sum_{i=1}^{n} (\text{CPI}_i \times C_i)}{\sum_{i=1}^{n} C_i} \]




CPI Example

Suppose we have two implementations of the same instruction set architecture (ISA).

For some program, machine A has a clock cycle time of 1 ns (1 GHz) and a CPI of 2.0. Machine B has a clock cycle time of 2 ns (500 MHz) and a CPI of 1.2. Which machine is faster for this program, and by how much?

If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?




Example - Solution
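The worked solution is not preserved in this copy. A minimal Python sketch of one way to solve the CPI example follows, using CPU time = instruction count * CPI * clock cycle time. The instruction count of one million is an arbitrary assumption; because both machines implement the same ISA and run the same program, it is identical on both and cancels in the ratio.

```python
import math


def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


INSTRUCTIONS = 1_000_000  # arbitrary (assumption); same program on both machines

time_a = cpu_time_ns(INSTRUCTIONS, cpi=2.0, cycle_time_ns=1.0)
time_b = cpu_time_ns(INSTRUCTIONS, cpi=1.2, cycle_time_ns=2.0)

# performance_A / performance_B = time_B / time_A
ratio = time_b / time_a
print(f"machine A is {ratio:.1f} times faster than machine B")
```

Machine A takes 2.0 ns per instruction on average and machine B takes 2.4 ns, so A is 1.2 times faster even though its CPI is higher.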




Relating the metrics

For a given program X running on a machine A

\[ \text{Time} = \frac{\text{seconds}}{\text{program}} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of clocks}}{\text{\# of instructions}} \times \frac{\text{seconds}}{\text{clock}} \]

Time = instruction count * CPI * clock cycle time
     = instruction count * CPI / clock rate

The only complete and reliable measure is CPU execution time.

Other measures are unreliable. E.g. changing the instruction set to lower the instruction count may lead to a larger CPI or an organization with a slower clock rate. Either case can offset the improvement in instruction count.




Example: Comparing Code Segments

Description

A particular machine has the following hardware facts:

Instruction class   CPI for this instruction class
A                   1
B                   2
C                   3

For a given C++ statement, a compiler designer considers two code sequences with the following instruction counts:

                Instruction counts for instruction classes
Code sequence   A   B   C
1               2   1   2
2               4   1   1

Problem to solve

Which code sequence executes the most instructions? Which is faster? What is the CPI for each sequence?




Example - Answer
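The slide's answer is not preserved in this copy. The sketch below is one way to compute it in Python from the per-class CPIs and instruction counts given in the problem; `summarize` is a hypothetical helper name.

```python
# CPI for each instruction class, from the problem statement.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two candidate code sequences.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def summarize(counts, cpi_by_class):
    """Return (instruction count, clock cycles, average CPI) for a sequence."""
    instructions = sum(counts.values())
    cycles = sum(cpi_by_class[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


for seq_id, counts in SEQUENCES.items():
    instructions, cycles, cpi = summarize(counts, CPI_BY_CLASS)
    print(f"sequence {seq_id}: {instructions} instructions, "
          f"{cycles} cycles, CPI = {cpi}")
```

Sequence 2 executes more instructions (6 vs. 5) but needs fewer cycles (9 vs. 10), so it is faster; the CPIs are 2.0 for sequence 1 and 1.5 for sequence 2.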




A misleading measure - MIPS

There are some performance measures that are famous among computer manufacturers and sellers but are misleading!

MIPS (million instructions per second)
(meaningless indication of processor speed)

MIPS = (instruction count)/(execution time * 10^6)

MIPS depends on
  Instruction set (instructions have different capabilities)
  Program

MIPS can vary inversely with performance

Peak performance




Some Processors in MIPS

Processor                             IPS                      Year
Motorola 68000                        1 MIPS @ 8 MHz           1979
Intel 386DX                           8.5 MIPS @ 25 MHz        1988
Intel 486DX                           54 MIPS @ 66 MHz         1992
PowerPC G2                            35 MIPS @ 33 MHz         1994
Intel Pentium Pro                     541 MIPS @ 200 MHz       1996
ARM 7500FE                            35.9 MIPS @ 40 MHz       1996
PowerPC G3                            525 MIPS @ 233 MHz       1997
Zilog eZ80                            80 MIPS @ 50 MHz         1999
Intel Pentium III                     1354 MIPS @ 500 MHz      1999
AMD Athlon                            3561 MIPS @ 1.2 GHz      2000
Pentium 4                             9726 MIPS @ 3.2 GHz      2003
ARM Cortex A8                         2000 MIPS @ 1.0 GHz      2005
Xbox360 IBM Xenon Triple Core         6400 MIPS @ 3.2 GHz      2005
AMD Athlon 64 3800+ X2 (Dual Core)    14564 MIPS @ 2.0 GHz     2005
Intel Core2 Extreme QX6700            57063 MIPS @ 3.33 GHz    2006




Another misleading measure - MFLOPS

MFLOPS (million floating-point operations per second):

MFLOPS = (# of floating-point operations)/(execution time * 10^6)

MFLOPS considers only floating-point operations (addition, subtraction, multiplication, or division operation applied to a number in a single or double precision floating-point representation).

MFLOPS depends on:
  Floating-point operation (e.g., addition and multiplication differ in complexity)
  Program

Meaningless if there is little or no floating-point arithmetic.




MIPS example

Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

What is the execution time for each sequence?

What is the MIPS rating of this processor for each of the two test sequences?
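The two questions can be answered with a short Python sketch that applies execution time = clock cycles / clock rate and MIPS = instruction count / (execution time * 10^6). The helper names `execution_time_s` and `mips` are hypothetical.

```python
import math

CLOCK_RATE_HZ = 100e6                      # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class produced by each compiler.
COMPILERS = {
    "compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def execution_time_s(counts):
    """Execution time = total clock cycles / clock rate."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(counts.values()) / (execution_time_s(counts) * 1e6)


for name, counts in COMPILERS.items():
    print(f"{name}: time = {execution_time_s(counts):.2f} s, "
          f"MIPS = {mips(counts):.0f}")
```

The first compiler's code runs in 0.10 s at 70 MIPS; the second compiler's code runs in 0.15 s at 80 MIPS. The code with the higher MIPS rating is the slower one, which illustrates how MIPS can vary inversely with performance.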




Summary

Some related terminology:
  clock, clock cycle, cycle
  clock cycle time, cycle time (seconds, us, ns)
  clock rate, cycle rate (Hz, MHz)
  CPI (cycles per instruction)
  MIPS (millions of instructions per second)

Performance is determined by the execution time

Execution time calculation:

Execution Time = instruction count * CPI * clock cycle time
               = instruction count * CPI / clock rate




OUTLINE

What is computer performance?

How do we evaluate performance?




Benchmarks

Execution time calculation:

Execution Time = instruction count * CPI * clock cycle time
               = instruction count * CPI / clock rate

Benchmark: a set of specially designed programs to test the performance of a computer

Performance is best determined by running a real application

Benchmarks are application specific
  CPU performance, graphics, high-performance computing, object-oriented computing, Java applications, client-server models, mail systems, file systems, Web servers.

SPEC (System Performance Evaluation Cooperative)
  companies have agreed on a set of real programs and inputs
  valuable indicator of computer performance

Processor (ISA implementation) + compiler




SPEC 89

Compiler enhancements and performance

[Bar chart: SPEC performance ratio (y-axis, 0 to 800) by benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing the compiler with the enhanced compiler.]




SPEC CPU2000

SPEC ratio
  Reference: Sun Ultra 5_10 with a 300 MHz processor

CINT2000, CFP2000

Geometric mean of SPEC ratios
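A rough Python sketch of how per-benchmark ratios can be summarized with a geometric mean is given below. It assumes that a SPEC ratio is the reference machine's run time divided by the measured machine's run time (published scores may apply an additional scale factor), and the run times in the example are invented purely for illustration, not real SPEC results.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Ratio of reference run time to measured run time (assumed definition)."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference, measured) run times in seconds; not real data.
RUNS = [(1400.0, 700.0), (1800.0, 225.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in RUNS]
print(f"ratios: {ratios}")
print(f"geometric mean: {geometric_mean(ratios):.2f}")
```

One property of the geometric mean is that the relative ranking of two machines does not depend on which machine is chosen as the reference.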




SPEC CPU2000 Benchmarks




SPEC CPU2000 ratings




Amdahl's Law

Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Example:
  "Suppose a program runs in 100 seconds on a machine, with multiplication responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

  How about making the program 5 times faster?

Principle: Make the common case fast
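The multiplication example can be worked by solving the formula for the amount of improvement. The Python sketch below is one way to do that; `time_after_improvement` and `required_improvement` are hypothetical helper names, and returning `None` for an unreachable target is a design choice of this sketch.

```python
def time_after_improvement(unaffected_s, affected_s, improvement):
    """Execution time after improvement = unaffected + affected / improvement."""
    return unaffected_s + affected_s / improvement


def required_improvement(total_s, affected_s, target_speedup):
    """Improvement factor needed for the affected part, or None if impossible."""
    unaffected_s = total_s - affected_s
    target_s = total_s / target_speedup
    remaining_s = target_s - unaffected_s   # time budget left for the affected part
    if remaining_s <= 0:
        return None                         # unaffected part alone exceeds the budget
    return affected_s / remaining_s


# 100 s program, 80 s of it spent in multiplication.
print(required_improvement(100, 80, 4))
print(required_improvement(100, 80, 5))
```

To run 4 times faster (25 s), multiplication must become 16 times faster: 25 = 20 + 80/16. Running 5 times faster (20 s) is impossible, because the 20 s that multiplication does not affect already uses the whole budget.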




Example

Suppose we enhance a machine, making all floating-point instructions five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
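Both questions follow from Amdahl's Law written in terms of the fraction f of execution time that is enhanced and the improvement factor k: speedup = 1 / ((1 - f) + f / k). The Python sketch below is an illustration under that formulation; the function names are hypothetical.

```python
import math


def speedup(fraction_enhanced, improvement):
    """Overall speedup when a fraction of the time is made `improvement` times faster."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / improvement)


def required_fraction(target_speedup, improvement):
    """Solve 1 / ((1 - f) + f / k) = s for the enhanced fraction f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / improvement)


# Half of a 10 s benchmark is floating point, made 5 times faster.
s = speedup(0.5, 5)
print(f"speedup: {s:.2f}")

# Fraction of a 100 s benchmark that must be floating point for a speedup of 3.
f = required_fraction(3, 5)
print(f"floating-point time needed: {100 * f:.1f} s of 100 s")
```

In the first case the new time is 5 + 5/5 = 6 s, a speedup of 10/6, about 1.67. In the second, floating-point instructions must account for about 83.3 of the 100 seconds.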



Remember

Performance is specific to a particular program
  Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
  increases in clock rate (without adverse CPI effects)
  improvements in processor organization that lower CPI
  compiler enhancements that lower CPI and/or instruction count

Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance proportionally

You should not always believe everything you read! Read carefully!
